nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
107.32k stars 29.46k forks source link

streaming / iterative fs.readdir #583

Closed jonathanong closed 5 years ago

jonathanong commented 9 years ago

since we're in ES6 territory now, i'm thinking the sync version should be an iterable

var dirs = fs.readdirIter(__dirname);
for (dir of dirs) {

}

and have the async version be an object stream:

var stream = fs.readdirStream(__dirname);
stream.on('data', dir => )

See: https://github.com/joyent/node/issues/388

novacrazy commented 9 years ago

Using generators/iterators in io.js where they make sense would be a good addition.

cjb commented 9 years ago

Does this depend on the work in joyent/libuv#1574 to be merged first? I don't think it's in libuv yet.

vkurchatkin commented 9 years ago

:+1:

in the future asynchronous version could return an observable:

for (dir on fs.readdir(__dirname)) {

}
hemanth commented 9 years ago

:+1:

Qard commented 9 years ago

I'm all for the new named versions. We don't want to break compatibility with existing code by changing behaviour of existing functions, but new stuff could work. I'm not sure what the policy is yet on new interfaces that deviate from node.js though.

@iojs/tc Thoughts?

rvagg commented 9 years ago

Where else does an interface like this make sense? Would this be a one-off of would there be a flood of requests to add the same style interface to other core APIs?

jonathanong commented 9 years ago

@rvagg I looked for any other APIs that would look like this but I couldn't find any. I think this may be a one-off.

rvagg commented 9 years ago

I think the way forward on this is for someone here to propose a change in a PR and then we'll escalate that to a TC discussion because it would be good to cover the question of whether adopting a new style of API is desirable. If it's too much work to come up with an initial implementation then we could just elevate this issue, it'd just be more of a hypothetical discussion then.

Qard commented 9 years ago

I think the streams version makes sense. I would test the iterator design in userland though. It can apply to any sort of stream, really. And it's easy enough to abstract a stream as an iterator in userland.

Fishrock123 commented 9 years ago

The eventual observables thing sounds pretty rad.

jonathanong commented 9 years ago

@Qard how would you create an iterator from a stream? streams are async... iterators are not...

@cjb i don't think it's merged. i'm going to open an issue in libuv now

Qard commented 9 years ago

Just use co? (or something similar)

domenic commented 9 years ago

OK, a few things:

So I don't think this should be a new style of API. Just an object mode stream is fine.

timoxley commented 9 years ago

Whoa, yeah this is a totally bogus suggestion as iterators and generators can't be used for iterating over async. This tripped me up at first as well.

But note that though it's not exactly what you had in mind fs.readFileSync(file, 'utf8') is already iterable by way of a String being iterable, and soon fs.readFileSync(file) will too be iterable as Buffers implement the iterable interface: https://github.com/iojs/io.js/pull/525

domenic commented 9 years ago

So re-reading the OP: readdirSync already returns an array which is iterable

bnoordhuis commented 9 years ago

It does, but readdirSync() is (too) eager: it slurps the directory in one pass. If you apply it to a directory with 1M+ entries, it will likely bring down the process.

An iterator that walks the entries piecemeal would be great but it requires a number of changes to libuv. @misterdjules already did some of the prep work but I recall that there were still some loose ends. I'm not even sure if we reached consensus that his approach was really the best way forward.

vkurchatkin commented 9 years ago

as iterators and generators can't be used for iterating over async

they can!

for (let promise of iterator) {
  let val = await promise;
}
domenic commented 9 years ago

@bnoordhuis I see, so you'd block on each call to .next(), instead of all at once.

@vkurchatkin yes, you can invent contracts on top of iterators. Especially ones that only work in ES2016. But they won't necessarily be respected. Someone could "forget" an await, or a .then, or just call .next() twice in a row. Now what? How do you know if it's end-of-iterable or I/O error or ...?

domenic commented 9 years ago

@vkurchatkin essentially you are proposing async iterable as next() : -> Iteration<Promise<T>>, whereas a more coherent design is next(): -> Promise<Iteration<T>>.

For example in your example the producer can't decide when to return or throw from the generator side, because the producer has to wait for the async process to complete before doing so, but the only way to signal completion or errors is synchronous return/throw.

bnoordhuis commented 9 years ago

you'd block on each call to .next(), instead of all at once.

Yes, but I'm mostly concerned with memory usage; an iterator brings it down from O(n) to O(1). Currently, trying to scan a large directory exhausts the heap and terminates the process.

vkurchatkin commented 9 years ago

@domenic agreed, ugly and far from ideal.

a more coherent design is next(): -> Promise<Iteration>

I can't see how this can be achieved using existing ES2015/2016 primitives. Thanks for the link, probably it has explanation.

jonathanong commented 9 years ago

Yeah the iterative version in my head was a synchronous version that doesn't load the entire array in memory at once. But i care way more about the async version.

Qard commented 9 years ago

@domenic While generators on their own are sync, they can be used in an async way via co. Maybe you don't like the weird promise trampoline hack it uses, but it works and many people are already using it in that way. I suggested using co because I had already suggested doing the iterator thing in userland, which means those sort of tools are available. If you prefer some different async iterator module, you can write a streams abstraction on that.

I wouldn't mind this being in core, I just suggested a userland approach because I expected some members of the core team might disagree.

jonathanong commented 9 years ago

btw let's not do the iterator design. i realized that if something like this happens:

var i = 0;
for (name of fs.readdirIter(__dirname)) {
  if (i++ > 5) throw new Error('boom');
}

assuming fs.readdirIter() creates a file descriptor, the above may cause a file descriptor leak as the iterator may remain in a paused state. this is assuming fs.readdirIter() opens a file descriptor until it's complete. having users manually close the file descriptor may be too much to ask (not worth people complaining about leaks when they don't know about this).

so... let's just do the stream version for now :D

domenic commented 9 years ago

@jonathanong as long as the iterator implements iterator.return() that closes the FD, there will be no leak.

vkurchatkin commented 9 years ago

@domenic why isn't return here: https://people.mozilla.org/~jorendorff/es6-draft.html#sec-iterator-interface

domenic commented 9 years ago

@vkurchatkin same reason throw isn't.

vkurchatkin commented 9 years ago

not working:

var arr = [1,2,3,4,5];
var it = arr[Symbol.iterator]();

it.return = function() {
  cobsole.log('return');
}

for (var v of it) {
  if (v > 3) break;
  console.log(v)
}
domenic commented 9 years ago

Ah yeah, forgot it wasn't implemented in V8 yet (bug).

vkurchatkin commented 9 years ago

aha, here it is: https://people.mozilla.org/~jorendorff/es6-draft.html#sec-iteratorclose

Still don't understand why it is not a part of iterator interface.

jonathanong commented 9 years ago

Interesting. Good to know we have an es master here!

YurySolovyov commented 9 years ago

plain node stream style looks good too:

var stream = fs.readdirStream(__dirname);
stream.on('data', function(dir) { ... })

is it possible to create iterable version on top of it?

rvagg commented 9 years ago

+1 to an object stream for this, that could probably be shoehorned into an iterator (?)

calvinmetcalf commented 9 years ago

So backing up to the object mode stream for fs.readDir (e.g. fs.createReadDirStream ?) would it make sense for there to be a deep/recursive mode for it (e.g. similar to what I made with spider-stream) at the moment doing this is tricky as you either have to stat all the things in the directory or do what rimraf doesn't do which would be assume everything is a directory and on error know it's a file (what rimraf does do doesn't really apply here).

rvagg commented 9 years ago

Sounds like something for userland still. Perhaps made easier / safer by a streaming readdir

calvinmetcalf commented 9 years ago

Perhaps made easier / safer by a streaming readdir

that is what I was going for

Fishrock123 commented 9 years ago

Looks like there might be some more movement on this: https://github.com/libuv/libuv/pull/416

To my understanding, it will require a major bump of libuv, so this might take a while to actually come to io.js / node.

rsms commented 9 years ago

With libuv/libuv#416 we could simply expose an API like this:

let dir = await fs.opendir("foo"), entries;
do {
  entries = await dir.read(100)
  // deal with entries
} while (entries.length === 100);
// when dir is finalized it calls closedir

Or with simple callbacks, if promises are too crazy.

FTR, Go's readdir requires you to limit the number of entries read. By default it's 100.

Also it hasn't been discussed much here, but not limiting readdir is a security risk — allows virtually unlimited memory allocation initiated by user input. I.e. a file server implementation (or a web server, because everyone seem to like web servers) that allow uploading/adding files and listing files could be DoS'd by creating many files and list the entries.

domenic commented 9 years ago

when dir is finalized it calls closedir

Not this. JS doesn't have finalization. It has garbage collection, which is never guaranteed to run. Scarce system resources like directory handles should never depend on garbage collection to collect them.

The stream design seems sane. Or yours, but with dir.close(). Alternately a lower-level API like const d = await fs.opendir(dir); const entry = await fs.readdir(d); fs.closedir(d) would be reasonable for core IMO.

Fishrock123 commented 9 years ago

Also it hasn't been discussed much here, but not limiting readdir is a security risk — allows virtually unlimited memory allocation initiated by user input. I.e. a file server implementation (or a web server, because everyone seem to like web servers) that allow uploading/adding files and listing files could be DoS'd by creating many files and list the entries.

This shouldn't be an issue with the streaming proposal, but should be with the current implementations I think.

rsms commented 9 years ago

@domenic I'm well aware how v8 works ;)

(BTW, object streams seems to be favoured by individuals who tend to seek out opportunities to create new abstractions — it doesn't add clarity and suffers in performance IMHO.)

TL;DR: async bridge toll costs would make reading single dirents unfeasible.

libuv/libuv#416 exposes the following interface:

int uv_fs_opendir(uv_loop_t*, uv_fs_t*, const char* path, uv_fs_cb);
int uv_fs_readdir(uv_loop_t*, uv_fs_t*, const uv_dir_t*, uv_dirent_t[], size_t limit, uv_fs_cb);
int uv_fs_closedir(uv_loop_t*, uv_fs_t*, const uv_dir_t*, uv_fs_cb);

Where the readdir call reads up to limit amount of entries. There's a bridge toll for any uv async call (which can be quite dramatic if you perform many calls, like i.e. a compiler would do.) It makes sense to read a sensible amount of dirents per uv request, to avoid paying high bridge tolls.

Revised proposal for exposure in node

let fs = require("fs");
fs.openDir("foo", function (err, dir) {
  if (err) { throw err; }
  let readNext = function() {
    dir.read(100, function (err, entries) {
      if (err) { throw err; }
      // deal with entries
      if (entries.length === 100) {
        readNext();
      } else {
        dir.close(function(err) {
          if (err) { throw err; }
          // done reading directory
        });
      }
    });
  }
  readNext();
});

Addition to lib/fs.js:

let fsb = process.binding("fs")

export class Dir {
  read(limit :number, cb :(err:Error,entries:DirEnt[])=>void) :void {}
  close(cb: (err:Error)=>void) :void {}
}

interface DirEnt {
  name :string;
  type :string; // "dir"|"file"|"link"|"fifo"|"socket"|"char"|"block"|"?"
}

export function openDir(path :string, cb :(err:Error,dir:Dir)=>void) {
  fsb.opendir(path, (err, handle) => {
    if (!err) {
      let dir = new Dir;
      dir._handle = handle;
    }
    cb(err, dir);
  })
}

Dir.prototype.read = function(limit, cb) {
  if (!limit || limit < 1) { limit = 100 }
  fsb.readdir(this._handle, limit, (err, entries) => {
    if (err) {
      fsb.closedir(this._handle)
      this._handle = null
    }
    cb(err, entries);
  })
}

Dir.prototype.close = function(cb) {
  if (!this._handle) { throw new Error('already closed') }
  fsb.closedir(this._handle, cb)
  this._handle = null;
}

Alternative w/ promises

let fs = require("fs");
let dir = await fs.openDir("foo"), entries;
do {
  entries = await dir.read(100)
  // deal with entries
} while (entries.length === 100);
await dir.close();
// done reading directory

Addition to lib/fs.js:

let fsb = process.binding("fs")

export class Dir {
  async read(limit :number) :DirEnt[] {}
  async close() :void {}
}

interface DirEnt {
  name :string;
  type :string; // "dir"|"file"|"link"|"fifo"|"socket"|"char"|"block"|"?"
}

export function openDir(path :string) :Promise<Dir> {
  return new Promise((resolve, reject) => {
    fsb.opendir(path, (err, handle) => {
      if (err) {
        reject(err);
      } else {
        let dir = new Dir;
        dir._handle = handle;
        resolve(dir);
      }
    })
  })
}

Dir.prototype.read = function(limit) {
  if (!limit || limit < 1) { limit = 100 }
  return new Promise((resolve, reject) => {
    fsb.readdir(this._handle, limit, (err, entries) => {
      if (err) {
        fsb.closedir(this._handle)
        this._handle = null
        reject(err);
      } else {
        resolve(entries);
      }
    })
  })
}

Dir.prototype.close = function() {
  return new Promise((resolve, reject) => {
    if (!this._handle) { throw new Error('already closed') }
    fsb.closedir(this._handle, err => {
      if (err) { reject(err); } else { resolve(); }
    })
    this._handle = null;
  })
}
hollowdoor commented 9 years ago

How are things going with this?

Fishrock123 commented 9 years ago

@hollowdoor This is dependent on this libuv PR: https://github.com/libuv/libuv/pull/416

Trott commented 8 years ago

This has been a while and will probably be a while longer. From https://github.com/libuv/libuv/pull/416#issuecomment-218483712:

There is no defined roadmap as of yet. v2 won't happen overnight, that's for sure. Time is a scarce resource, I'm afraid. Personally I haven't had the time to work on any of the things I wanted for v2 yet.

So for this to happen, if I understand everything correctly:

I'm not sure if this is a good use of the stalled label or not, but I'll just leave this comment for anyone curious if there's been progress since October...

DomVinyard commented 8 years ago

Any progress on this?

Trott commented 8 years ago

@DomVinyard It would appear no. If you want to see this happen, it requires https://github.com/libuv/libuv/pull/416 so maybe head over there to see if there's a way you can help. /cc @whitlockjc

DomVinyard commented 8 years ago

@Trott Appreciate the fwd. For anybody else awaiting this to solve the same subset of problems as me - a combination of native readdir and stream iterator gets you some of the way.

DomVinyard commented 7 years ago

Almost three months later, still waiting eagerly on this. native-readdir is a nice shim but it's not compatible with windows. Does anybody have any other solutions?

jasnell commented 7 years ago

There's really nothing we can do on this until the appropriate support lands in libuv and that is updated here. Best would be to try to go shake the tree on that side.

whitlockjc commented 7 years ago

I'll be picking this up tonight, I apologize for the delay. Job changes and such have pulled me away from Node.js for a few months. I'm back and will get libuv support in ASAP.