nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

streaming / iterative fs.readdir #583

Closed jonathanong closed 5 years ago

jonathanong commented 9 years ago

since we're in ES6 territory now, i'm thinking the sync version should be an iterable

var dirs = fs.readdirIter(__dirname);
for (const dir of dirs) {
  // handle each entry
}

and have the async version be an object stream:

var stream = fs.readdirStream(__dirname);
stream.on('data', dir => { /* handle each entry */ });

See: https://github.com/joyent/node/issues/388

calvinmetcalf commented 7 years ago

just fyi the streams wg is broadly in favor of adding async iterator support to streams once v8 supports it.

zevero commented 7 years ago

Why can't we have just a simple promise if callback is omitted, like for example

fs.readDir(path[, options])
  .then(files=>console.log(files))
  .catch(err=>console.log(err));

I know I can use https://github.com/thenables/thenify but I assumed that since node supports Promises it would use them...

hollowdoor commented 7 years ago

@zevero Wouldn't that present the same problem as a callback for reading large masses of file names? The benefit of streams is that they save memory. A stream can be abstracted as a promise to get the benefits of a promise. A promise can't be abstracted as a stream to get the benefits of a stream.
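
A minimal sketch of the first half of that claim, assuming a Node.js version that provides `Readable.from`. The `collect` helper and the file names are hypothetical; a real directory stream would take the place of `names`:

```javascript
const { Readable } = require('stream');

// Hypothetical helper: buffers every chunk of an object stream and
// resolves with the full array once the stream ends.
function collect(stream) {
  return new Promise((resolve, reject) => {
    const items = [];
    stream.on('data', (item) => items.push(item));
    stream.on('error', reject);
    stream.on('end', () => resolve(items));
  });
}

// Stand-in for a directory stream; the names are made up.
const names = Readable.from(['a.txt', 'b.txt', 'c.txt']);
const collected = collect(names);
collected.then((files) => console.log(files));
```

Going the other way is not possible: once the promise resolves with an array, the memory for every entry has already been spent.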

Qard commented 7 years ago

Indeed. Promises are a bit off-topic here.

On a related note though, there has been discussion about streams supporting async iterators in the future, enabling async for-of to work. https://github.com/nodejs/readable-stream/issues/254

Trott commented 7 years ago

Should this remain open?

clee commented 7 years ago

@Trott Is there a streaming API for fs.readdir yet?

Trott commented 7 years ago

Here's what has to happen for this, now in a new and exciting CHECKLIST. Progress!

Then this can be implemented.

clee commented 7 years ago

Fair enough! Crazy to me that this still isn't possible yet.

jasnell commented 7 years ago

We know what we need to make it happen, we just can't do it yet. Unfortunately. It's definitely something I'd like to have.

hollowdoor commented 7 years ago

@whitlockjc How is progress? The PR on libuv looks complete, but it has failing tests.

MylesBorins commented 6 years ago

I wonder if this is something Async iterators could be used for on top of a promise based api

/cc @jasnell @mcollina

mcollina commented 6 years ago

@MylesBorins I think so, but we need libuv support first. Without the ability to receive all the dir content in a chunked fashion, we can't do much here.

After https://github.com/nodejs/node/pull/17755 lands, we have async iterators support here for free if we expose it as a stream.
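
As a rough illustration, assuming a Node.js release in which readable streams are async iterable and `Readable.from` is available, consuming an object stream entry by entry might look like this. `countEntries` and the entry names are hypothetical, and the stream here is only a stand-in for a chunked directory stream:

```javascript
const { Readable } = require('stream');

// Hypothetical helper: walks any readable object stream one entry at a
// time with `for await`, without keeping earlier entries around.
async function countEntries(stream) {
  let count = 0;
  for await (const entry of stream) {
    count += 1; // a real consumer would process `entry` here
  }
  return count;
}

// Stand-in for a chunked directory stream; the names are made up.
const total = countEntries(Readable.from(['index.js', 'lib', 'test']));
total.then((n) => console.log(`${n} entries`));
```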

jasnell commented 6 years ago

Specifically, without the libuv support, there's absolutely zero performance advantage to this at all.

MylesBorins commented 6 years ago

For reference, here is the libuv issue

https://github.com/libuv/libuv/pull/416

Fishrock123 commented 6 years ago

Running into a situation atm where this would be nice...

jasnell commented 6 years ago

Indeed. I think it's time to revisit.

paragi commented 5 years ago

IMHO This could be remedied by introducing a new dir/glob function to node_file:

fs.dir(pattern [, options])

It only needs to implement the simplest wildcards ? and *, using a C function like nftw()

I think it would be acceptable for this function to return a promise of an array, since the programmer can control the memory usage by choosing an appropriate pattern.

The advantage of this approach is that it preserves backwards compatibility, while enabling use in large scale operations, which can easily require handling of 10^6+ files, without gobbling up memory.

guymguym commented 5 years ago

Hey @paragi

It would be nice to have that fs.dir() function. Notice this question - it might be that nftw is not the best starting point.

Anyway, I don't think it really removes the need for directory streaming, e.g. when walking an entire filesystem tree you might stumble upon a large folder, or when the filtering is not based on name patterns but on file type (dir/file/other). It is unfortunate that the PR https://github.com/libuv/libuv/pull/416 has been stuck for a while now.

paragi commented 5 years ago

@guymguym You are right. The Boost.Filesystem library is probably the right choice for C++.

The advantage of fs.dir() over a stream version of fs.readdir() would be that the matching is done at the lowest possible level. IMO these are two different use cases. But there is of course no good reason not to mix them :)

paragi commented 5 years ago

This looks promising! Looking forward to testing it 🥇

frank-dspeed commented 5 years ago

@paragi I have created a nice wrapper around the native operating system find command that works on Mac, Linux, and Windows and returns a real Observable stream, as a stopgap until libuv is ready.

rijnhard commented 5 years ago

  • [X] libuv/libuv#2057 has to land in libuv
  • [ ] libuv has to release v1.x.
  • [ ] Node.js has to start using the v1.x release of libuv.

Then this can be implemented.

made changes to the tasks

has landed, and to the V1.X branch so it just has to be released and Node.js needs to update to use it. There are some caveats mentioned in the PR that should be read before trying to figure how the API should look in node.

YurySolovyov commented 5 years ago

Should this API also support returning Dirent values?

paragi commented 5 years ago

Should this API also support returning Dirent values?

To me, a filename and a boolean is_directory would suffice.

You might consider implementing it as a directory separator appended to each directory name, so that it returns just one string per entry.

Thank you guys for the effort!

Fishrock123 commented 5 years ago

has landed, and to the V1.X branch so it just has to be released and Node.js needs to update to use it. There are some caveats mentioned in the PR that should be read before trying to figure how the API should look in node.

That's not entirely necessary, updating libuv to a non-release is pretty easy and it's what I did.

I'm working on this for now, and will hand off to Colin if I get stuck. The whole uv_dir_t thing is a bit of an internals paradigm change so there's some extra work to do under the hood making a new handle class (DirHandle::) and whatnot.

Should this API also support returning Dirent values?

My plan is to have it only return Dirent values. Hooray future.

Fwiw I'm presently focused on making this work directly as just an iterator (both sync and async). I'll see how it pans out.

I'm having a hard time imagining anyone would actually want a stream of directory entries...

devsnek commented 5 years ago

i think stream was only suggested because async iterables weren't really a thing back in 2015.

paragi commented 5 years ago

i think stream was only suggested because async iterables weren't really a thing back in 2015.

I agree.

I can imagine 3 use cases, where you want to iterate over:

  1. a list of all entries
  2. a list of some selected files. (glob functionality)
  3. (rare) a list of files selected on very complex criteria.

I don't think you need to support the 3rd. You need to be able to do it on a very large dataset (10 million+ entries), preferably doing wildcard selection at the library level. I really don't see the need to return anything other than the filename (directories appended with a directory separator).

Fishrock123 commented 5 years ago

We're getting there — here's a taste: https://twitter.com/Fishrock123/status/1159337458091167745

rijnhard commented 5 years ago

There are very valid use cases for streaming a list of directory entries. We have one where we need to check for the presence of millions of directories to audit data, for instance (you don't want to know the file count). We check the actual files separately. And it's easy to say at face value that we can implement it differently, but at the code and data level, iterating files is just not an option.

Fishrock123 commented 5 years ago

Now with async iteration: https://twitter.com/Fishrock123/status/1161794794361774080

Fishrock123 commented 5 years ago

@rijnhard That's gonna add a whole bunch of overhead...

But to be clear, you want to: have an object stream of directory entries, transform those into file reads, in the same stream?

That doesn't really make sense to me. You end up multiplexing/splitting the stream anyways and that leads to a very similar can of worms as just "iterating" and "awaiting" until a stream is finished to pull the next entry.

Maybe you could elaborate more? I am presently avoiding making this a stream due to complexity and lack of solid use cases.

rijnhard commented 5 years ago

@Fishrock123 I think I misunderstood, I just need to iterate through directories and not their contents.

paragi commented 5 years ago

While you are working on it, would it be a lot of trouble to add a function with glob or just '?' and '*' wildcard functionality? I presume performance and memory usage would vastly improve when working on very large filesystems if you do the scanning in libuv. It would be much appreciated 🥇

Fishrock123 commented 5 years ago

@paragi This API will do no filtering and will not return entries from subdirectories. That will be up to a module to do.

Fishrock123 commented 5 years ago

Ok folks, the pull request is up: https://github.com/nodejs/node/pull/29349

const fs = require('fs');

async function print(path) {
  const dir = await fs.promises.opendir(path);
  for await (const dirent of dir) {
    console.log(dirent.name);
  }
}
print('./').catch(console.error);

Fishrock123 commented 5 years ago

Landed in cbd8d715b2286e5726e6988921f5c870cbf74127 as fs{Promises}.opendir(), which returns an fs.Dir, which exposes an async iterator. 🎉

Fishrock123 commented 5 years ago

This has been released in 12.12.0
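
Since the API itself does no filtering and does not descend into subdirectories, that part is left to userland. A minimal sketch of such a module on top of `fs.promises.opendir()` might look like the following; `walk` and its `filter` parameter are hypothetical names, not part of Node.js:

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical userland helper: recursively yields the path of every
// entry for which `filter(dirent)` returns true.
async function* walk(root, filter = () => true) {
  const dir = await fs.promises.opendir(root);
  // The async iterator of fs.Dir closes the directory when the loop ends.
  for await (const dirent of dir) {
    const full = path.join(root, dirent.name);
    if (filter(dirent)) yield full;
    // Symbolic links are not followed, because their Dirent is not
    // reported as a directory.
    if (dirent.isDirectory()) yield* walk(full, filter);
  }
}

// Usage (not invoked here): print only the directories below `root`.
async function printDirs(root) {
  for await (const p of walk(root, (d) => d.isDirectory())) {
    console.log(p);
  }
}
```

Calling `printDirs('.').catch(console.error)` would list every directory below the current one while keeping only one open directory handle per nesting level.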

ma11hew28 commented 4 years ago

What about also making directories sync iterable (as initially suggested)?

I think this could be done by using dir.readSync().

The commit message could be "fs: make directories sync iterable".

frank-dspeed commented 4 years ago

@ma11hew28 sorry to tell you that sync can't be iterable as it's sync :) an iterable is an async type

Qard commented 4 years ago

There are sync iterators too. It's entirely possible to make a sync version.

frank-dspeed commented 4 years ago

@Qard why should I use an iterator for a sync call that returns only after everything is in memory already? But OK, you're right, it could exist and it can be done. I myself would suggest for...of as the iteration method, but OK.

ma11hew28 commented 4 years ago

Thank you, @frank-dspeed and @Qard, for responding. I'm sorry for not responding promptly. @frank-dspeed, I'm sorry, but I think you misunderstood me. What you suggested is what I meant, as it is the first part of what @jonathanong initially suggested. I.e., if we make directories sync iterable, then you could do something like this:

const fs = require('fs')

const dir = fs.opendirSync('.')
for (const dirent of dir) {
  console.log(dirent.name)
}

instead of something like this:

const fs = require('fs')

const dir = fs.opendirSync('.')
let dirent
while ((dirent = dir.readSync()) !== null) {
  console.log(dirent.name)
}
dir.closeSync()

As for the second sentence of my initial comment on this issue, by "this", I meant "making directories sync iterable". I.e., I think a directory's default sync iterator's next() method could call dir.readSync() to get the directory's next entry.
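
Until something like that exists in core, the same idea can be approximated in userland. A minimal sketch, where `direntsSync` is a hypothetical helper name and only `fs.opendirSync()`, `dir.readSync()`, and `dir.closeSync()` are real APIs:

```javascript
const fs = require('fs');

// Hypothetical helper: a sync generator whose next() reads exactly one
// entry via dir.readSync(), so entries are produced lazily.
function* direntsSync(path) {
  const dir = fs.opendirSync(path);
  try {
    let dirent;
    while ((dirent = dir.readSync()) !== null) {
      yield dirent;
    }
  } finally {
    // Runs when iteration completes and also when the consumer
    // leaves the for...of loop early.
    dir.closeSync();
  }
}

for (const dirent of direntsSync('.')) {
  console.log(dirent.name);
}
```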

Qard commented 4 years ago

@frank-dspeed It doesn't have to all be loaded into memory with a sync iterator. If you have a directory with millions of entries in it, a sync iterator could read and return entries one at a time, or in batches, but synchronously.
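
A rough sketch of the batched variant, assuming the released `fs.opendirSync()` API; `readdirBatchesSync` and its `size` parameter are hypothetical:

```javascript
const fs = require('fs');

// Hypothetical helper: synchronously yields arrays of at most `size`
// Dirent objects, so only one batch is held in memory at a time.
function* readdirBatchesSync(path, size = 1000) {
  const dir = fs.opendirSync(path);
  try {
    let batch = [];
    let dirent;
    while ((dirent = dir.readSync()) !== null) {
      batch.push(dirent);
      if (batch.length === size) {
        yield batch;
        batch = [];
      }
    }
    // Flush the final, possibly shorter, batch.
    if (batch.length > 0) yield batch;
  } finally {
    dir.closeSync();
  }
}

for (const batch of readdirBatchesSync('.', 100)) {
  console.log(`read ${batch.length} entries`);
}
```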