Closed jonathanong closed 5 years ago
just fyi the streams wg is broadly in favor of adding async iterator support to streams once v8 supports it.
Why can't we just get a simple promise when the callback is omitted, for example:

fs.readdir(path[, options])
  .then(files => console.log(files))
  .catch(err => console.log(err));

I know I can use https://github.com/thenables/thenify, but I assumed that since Node supports promises it would use them...
@zevero Wouldn't that present the same problem as a callback for reading large masses of file names? The benefit of streams is that they save memory. A stream can be abstracted as a promise to get the benefits of a promise. A promise can't be abstracted as a stream to get the benefits of a stream.
Indeed. Promises are a bit off-topic here.
On a related note though, there has been discussion about streams supporting async iterators in the future, enabling async for-of to work. https://github.com/nodejs/readable-stream/issues/254
Should this remain open?
@Trott Is there a streaming API for fs.readdir yet?
Here's what has to happen for this, now in a new and exciting CHECKLIST. Progress!
Then this can be implemented.
Fair enough! Crazy to me that this still isn't possible yet.
We know what we need to make it happen, we just can't do it yet, unfortunately. It's definitely something I'd like to have.
@whitlockjc How is progress? The PR on libuv looks complete, but tests are failing.
I wonder if this is something async iterators could be used for, on top of a promise-based API.
/cc @jasnell @mcollina
@MylesBorins I think so, but we need libuv support first. Without the ability to receive all the dir content in a chunked fashion, we can't do much here.
After https://github.com/nodejs/node/pull/17755 lands, we have async iterators support here for free if we expose it as a stream.
Specifically, without the libuv support, there's absolutely zero performance advantage to this at all.
For reference, here is the libuv issue
Running into a situation atm where this would be nice...
Indeed. I think it's time to revisit.
IMHO this could be remedied by introducing a new dir/glob function to node_file:
fs.dir(pattern [, options])
It only needs to implement the simplest wildcards ? and *, using a C function like nftw()
I think it would be acceptable for this function to return a promise of an array, since the programmer can control memory usage by choosing an appropriate pattern.
The advantage of this approach is that it preserves backwards compatibility, while enabling use in large scale operations, which can easily require handling of 10^6+ files, without gobbling up memory.
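The proposed fs.dir(pattern) never landed in core, but the simple `?`/`*` matching it describes is easy to sketch in userland by translating the wildcard pattern into a RegExp. The helper name `wildcardToRegExp` is hypothetical:

```javascript
// Convert a simple wildcard pattern (only ? and * supported) into a
// RegExp: escape regex metacharacters, then map * -> .* and ? -> .
// Hypothetical helper; not part of any Node.js core API.
function wildcardToRegExp(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const translated = escaped.replace(/\*/g, '.*').replace(/\?/g, '.');
  return new RegExp(`^${translated}$`);
}

console.log(wildcardToRegExp('*.txt').test('notes.txt')); // true
console.log(wildcardToRegExp('file?.log').test('file12.log')); // false
```

Doing this matching in userland on top of a streamed directory listing gets most of the memory benefit, even without pushing the pattern down into libuv.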
Hey @paragi
It would be nice to have that fs.dir() function.
Notice this question - it might be that nftw is not the best starting point.
Anyway, I don't think it really removes the need for directory streaming. E.g. when walking an entire filesystem tree you might stumble on a large folder, or the filtering might not be based on name patterns, e.g. by file type (dir/file/other). It is unfortunate that the PR https://github.com/libuv/libuv/pull/416 has been stuck for a while now.
@guymguym You are right. The Boost.Filesystem library is probably the right choice for C++.
The advantage of fs.dir() over a stream version of fs.readdir() would be that the matching is done at the lowest possible level. IMO they're two different use cases. But there is of course no good reason not to mix them :)
This looks promising! Looking forward to testing it 🥇
@paragi I have created a nice wrapper around the native operating-system find command that works on macOS, Linux, and Windows, and returns a real observable stream, to tide things over until libuv is ready.
- [X] libuv/libuv#2057 has to land in libuv
- [ ] libuv has to release v1.x.
- [ ] Node.js has to start using the v1.x release of libuv.
Then this can be implemented.
made changes to the tasks
has landed, and on the v1.x branch, so it just has to be released and Node.js needs to update to use it. There are some caveats mentioned in the PR that should be read before trying to figure out how the API should look in Node.
Should this API also support returning Dirent values?
Should this API also support returning Dirent values?

To me, a filename and a boolean is_directory would suffice.
You might consider implementing it as an added directory separator at the end of a directory name, so that it just returns a string per entry.
Thank you guys for the effort!
has landed, and on the v1.x branch, so it just has to be released and Node.js needs to update to use it. There are some caveats mentioned in the PR that should be read before trying to figure out how the API should look in Node.
That's not entirely necessary, updating libuv to a non-release is pretty easy and it's what I did.
I'm working on this for now, and will hand off to Colin if I get stuck. The whole uv_dir_t thing is a bit of an internals paradigm change, so there's some extra work to do under the hood making a new handle class (DirHandle) and whatnot.
Should this API also support returning Dirent values?
My plan is to have it only return Dirent values. Hooray future.
Fwiw I'm presently focused on making this work directly as just an iterator (both sync and async). I'll see how it pans out.
I'm having a hard time imagining anyone would actually want a stream of directory entries...
i think stream was only suggested because async iterables weren't really a thing back in 2015.
i think stream was only suggested because async iterables weren't really a thing back in 2015.
I agree.
I can imagine 3 use cases, where you want to iterate over:
I don't think you need to support the third. You need to be able to do it on a very large dataset (10 million+ entries), preferably doing wildcard selection at the library level. I really don't see the need to return anything other than the filename (dirs appended with a dir separator).
We're getting there — here's a taste: https://twitter.com/Fishrock123/status/1159337458091167745
There are very valid use cases for streaming a list of directory entries. We have one where we need to check for the presence of millions of directories to audit data, for instance (you don't want to know the file count). We check the actual files separately. And it's easy to say at face value that we can implement it differently, but at the code & data level, iterating files is just not an option.
Now with async iteration: https://twitter.com/Fishrock123/status/1161794794361774080
@rijnhard That's gonna add a whole bunch of overhead...
But to be clear, you want to: have an object stream of directory entries, transform those into file reads, in the same stream?
That doesn't really make sense to me. You end up multiplexing/splitting the stream anyway, and that leads to a very similar can of worms as just "iterating" and "awaiting" until a stream is finished to pull the next entry.
Maybe you could elaborate more? I am presently avoiding making this a stream due to complexity and lack of solid use cases.
@Fishrock123 I think I misunderstood, I just need to iterate through directories and not their contents.
While you are working on it, would it be a lot of trouble to add a function with glob, or just '?' and '*', wildcard functionality? I presume performance and memory usage would vastly improve when working on very large filesystems if you do the scanning in libuv. It would be much appreciated 🥇
@paragi This API will do no filtering and will not return entries from subdirectories. That will be up to a module to do.
Ok folks, the pull request is up: https://github.com/nodejs/node/pull/29349
const fs = require('fs');

async function print(path) {
  const dir = await fs.promises.opendir(path);
  for await (const dirent of dir) {
    console.log(dirent.name);
  }
}

print('./').catch(console.error);
Landed in cbd8d715b2286e5726e6988921f5c870cbf74127 as fs{Promises}.opendir(), which returns an fs.Dir, which exposes an async iterator. 🎉
This has been released in 12.12.0
What about also making directories sync iterable (as initially suggested)?
I think this could be done by using dir.readSync().
The commit message could be "fs: make directories sync iterable".
@ma11hew28 sorry to tell you, but sync can't be iterable, as it's sync :) an iterable is an async type
There are sync iterators too. It's entirely possible to make a sync version.
@Qard why should I use an iterator for a sync call that will return after everything is in memory already? But OK, you're right, it could exist; it can be done. I myself would suggest for...of as the iteration method, but OK.
Thank you, @frank-dspeed and @Qard, for responding. I'm sorry for not responding promptly. @frank-dspeed, I'm sorry, but I think you misunderstood me. What you suggested is what I meant, as it is the first part of what @jonathanong initially suggested. I.e., if we make directories sync iterable, then you could do something like this:
const fs = require('fs')
const dir = fs.opendirSync('.')

for (const dirent of dir) {
  console.log(dirent.name)
}
instead of something like this:
const fs = require('fs')
const dir = fs.opendirSync('.')

let dirent
while ((dirent = dir.readSync()) !== null) {
  console.log(dirent.name)
}
dir.closeSync()
As for the second sentence of my initial comment on this issue, by "this", I meant "making directories sync iterable". I.e., I think a directory's default sync iterator's next() method could call dir.readSync() to get the directory's next entry.
@frank-dspeed It doesn't have to all be loaded into memory with a sync iterator. If you have a directory with millions of entries in it, a sync iterator could read and return entries one at a time, or in batches, but synchronously.
Since we're in ES6 territory now, I'm thinking the sync version should be an iterable and the async version an object stream:
See: https://github.com/joyent/node/issues/388