Open jimmywarting opened 10 months ago
but i would change the node example to use TextDecoderStream in any case, instead of relying on NodeJS-specific stuff, for better cross compat with other platforms. And i would have used ReadableStream.from(iterable) instead of importing node:stream, which would add other dependencies into your final web bundle.
```js
import fs from 'node:fs';

let stream = fs.createReadStream(filePath);
let webStream = ReadableStream.from(stream);
let textStream = webStream.pipeThrough(new TextDecoderStream());

let parser = null;
for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}
let result = parser.end();
```
there is also this new `fs.openAsBlob(path)` that returns a blob with a `.stream()` method.
```js
let blob = await fs.openAsBlob(filePath);
let textStream = blob.stream().pipeThrough(new TextDecoderStream());

let parser = null;
for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}
let result = parser.end();
```
but it requires node v20+
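Since `fs.openAsBlob` is missing on older Node versions, a feature check can pick a fallback at runtime. This is a minimal sketch, not the library's own code: the demo file is made up, `Readable.toWeb` (Node v17+) serves as the fallback bridge, and string concatenation stands in for the real parser calls.

```js
import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';
import { Readable } from 'node:stream';

// Demo file; name and contents are made up for illustration.
const filePath = path.join(os.tmpdir(), 'openasblob-demo.csv');
fs.writeFileSync(filePath, 'a,b\n1,2\n');

// Prefer fs.openAsBlob (Node v20+); otherwise bridge a classic
// Readable into a web ReadableStream via Readable.toWeb (Node v17+).
const byteStream = typeof fs.openAsBlob === 'function'
  ? (await fs.openAsBlob(filePath)).stream()
  : Readable.toWeb(fs.createReadStream(filePath));

let text = '';
for await (const strChunk of byteStream.pipeThrough(new TextDecoderStream())) {
  text += strChunk; // a real parser would consume chunks here
}
```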
maybe you could also get some perf boost by moving the `parser ??= initParser(inferSchema(strChunk));` out of the loop?
```js
let blob = await fs.openAsBlob(filePath);
let stream = blob.stream().pipeThrough(new TextDecoderStream());
let iterator = stream.values();

let firstChunk = await iterator.next();
let parser = initParser(inferSchema(firstChunk.value));
parser.chunk(firstChunk.value, parser.stringArrs);

// continue on the same iterator.
for await (const strChunk of iterator) {
  parser.chunk(strChunk, parser.stringArrs);
}
let result = parser.end();
```
could also use two loops, as described in this answer: https://stackoverflow.com/a/51020535/1008999
```js
let blob = await fs.openAsBlob(filePath);
let stream = blob.stream().pipeThrough(new TextDecoderStream());
let iterator = stream.values();

// this outer loop will only run once, for the first chunk.
for await (const firstChunk of iterator) {
  const parser = initParser(inferSchema(firstChunk));
  parser.chunk(firstChunk, parser.stringArrs);

  // continue on the same iterator by consuming the rest of it.
  for await (const strChunk of iterator) {
    parser.chunk(strChunk, parser.stringArrs);
  }

  let result = parser.end();
}
```
hey, yeah we can try these out and see.
in my testing, the older/more mature apis tend to be faster. for example, the `.toWeb` variant is like 20%-30% slower.
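For reference, the `.toWeb` variant being compared is roughly this sketch (the demo file is made up, and string collection stands in for the parser calls):

```js
import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';
import { Readable } from 'node:stream';

// Demo file; name and contents are made up for illustration.
const filePath = path.join(os.tmpdir(), 'toweb-demo.csv');
fs.writeFileSync(filePath, 'x,y\n3,4\n');

// stream.Readable.toWeb() (Node v17+) bridges a Node stream into a
// web ReadableStream, which can then pipe through TextDecoderStream.
const textStream = Readable.toWeb(fs.createReadStream(filePath))
  .pipeThrough(new TextDecoderStream());

let text = '';
for await (const strChunk of textStream) {
  text += strChunk; // a real parser would feed chunks here
}
```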
all these variations would be good for a wiki page or something, but i'd like to keep the fastest option in the main readme.
this thread is kind of nice affirmation that i made the right choice to keep the chunk-feeding pipeline external to the core.
> maybe you could also get some perf boost by moving the `parser ??= initParser(inferSchema(strChunk));` out of the loop?
the chunk size is fixed to 64KB [1], so you have 16 of these per MB. if we're maxing out at about 300MB/s, that's 4800 null checks per second, which should be noise-level (even if the JIT doesn't optimize them out, which looks like it should be easy). i'll bet that nullish coalescing on its own runs on the order of millions of ops/sec. i'm happy to be proven wrong...with data ;)
[1] https://github.com/nodejs/node/issues/41611#issuecomment-1703765405
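A quick way to sanity-check that claim is a rough micro-measurement (not a rigorous benchmark; numbers vary by machine, and `init` here is just a stand-in for `initParser`):

```js
// After the first assignment, ??= is only a null check on a local.
let parser = null;
const init = () => ({ ready: true }); // stand-in for initParser()

const iterations = 1_000_000;
const start = process.hrtime.bigint();
for (let i = 0; i < iterations; i++) {
  parser ??= init(); // only the very first iteration calls init()
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
// a million of these checks typically completes in a few
// milliseconds, i.e. well into the "millions/sec" ballpark.
```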
it looks like https://github.com/nodejs/node/pull/49089 just landed in v20.6.0: https://nodejs.org/en/blog/release/v20.6.0 :eyes:
Not really worth opening as its own issue, especially if these examples are getting reworked anyhow, but there is a typo in the node example here. That variable should be called `parser`, not `p`!
oops!
Wanted to open up a discussion, but that was closed. I know it was not an issue, so i closed it after i wrote this.
NodeJS streams as well as web streams are both async iterable, and both yield Uint8Arrays. and both envs have TextDecoderStream in the global namespace, so there is no need to use an EventEmitter and write different code for both.
so this
and this:
could be written as: