leeoniya / uDSV

A faster CSV parser in 5KB (min)
MIT License

make both node & web stream example the same #1

Open jimmywarting opened 10 months ago

jimmywarting commented 10 months ago

Node.js streams and web streams are both async iterable, and both yield Uint8Arrays. Both environments also have TextDecoderStream in the global namespace,

so there is no need to use an EventEmitter or to write different code for each.

so this

let stream = fs.createReadStream(filePath);

let parser = null;
let result = null;

stream.on('data', (chunk) => {
  // convert from Buffer
  let strChunk = chunk.toString();
  // on first chunk, infer schema and init parser
  parser ??= initParser(inferSchema(strChunk));
  // incremental parse to string arrays
  parser.chunk(strChunk, parser.stringArrs);
});

stream.on('end', () => {
  result = p.end();
});

and this:

let stream = fs.createReadStream(filePath);

let webStream = Stream.Readable.toWeb(stream);
let textStream = webStream.pipeThrough(new TextDecoderStream());

let parser = null;

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}

let result = parser.end();

could be written as:

// node stream (Buffer chunks):
for await (const chunk of stream) {
  let strChunk = chunk.toString();
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}

// web stream (string chunks):
for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}
jimmywarting commented 10 months ago

but i would change the node example to use TextDecoderStream in any case, instead of relying on NodeJS-specific stuff, for better cross-compatibility with other platforms.

And i would have used ReadableStream.from(iterable) instead of importing node:stream, which would add other dependencies into your final web bundle.

let stream = fs.createReadStream(filePath);

let webStream = ReadableStream.from(stream);
let textStream = webStream.pipeThrough(new TextDecoderStream());

let parser = null;

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}

let result = parser.end();
jimmywarting commented 10 months ago

there is also this new fs.openAsBlob(path) that returns a blob with a .stream() method.

let blob = await fs.openAsBlob(filePath);
let textStream = blob.stream().pipeThrough(new TextDecoderStream());

let parser = null;

for await (const strChunk of textStream) {
  parser ??= initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);
}

let result = parser.end();

but it requires node v20+
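
as a rough sketch (not from this thread), you could feature-detect it and fall back to Readable.toWeb() on older versions; getByteStream is just a hypothetical helper name:

import * as fs from 'node:fs';
import { Readable } from 'node:stream';

// hypothetical helper: returns a web ReadableStream of bytes for a file path
async function getByteStream(filePath) {
  if (typeof fs.openAsBlob === 'function') {
    // node v20+: read via a Blob
    let blob = await fs.openAsBlob(filePath);
    return blob.stream();
  }
  // older node: wrap the classic stream as a web stream
  return Readable.toWeb(fs.createReadStream(filePath));
}

let textStream = (await getByteStream(filePath)).pipeThrough(new TextDecoderStream());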

jimmywarting commented 10 months ago

maybe you could also get some perf boost by moving the parser ??= initParser(inferSchema(strChunk)); out of the loop?

let blob = await fs.openAsBlob(filePath);
let stream = blob.stream().pipeThrough(new TextDecoderStream());

let iterator = stream.values();

let firstChunk = await iterator.next();
let parser = initParser(inferSchema(firstChunk.value));

parser.chunk(firstChunk.value, parser.stringArrs);

// continue on the same iterator.
for await (const strChunk of iterator) {
  parser.chunk(strChunk, parser.stringArrs);
}

let result = parser.end();
jimmywarting commented 10 months ago

could also use two loops, as described in this answer: https://stackoverflow.com/a/51020535/1008999

let blob = await fs.openAsBlob(filePath);
let stream = blob.stream().pipeThrough(new TextDecoderStream());
let iterator = stream.values();

// this will only loop 1 time.
for await (const strChunk of iterator) {
  const parser = initParser(inferSchema(strChunk));
  parser.chunk(strChunk, parser.stringArrs);

  // continue on the same iterator by consuming the rest of the iterator.
  for await (const strChunk of iterator) {
    parser.chunk(strChunk, parser.stringArrs);
  }

  let result = parser.end();
}
leeoniya commented 10 months ago

hey, yeah we can try these out and see.

in my testing, the older/more mature apis tend to be faster. for example, the .toWeb variant is like 20%-30% slower.
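
a rough way to sanity-check that (a sketch only, decoding without parsing, not the project's benchmark harness; filePath is assumed from the earlier examples):

import * as fs from 'node:fs';
import { Readable } from 'node:stream';

// time the classic on('data') path, decode only
async function timeClassic(filePath) {
  let t0 = performance.now();
  await new Promise((resolve, reject) => {
    fs.createReadStream(filePath)
      .on('data', (chunk) => { chunk.toString(); })
      .on('end', resolve)
      .on('error', reject);
  });
  return performance.now() - t0;
}

// time the toWeb + TextDecoderStream path, decode only
async function timeToWeb(filePath) {
  let t0 = performance.now();
  let textStream = Readable.toWeb(fs.createReadStream(filePath)).pipeThrough(new TextDecoderStream());
  for await (const strChunk of textStream) { /* decode only */ }
  return performance.now() - t0;
}

console.log('on("data"):', await timeClassic(filePath), 'ms');
console.log('toWeb:     ', await timeToWeb(filePath), 'ms');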

all these variations would be good for a wiki page or something, but i'd like to keep the fastest option in the main readme.

this thread is kind of nice affirmation that i made the right choice to keep the chunk-feeding pipeline external to the core.

maybe you could also get some perf boost by moving the parser ??= initParser(inferSchema(strChunk)); out of the loop?

the chunk size is fixed to 64KB [1], so you have 16 of these per MB. if we're maxing out at about 300MB/s then that's 4800 null checks that should be noise-level (even if the JIT doesn't optimize them out, which looks like it should be easy). i'll bet that nullish coalesce on its own is on the order of millions/sec. i'm happy to be proven wrong...with data ;)

[1] https://github.com/nodejs/node/issues/41611#issuecomment-1703765405
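
for what it's worth, a quick back-of-the-envelope sketch of that claim (not a rigorous benchmark; the JIT may well optimize the check away entirely):

let parser = null;
const N = 10_000_000;

let t0 = performance.now();
for (let i = 0; i < N; i++) {
  // mirrors the per-chunk `parser ??= ...` check; only assigns on the first pass
  parser ??= { chunk() {} };
}
let ms = performance.now() - t0;

console.log(`${N.toLocaleString()} nullish checks in ${ms.toFixed(1)} ms`);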

leeoniya commented 10 months ago

it looks like https://github.com/nodejs/node/pull/49089 just landed in v20.6.0: https://nodejs.org/en/blog/release/v20.6.0 :eyes:

ollyfg commented 10 months ago

Not really worth opening as its own issue, especially if these examples are getting reworked anyhow, but there is a typo in the node example here.

That variable should be called parser, not p!
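
i.e. the end handler should read:

stream.on('end', () => {
  result = parser.end();
});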

leeoniya commented 10 months ago

oops!

jimmywarting commented 10 months ago

Wanted to open up a discussion, but that was closed. I know it was not an issue, so i closed it after i wrote this.