mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License
12.5k stars 1.15k forks

Support async iterator protocol #638

Open dobesv opened 5 years ago

dobesv commented 5 years ago

ECMAScript 2018 introduces the AsyncIterator protocol for looping over large inputs. It's a good fit for CSV parsing. Papaparse could potentially provide a function that returns an async iterator instead of an array.

I created an example function that returns an async iterator wrapping Papaparse, which you can see here:

https://gist.github.com/dobesv/e637893adb0588a768db70e2c2e7ba29

Using the standard AsyncIterator has some advantages:

Feel free to adapt my example code for inclusion in Papaparse. Or, if you feel this would be better as a separate package let me know.
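
For readers who cannot open the gist, here is a minimal sketch of the general idea: bridging a callback-style parser to an async iterable. This is not the code from the gist. `fakeParse` is a hypothetical stand-in for a step/complete callback API, and `rowsAsAsyncIterable` and `collectRows` are illustrative names only.

```javascript
// Hypothetical stand-in for a callback-based parser (NOT PapaParse itself):
// calls `step` once per row, then `complete` when the input is exhausted.
function fakeParse(text, { step, complete }) {
  const lines = text.split('\n').filter((line) => line.length > 0);
  let i = 0;
  const tick = () => {
    if (i < lines.length) {
      step({ data: lines[i++].split(',') });
      setTimeout(tick, 0);
    } else {
      complete();
    }
  };
  setTimeout(tick, 0);
}

// Bridge the callbacks to an async iterable by queueing rows until consumed.
// Assumes a single consumer calling next() sequentially (as `for await` does).
function rowsAsAsyncIterable(text) {
  return {
    [Symbol.asyncIterator]() {
      const queue = []; // rows produced but not yet consumed
      let done = false; // set once the parser reports completion
      let wake = null;  // resolver for a consumer currently waiting in next()

      const notify = () => {
        if (wake) {
          const resolve = wake;
          wake = null;
          resolve();
        }
      };

      fakeParse(text, {
        step: (result) => { queue.push(result.data); notify(); },
        complete: () => { done = true; notify(); },
      });

      return {
        async next() {
          // Wait until there is a row to hand out or the parser has finished.
          while (queue.length === 0 && !done) {
            await new Promise((resolve) => { wake = resolve; });
          }
          if (queue.length > 0) return { value: queue.shift(), done: false };
          return { value: undefined, done: true };
        },
      };
    },
  };
}

// Usage: collect every row with `for await`.
async function collectRows(text) {
  const rows = [];
  for await (const row of rowsAsAsyncIterable(text)) rows.push(row);
  return rows;
}
```

Note that this sketch never pauses the producer, so a slow consumer lets the queue grow; a real implementation would probably want some form of backpressure.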

pokoli commented 5 years ago

Wow, this sounds like a very good addition for Papaparse.

Is this supported on all major browsers? I cannot find the feature on caniuse.com.

I'm wondering if it would be possible to activate this behaviour with a configuration parameter, so that if the parameter is set, Papaparse returns the iterator instead of the current array of results.

dobesv commented 5 years ago

AsyncIterator can be supported in browsers using babel; I don't think it is natively supported.

I think providing a different function in Papaparse, like Papa.asyncIterable({...options...}) would make more sense because the API is quite different in this case. It is not great design to have radically different return types for a single function.

dobesv commented 5 years ago

The async iterable protocol only requires Symbol.asyncIterator, which isn't necessarily available in the browser if you don't have a polyfill (e.g. @babel/polyfill). The for await syntax requires a transpiler for most browsers.
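
To illustrate the distinction between the protocol and the syntax, here is a rough sketch that defines and consumes an async iterable using only promises and the Symbol.asyncIterator key, with no `for await` and no async functions. `forEachAsync` and `countToThree` are illustrative names, not part of Papaparse or any library.

```javascript
// Consume any async iterable using plain promise chaining.
function forEachAsync(iterable, onValue) {
  var iterator = iterable[Symbol.asyncIterator]();
  function loop() {
    return iterator.next().then(function (result) {
      if (result.done) return undefined;
      // Wait for the callback (which may return a promise) before continuing.
      return Promise.resolve(onValue(result.value)).then(loop);
    });
  }
  return loop();
}

// Hand-written async iterable that yields 1, 2, 3 without generator syntax.
function countToThree() {
  var iterable = {};
  iterable[Symbol.asyncIterator] = function () {
    var n = 0;
    return {
      next: function () {
        n += 1;
        return Promise.resolve(
          n <= 3 ? { value: n, done: false } : { value: undefined, done: true }
        );
      },
    };
  };
  return iterable;
}
```

Code written this way only needs Symbol.asyncIterator (native or polyfilled) and promises at runtime.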

pokoli commented 5 years ago

Maybe we can add a new method named Papa.ParseIterable, which would behave like the current parse but return an iterable instead.

Will this feature require adding babel as a dependency? We should take care when adding new dependencies. Although I don't think babel would be a big issue, if it is required we should add it as an optional dependency.

dobesv commented 5 years ago

You shouldn't add babel as a dependency.

I doubt many people will use ES2018 features unless they are using babel or running in an ES2018 environment. It should be fine if this particular function just throws an error, like if (typeof Symbol === 'undefined' || !Symbol.asyncIterator) throw new Error('This feature requires ES2018');.
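
A guard of that kind could be sketched as a small helper. The names `hasAsyncIterator` and `requireAsyncIterator` are hypothetical and not part of Papaparse; the check uses truthiness rather than `typeof ... === 'symbol'` so that polyfilled symbols also pass.

```javascript
// True when the given Symbol constructor exposes an asyncIterator key.
function hasAsyncIterator(SymbolCtor) {
  return Boolean(SymbolCtor) && Boolean(SymbolCtor.asyncIterator);
}

// Throws in environments without Symbol.asyncIterator (native or polyfilled).
function requireAsyncIterator() {
  // `typeof Symbol` is safe even where Symbol is not defined at all,
  // whereas a bare `Symbol` reference would throw a ReferenceError there.
  var SymbolCtor = typeof Symbol === 'undefined' ? undefined : Symbol;
  if (!hasAsyncIterator(SymbolCtor)) {
    throw new Error('This feature requires ES2018 (Symbol.asyncIterator)');
  }
}
```

Calling `requireAsyncIterator()` at the top of the async-iterable entry point would fail fast with a readable message instead of a confusing TypeError later on.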

If you do want it to work even without babel / ES2018, you can use iterall and use require('iterall').$$asyncIterator in place of Symbol.asyncIterator when defining the iterable:

https://github.com/leebyron/iterall/blob/master/index.mjs#L39
https://github.com/leebyron/iterall/blob/master/index.mjs#L84
https://github.com/leebyron/iterall/blob/master/index.mjs#L420

Then people who use iterall compatible libraries / code can still use the async iterable even if they do not have ES2018 Symbol.asyncIterator
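
The fallback-key idea can be sketched without pulling in iterall itself. This is an assumption-labeled illustration: `asyncIteratorKey`, `makeAsyncIterable` and `getAsyncIterator` are hypothetical names, and the '@@asyncIterator' string is the fallback convention as I understand iterall to use it, so check its source before relying on it.

```javascript
// Use the real symbol when available, otherwise a well-known string key
// (assumed here to match the convention used by iterall's $$asyncIterator).
var asyncIteratorKey =
  (typeof Symbol === 'function' && Symbol.asyncIterator) || '@@asyncIterator';

// Define an async iterable over an array under whichever key is in use.
function makeAsyncIterable(values) {
  var iterable = {};
  iterable[asyncIteratorKey] = function () {
    var i = 0;
    return {
      next: function () {
        return Promise.resolve(
          i < values.length
            ? { value: values[i++], done: false }
            : { value: undefined, done: true }
        );
      },
    };
  };
  return iterable;
}

// Look the iterator up through the same key; undefined if not async iterable.
function getAsyncIterator(iterable) {
  var method = iterable[asyncIteratorKey];
  return typeof method === 'function' ? method.call(iterable) : undefined;
}
```

In an ES2018 environment the key is the native symbol, so the same object also works with `for await`.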

trxcllnt commented 5 years ago

Hiya :wave:, IxJS maintainer here. I'm in the middle of implementing a streaming CSV -> apache-arrow transform, and having this in PapaParse would be fantastic for compatibility with both node and whatwg streams.

We should be able to use Ix's fromNodeStream() method in node to transform PapaParse's ReadableStreamStreamer into an AsyncIterable, or Ix's toNodeStream() method to pipe an AsyncIterable to PapaParse's DuplexStreamStreamer:

import fs from 'fs';
import Papa from 'papaparse';
import { AsyncIterable, fromNodeStream, map } from 'ix/asynciterable';

fromNodeStream(fs
    .createReadStream('cols.csv')
    .pipe(Papa.parse(Papa.NODE_STREAM_INPUT)))
  // maybe do an element-wise transform
  .pipe(map(({ colA, colB }) => `${colA + colB}\n`))
  // implicitly calls toNodeStream() when piping to a node writable stream
  .pipe(fs.createWriteStream('sums.txt'))

If PapaParse had an AsyncIterable implementation, we could also use Ix to convert into whatwg streams in the browser (via AsyncIterable#toDOMStream() and AsyncIterable.fromDOMStream()).

About Symbol.asyncIterator, http://kangax.github.io/compat-table/es2016plus/#test-Asynchronous_Iterators indicates AsyncIterable is now supported everywhere except Edge. We've been shipping Ix and Arrow without the polyfill for the last 2-3 years and haven't heard any complaints. Typically the client's Babel or Closure compilation step will include it if necessary for their target envs.

ryanking8215 commented 1 year ago

There is a simple way to do this that doesn't seem to require adding an async iterator to Papaparse.

const {pipeline} = require("node:stream/promises");
const fs = require("node:fs");
const Papa = require("papaparse");

pipeline(
        fs.createReadStream('1.csv'),
        Papa.parse(Papa.NODE_STREAM_INPUT),
        async (data) => {
                for await (const a of data) {
                        console.log(a);
                }
        }
)

It works, but it is slow.