Open jdesboeufs opened 2 years ago
Use string_decoder into PapaParse
Pros: straightforward bugfix.
Cons: will depend on a polyfill when using Node stream syntax in the browser; possibly a breaking change.
Use TextDecoder into PapaParse
Pros: future-proof and universal bugfix.
Cons: requires Node.js 8.3+, Firefox 19/20, Chrome 38+.
Throw an error when used with a stream of Buffer
=> forces the user to decode the stream on their own (add an example with iconv-lite).
Pros: keeps PapaParse simple.
Cons: breaking change if there is no deprecation.
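For the third option, a minimal sketch of what such a check could look like. The helper name assertStringChunk and the error message are hypothetical and not part of PapaParse; the only assumption is that every chunk reaching the parser should already be a decoded string.

```javascript
// Hypothetical guard, NOT part of PapaParse: reject binary chunks so that
// callers have to decode the stream themselves before handing it over.
function assertStringChunk(chunk) {
  if (typeof chunk !== 'string') {
    throw new TypeError(
      'Expected a string chunk; decode the stream (for example with iconv-lite) before parsing.'
    );
  }
  return chunk;
}
```

A check like this trades convenience for predictability: a stream of Buffer fails loudly instead of silently corrupting characters at chunk boundaries.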
I'm trying to proactively use the iconv-lite option. Can you check whether this pseudo-implementation is correct? It could also be added to the docs after clean-up. It does work, but I haven't tested all edge cases. I assume iconv-lite guarantees that multi-byte UTF-8 characters are kept together?
Also, is there a way to get the "meta" field in the streaming API? on('data') only gets you the data part of the result. See https://github.com/mholt/PapaParse/blob/1f2c7330d5f562630195c8c450e7ec9cf6233684/papaparse.js#L917-L918
I assume that's intentional?
import { createReadStream } from 'fs';
import { pipeline } from 'stream';
import { parse, NODE_STREAM_INPUT } from 'papaparse';
import { decodeStream } from 'iconv-lite';

// Any ReadStream emitting Buffer chunks; 'input.csv' is a placeholder path.
const stream = createReadStream('input.csv');
// Decode bytes to strings before they reach PapaParse.
const converterStream = decodeStream('utf8');
const csvStream = parse(NODE_STREAM_INPUT, {
  header: true,
});

csvStream.on('data', (data) => {
  console.log('do something with the data');
  // can I get the meta info here?
});
// A Node stream's 'end' event carries no arguments.
csvStream.on('end', () => {});

pipeline(stream, converterStream, csvStream, (err) => {
  console.log('stream complete', err);
});
If I may, I believe the WHATWG's TextDecoder option would be your best move here.
As already said it is future-proof, and can be polyfilled if needed.
It would also fix a bug in browsers with the chunk option: https://jsfiddle.net/3zypkqtg/ (not sure if yet another report is needed for that).
Any news on this? I tried every method to fix it, but either it doesn't work or it takes forever to read the stream. What should be done in the meantime to be able to read multi-byte UTF-8 characters when streaming to Papa?
PapaParse breaks multi-byte UTF-8 characters when they are sliced between different chunks of Buffer. For example, ç would become ��.
To reproduce:
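A minimal sketch of the failure mode, independent of PapaParse. It assumes a stream that happens to split the two UTF-8 bytes of ç across two Buffer chunks, and decodes each chunk on its own the way chunk.toString() does. The variable names are illustrative only.

```javascript
// 'ç' is encoded in UTF-8 as two bytes: 0xC3 0xA7.
const bytes = Buffer.from('ç', 'utf8');

// Simulate a stream that splits the character across two chunks.
const chunks = [bytes.subarray(0, 1), bytes.subarray(1)];

// Decoding each chunk independently turns each half into U+FFFD.
const naive = chunks.map((chunk) => chunk.toString('utf8')).join('');

console.log(JSON.stringify(naive)); // two replacement characters instead of "ç"
```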
A workaround is to ensure UTF-8 decoding with string_decoder (internal Node module), WHATWG TextDecoder, or iconv-lite (user-land dependency). But a better answer is to use string_decoder or TextDecoder inside PapaParse, in place of chunk.toString().
Related to #751