mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License

CSV parsing is broken in 5.4.1 #998

Open rpr-ableton opened 1 year ago

rpr-ableton commented 1 year ago

Hi everyone,

First of all, big thanks for this great library that's been a big help in my work 🙌

I noticed when going from 5.3.2 to 5.4.1 that some CSV files can't be parsed successfully anymore. I'll attach below a small CSV, a Node.js script to reproduce the issue, as well as the output.

example.csv:

colName_2,colName_3,another_col,entity,some_bool,period,date,value_eur
"foo, bar","foo, bar",foo_bar_baz_V1,bar baz,0,2022_02,02/01/22,1234.45678
"foo, bar","foo, bar",foo_bar_baz_V1,bar baz,0,2022_03,03/01/22,1234.45678
"foo, bar","foo, bar",foo_bar_baz_V2,bar baz,0,2022_11,11/01/22,-12345.67891
"foo, bar","foo, bar",foo_bar_baz_V2,bar baz,0,2022_12,12/01/22,-12345.67891

script.js:

const papa = require('papaparse');
const fs = require('fs');

const readStream = fs.createReadStream(`${__dirname}/example.csv`);
papa.parse(readStream, {
    header: true,
    step: (res, parser) => {
        parser.pause();
        // Log each field of the row, in sorted column order.
        Object.keys(res.data)
            .sort()
            .forEach((key) => {
                console.log(res.data[key]);
            });
        parser.resume();
    },
    error: (err) => {
        console.error(err);
    },
    complete: () => {
        console.log('finished');
    },
});

Output with papaparse 5.4.1:

$ node script.js 
foo_bar_baz_V1
foo, bar
foo, bar
02/01/22
bar baz
2022_02
0
1234.45678
[ '-12345.67891' ]
foo, bar
foo, bar
foo_1, bar"_1,foo_bar_baz_V1,bar baz,0,2022_03,03/01/22,1234.45678
"foo, bar
2022_11
foo_bar_baz_V2
0
bar baz
11/01/22
foo, bar"_1,foo_bar_baz_V2,bar baz,0,2022_12,12/01/22,-12345.67891

 bar"
finished

Let me know if you need any additional information to better capture the issue.

Thanks a bunch.

pokoli commented 1 year ago

Hi @rpr-ableton

Could you please describe what the issue is?

rpr-ableton commented 1 year ago

The symptoms are:

- some values come back wrapped in an array instead of as plain strings,
- _1 gets appended to data values, and
- rows are merged or split incorrectly.

Maybe to illustrate the point better, here's the output from the same script but using papaparse 5.3.2:

$ node script.js 
foo_bar_baz_V1
foo, bar
foo, bar
02/01/22
bar baz
2022_02
0
1234.45678
foo_bar_baz_V1
foo, bar
foo, bar
03/01/22
bar baz
2022_03
0
1234.45678
foo_bar_baz_V2
foo, bar
foo, bar
11/01/22
bar baz
2022_11
0
-12345.67891
foo_bar_baz_V2
foo, bar
foo, bar
12/01/22
bar baz
2022_12
0
-12345.67891
finished

pokoli commented 1 year ago

Hi @rpr-ableton,

We need to isolate the problems into minimal reproducible cases. I do not see any reason why a value would be converted to an array. On the other hand, the fact that _1 is appended to the value seems related to the duplicate-headers feature and should be fixed by https://github.com/mholt/PapaParse/commit/824bbd9daf17168bddfc5485066771453cab423e

I will need a minimal test scenario in order to be able to fix it.

BTW: we have included some changes in master that are not in the latest release. Could you please test with the master version?
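
For context, the duplicate-headers feature referred to here renames repeated column names when header: true so that no column is silently overwritten. A minimal sketch of the intended behaviour (the exact id_1 key is an assumption based on the _1 suffixes seen in this thread):

const papa = require('papaparse');

// Two columns share the name "id"; since 5.4.x the second one should be
// renamed (to "id_1" here -- exact suffix scheme assumed) so that
// header: true keeps both values instead of overwriting one.
const csv = 'id,id,value\n1,2,3';
const { data } = papa.parse(csv, { header: true });
console.log(data); // expected: [ { id: '1', id_1: '2', value: '3' } ]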

rpr-ableton commented 1 year ago

The issues are the same whether I'm using master or 5.4.1.

This is the minimal CSV file that reproduces all the errors I mentioned above:

colName_2,colName_3,value_eur
"foo, bar","foo, bar",1234.45678
"foo, bar","foo, bar",-12345.67891
"foo, bar","foo, bar",-12345.67891

This one reproduces both the _1 appending and the incorrect parsing (but not the unexpected array):

colName_2,colName_3,value_eur
"foo, bar","foo, bar",1234.45678
"foo, bar","foo, bar",-12345.67891

I haven't managed to produce anything smaller than this that still generates the errors. It also seems that the two errors in the second CSV are tightly connected, as I can't reproduce one without the other.

tjstavenger-pnnl commented 1 year ago

Any update on being able to address this issue? I've recently run into what sounds like a similar problem, where a CSV gets parsed with _1 appended to some of our values. The same rows with those _1 values have other columns read into the wrong fields of the JavaScript object. We're using papaparse 5.4.0. I'll see if I can downgrade to 5.3.2 in the meantime.

tjstavenger-pnnl commented 1 year ago

Follow-up -- downgrading to papaparse 5.3.2 corrected the problem.

tjstavenger-pnnl commented 1 year ago

The _1 value strikes me as similar to data issues found in #985.

raukaute commented 1 year ago

I am experiencing the same issue. It only occurs when using an async iterator on the stream; using event listeners works without issues.
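
To make the two consumption styles concrete, here is a runnable sketch (the file name is a placeholder; the bug attribution is as reported above, not verified here):

import fs from "fs"
import { NODE_STREAM_INPUT, parse } from "papaparse"

async function main() {
  // Hypothetical file; pipe it through PapaParse's Node duplex stream.
  const rows = fs.createReadStream("example.csv").pipe(
    parse(NODE_STREAM_INPUT, { header: true })
  )

  // Style 1 -- event listeners (reported to work):
  // rows.on("data", (row) => console.log(row))

  // Style 2 -- async iteration (reported to trigger the bug):
  for await (const row of rows) {
    console.log(row)
  }
}
main()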

ulo commented 1 year ago

I also hit some data-consistency issues parsing CSV files in Node using streaming mode with "header: true" to get objects. While it works fine with version 5.3.2, starting from 5.4.0 papaparse changes the content of my parsed objects.

As there is no issue with "header: false", I could narrow it down to the following commit: https://github.com/mholt/PapaParse/commit/c1cbe16636be146f5f1d321cdd7cbddc9143b045

My guess is that this code block looks for duplicate headers on each new streaming chunk, not just the first line of the first chunk, and thereby modifies not the header but actual data rows...

My solution is to stick with version 5.3.2 for now.

pokoli commented 1 year ago

@ulo What you describe should be fixed with https://github.com/mholt/PapaParse/commit/824bbd9daf17168bddfc5485066771453cab423e which is already released as 5.4.1

ulo commented 1 year ago

Thanks for the hint @pokoli! Indeed, it is apparently not an issue with chunks streaming into Papaparse. I usually create a ReadStream using fs, where the default chunk size of 64 KB is much bigger than my actual table, so only one chunk streams into Papaparse.

However, after looking more into Papaparse's source code, the issue is more likely with streaming the parsed objects out of Papaparse. I found that the central parse loop runs fine for the first 16 rows, but then gets paused. It resumes once the destination of my pipeline has consumed the first 16 objects, but upon resuming baseIndex is still 0, so the header renaming (introduced in version 5.4.0) is applied to the actual data starting from row 17. The 16 objects correspond to the default highWaterMark of 16 for object-mode streams in Node; after the first chunk of 16 objects, they seem to be processed one by one in the pipeline.

Here is a minimal working example: I build a pipeline whose first function generates one big string representing a TSV table of 20 rows, with id1 and id2 both set to the row index, then parse it into objects using papaparse. I consume the pipeline by writing each parsed object to the console:

import { Stream } from "stream"
import * as papa from "papaparse"

async function main() {
  const pipeline = Stream.pipeline(
    // Source: one big TSV string -- a header row plus 20 data rows.
    async function* () {
      yield ["id1\tid2", ...Array.from({ length: 20 }, (v, i) => `${i}\t${i}`)].join("\n")
    },
    // Transform: PapaParse's Node duplex stream, one object per row.
    papa.parse(papa.NODE_STREAM_INPUT, { delimiter: "\t", newline: "\n", header: true }),
    (err) => err && console.error(err)
  )

  for await (const entry of pipeline) {
    console.log(entry)
  }
}
main()

While the example produces the expected output in version 5.3.2, starting from version 5.4.0 the output looks like this:

{ id1: '0', id2: '0' }
{ id1: '1', id2: '1' }
{ id1: '2', id2: '2' }
{ id1: '3', id2: '3' }
{ id1: '4', id2: '4' }
{ id1: '5', id2: '5' }
{ id1: '6', id2: '6' }
{ id1: '7', id2: '7' }
{ id1: '8', id2: '8' }
{ id1: '9', id2: '9' }
{ id1: '10', id2: '10' }
{ id1: '11', id2: '11' }
{ id1: '12', id2: '12' }
{ id1: '13', id2: '13' }
{ id1: '14', id2: '14' }
{ id1: '15', id2: '15' }
{ id1: '16', id2: '16_1' }
{ id1: '', id2: '17' }
{ id1: '18', id2: '18_1' }
{ id1: '', id2: '19' }

jlebras commented 1 year ago

CSV parsing is still broken on Node.js, using v5.4.1 with header: true: wrong number of rows, invalid row objects filled with "_1", "_2", etc., and concatenated columns that should be distinct values. The CSV file used is valid and parses correctly with other CSV tools.

EDIT: The bug appears when using the stream mode like this:

import { NODE_STREAM_INPUT, parse } from 'papaparse';

// Some read stream returned from fs.createReadStream
const csvFileReadStream = ...
csvFileReadStream.pipe(parse(NODE_STREAM_INPUT, options))

Using

parse(csvFileReadStream, options)

looks to work fine (though TypeScript complains about the first argument of parse; that may be a misconfiguration on our side).
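
For completeness, a runnable sketch of that second call style (the file name is a placeholder, and it assumes that complete receives the accumulated results when no step callback is given):

import fs from 'fs';
import { parse } from 'papaparse';

// Passing the ReadStream as the first argument uses the callback-based
// streaming path rather than the NODE_STREAM_INPUT duplex stream.
const csvFileReadStream = fs.createReadStream('example.csv');
parse(csvFileReadStream, {
  header: true,
  complete: (results) => console.log(results.data),
  error: (err) => console.error(err),
});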

noahw3 commented 1 year ago

I'm experiencing exactly what @ulo described above. Streaming on 5.4.1 with header: true, the first 16 rows come back fine. The 17th row and beyond are parsed incorrectly.

Converting the stream to a string and parsing that works as expected.
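
A sketch of that workaround (hypothetical file name; buffering the whole file gives up streaming's memory benefits):

import fs from 'fs';
import Papa from 'papaparse';

// Drain the stream into one string, then parse it in one go.
const chunks = [];
for await (const chunk of fs.createReadStream('example.csv', 'utf8')) {
  chunks.push(chunk);
}
const { data } = Papa.parse(chunks.join(''), { header: true });
console.log(data);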

basememara commented 7 months ago

I tried the latest master branch as of now (https://github.com/mholt/PapaParse/commit/016effe152b218c2545ab70b818f75758cad2e3c) and the issue still persists, so I had to revert to 5.3.2 :(

andynsd commented 6 months ago

Hello - Are there any plans to fix this?

codenamezjames commented 6 months ago

I am experiencing this issue too. I had to use another library for now. Hopefully this can get fixed 🤞

ianengelbrecht commented 5 days ago

Having the same issue. step works fine if I don't pause the parser, but rows are not parsed properly with parser.pause().

Run the code below with parser.pause and parser.resume commented out, and not commented out, to compare the output.

import fs from 'fs';
import Papa from 'papaparse';

const fileName = String.raw`Theraphosidae 20241116.csv`

const fileStream = fs.createReadStream(fileName);
Papa.parse(fileStream, {
  header: true, // Convert rows to objects using the first row as header
  step: (results, parser) => {
    parser.pause()
    const row = results.data
    console.log(row.vmNumber)
    parser.resume()
  },
  complete: () => {
    console.log('all done')
  },
  error: (error) => console.log(error),
});

Theraphosidae 20241116.csv