Open rpr-ableton opened 1 year ago
Hi @rpr-ableton
Could you please describe what the issue is?
The symptoms are: some values are unexpectedly converted to an Array, and others get _1 appended to them for no apparent reason. Maybe to illustrate the point better, here's the output from the same script but using papaparse 5.3.2:
$ node script.js
foo_bar_baz_V1
foo, bar
foo, bar
02/01/22
bar baz
2022_02
0
1234.45678
foo_bar_baz_V1
foo, bar
foo, bar
03/01/22
bar baz
2022_03
0
1234.45678
foo_bar_baz_V2
foo, bar
foo, bar
11/01/22
bar baz
2022_11
0
-12345.67891
foo_bar_baz_V2
foo, bar
foo, bar
12/01/22
bar baz
2022_12
0
-12345.67891
finished
Hi @rpr-ableton,
We need to isolate the problems with minimal reproducible cases.
I do not see any reason why a value would be converted to an Array. On the other hand, the fact that _1 is appended to the value seems related to the duplicate-headers feature and should be fixed by https://github.com/mholt/PapaParse/commit/824bbd9daf17168bddfc5485066771453cab423e
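For context, the duplicate-headers handling introduced in 5.4.0 renames repeated column names by appending _1, _2, and so on. A minimal sketch of that intended behaviour on a plain string input (not the streaming bug itself) might look like:

import Papa from "papaparse";

// With header: true, a repeated header name is deduplicated by appending _1, _2, ...
const { data, meta } = Papa.parse("name,name,value\n1,2,3", { header: true });
console.log(meta.fields); // expected: [ 'name', 'name_1', 'value' ]
console.log(data[0]);     // expected: { name: '1', name_1: '2', value: '3' }

The reports in this thread are about that renaming leaking into actual data rows when the parser pauses mid-stream.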
I will need a minimal test scenario in order to be able to fix it.
BTW: We have included some changes in the master branch which are not in the latest release. Could you please test with the master version?
The issues are the same whether I'm using master or 5.4.1.
This is the minimum CSV file to reproduce all errors I mentioned above:
colName_2,colName_3,value_eur
"foo, bar","foo, bar",1234.45678
"foo, bar","foo, bar",-12345.67891
"foo, bar","foo, bar",-12345.67891
This one reproduces both the _1 appending and the incorrect parsing (but not the unexpected array one):
colName_2,colName_3,value_eur
"foo, bar","foo, bar",1234.45678
"foo, bar","foo, bar",-12345.67891
I haven't managed to produce anything smaller than this that still generates the errors. It also seems that the two errors reproduced by the second CSV are tightly connected, as I can't reproduce one without the other.
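Since the original script.js isn't quoted above, the following is only a rough guess at a comparable repro: stream the three-row CSV (saved here as example.csv, a placeholder name) with header: true and print each parsed row. Whether it actually triggers the bug likely depends on how the input is chunked and whether the parser pauses:

import fs from "fs";
import Papa from "papaparse";

// Stream the CSV through PapaParse row by row and print each parsed object.
const stream = fs.createReadStream("example.csv");
Papa.parse(stream, {
  header: true,
  step: (results) => console.log(results.data),
  complete: () => console.log("finished"),
});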
Any update on being able to address this issue? I've recently run into what sounds like a similar problem, where a CSV gets parsed with _1 as the value for some of our fields. The same rows with those _1 values have other columns that are read into the wrong fields of the JavaScript object. We're using papaparse 5.4.0. I'll see if I can downgrade to 5.3.2 in the meantime.
Follow-up -- downgrading to papaparse 5.3.2 corrected the problem.
The _1 value strikes me as similar to the data issues found in #985.
I am experiencing the same issue. It only occurs when using an async iterator on the stream; using listeners works without issues.
I also got some data-consistency issues when parsing CSV files in Node using streaming mode with "header: true" to get objects. While it works fine with version 5.3.2, starting from 5.4.0 papaparse would change the content of my parsed objects.
As there is no issue when using "header: false", I could narrow it down to the following commit: https://github.com/mholt/PapaParse/commit/c1cbe16636be146f5f1d321cdd7cbddc9143b045
My guess is that this code block looks for duplicate headers on each new streaming chunk, not just on the first line of the first chunk, and thereby modifies actual data rows rather than the header...
My solution is to stick with version 5.3.2 for now.
@ulo What you describe should be fixed by https://github.com/mholt/PapaParse/commit/824bbd9daf17168bddfc5485066771453cab423e, which has already been released as 5.4.1.
Thanks for the hint @pokoli !
Indeed, it is apparently not an issue of chunks streaming into Papaparse. I usually create a ReadStream using fs, and there the default chunk size of 64 KB is much bigger than my actual table, so only one chunk streams into Papaparse.
However, after looking more into Papaparse's source code, the issue is more likely with streaming the parsed objects out of Papaparse. I could figure out that the central parse loop runs fine for the first 16 rows, but then gets paused. It resumes once the destination of my pipeline has consumed the first 16 objects, but upon resuming baseIndex is still 0, and therefore the header renaming (which was introduced in version 5.4.0) is applied to the actual data starting from row 17.
The 16 objects correspond to the default highWaterMark of 16 for object streams in Node. After the first chunk of 16 objects, it seems that they are processed one by one in the pipeline.
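To make the 16-object boundary concrete: Node object-mode streams buffer up to highWaterMark objects (16 by default) before push() signals backpressure, which lines up with the parser pausing right after row 16. A small standalone sketch of that stream behaviour:

import { Readable } from "stream";

// In object mode, push() returns false once the internal buffer reaches the
// default highWaterMark of 16 objects, signalling the producer to pause.
const readable = new Readable({ objectMode: true, read() {} });
let accepted = 0;
while (readable.push({ n: accepted })) accepted++;
console.log(readable.readableLength); // typically 16 with the default highWaterMark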
Here is a minimal working example, where I build a pipeline with a first function that generates one big string representing a TSV table of 20 rows, with id1 and id2 both set to the row index, and then parse it into objects using papaparse. I consume the pipeline by writing each parsed object to the console:
import { Stream } from "stream"
import * as papa from "papaparse"

async function main() {
  const pipeline = Stream.pipeline(
    async function* () {
      yield ["id1\tid2", ...Array.from({ length: 20 }, (v, i) => `${i}\t${i}`)].join("\n")
    },
    papa.parse(papa.NODE_STREAM_INPUT, { delimiter: "\t", newline: "\n", header: true }),
    (err) => err && console.error(err)
  )

  for await (const entry of pipeline) {
    console.log(entry)
  }
}

main()
While the example produces the expected output in version 5.3.2, starting from version 5.4.0 the output looks like this:
{ id1: '0', id2: '0' }
{ id1: '1', id2: '1' }
{ id1: '2', id2: '2' }
{ id1: '3', id2: '3' }
{ id1: '4', id2: '4' }
{ id1: '5', id2: '5' }
{ id1: '6', id2: '6' }
{ id1: '7', id2: '7' }
{ id1: '8', id2: '8' }
{ id1: '9', id2: '9' }
{ id1: '10', id2: '10' }
{ id1: '11', id2: '11' }
{ id1: '12', id2: '12' }
{ id1: '13', id2: '13' }
{ id1: '14', id2: '14' }
{ id1: '15', id2: '15' }
{ id1: '16', id2: '16_1' }
{ id1: '', id2: '17' }
{ id1: '18', id2: '18_1' }
{ id1: '', id2: '19' }
CSV parsing is still broken on Node.js, using v5.4.1 with header: true: wrong number of rows, invalid row objects filled with "_1", "_2", etc., and concatenated columns that should be distinct values. The CSV file used is valid and is parsed correctly by other CSV tools.
EDIT: The bug appears when using the stream mode like this:
import { NODE_STREAM_INPUT, parse } from 'papaparse';

// Some read stream returned from fs.createReadStream
const csvFileReadStream = ...
csvFileReadStream.pipe(parse(NODE_STREAM_INPUT, options))
Using parse(csvFileReadStream, options) looks to work fine (though TypeScript complains about the 1st arg of the parse function, but that's maybe some misconfiguration on our side).
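For reference, a rough sketch of that second form in plain JavaScript (the file name is just a placeholder): the stream is handed to parse() directly instead of being piped through the NODE_STREAM_INPUT duplex, and rows arrive via the step callback.

import fs from 'fs';
import { parse } from 'papaparse';

// Pass the readable stream straight to parse(); no pipe() involved.
const csvFileReadStream = fs.createReadStream('data.csv');
parse(csvFileReadStream, {
  header: true,
  step: (results) => console.log(results.data),
  complete: () => console.log('done'),
});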
I'm experiencing exactly what @ulo described above. Streaming on 5.4.1 with header: true, the first 16 rows come back fine; the 17th row and beyond are parsed incorrectly.
Converting the stream to a string and parsing that works as expected.
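In other words, a workaround along these lines (the file name is a placeholder) avoids the streaming path entirely by reading the file into one string and parsing that synchronously:

import fs from "fs";
import Papa from "papaparse";

// Read the whole file up front so PapaParse never pauses and resumes mid-parse.
const csv = fs.readFileSync("example.csv", "utf8");
const { data, errors } = Papa.parse(csv, { header: true });
console.log(errors.length ? errors : data.length);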
I tried the latest master branch as of now (https://github.com/mholt/PapaParse/commit/016effe152b218c2545ab70b818f75758cad2e3c) and the issue still persists; I had to revert back to 5.3.2 :(
Hello - Are there any plans to fix this?
I am experiencing this issue too. I had to use another lib for now. Hopefully this can get fixed 🤞
Having the same issue. step works fine if I don't pause the parser, but rows are not parsed properly with parser.pause().
Run the code below with parser.pause and parser.resume commented out, and then not commented out, to compare the output.
import fs from 'fs';
import Papa from 'papaparse';

const fileName = String.raw`Theraphosidae 20241116.csv`
const fileStream = fs.createReadStream(fileName);

Papa.parse(fileStream, {
  header: true, // Convert rows to objects using the first row as header
  step: (results, parser) => {
    parser.pause()
    const row = results.data
    console.log(row.vmNumber)
    parser.resume()
  },
  complete: () => {
    console.log('all done')
  },
  error: (error) => console.log(error),
});
Hi everyone,
First of all, big thanks for this great library that's been a big help in my work :raised_hands:
I noticed when going from 5.3.2 to 5.4.1 that some CSV files can't be parsed successfully anymore. I'll attach below a small CSV, a Node.js script to reproduce the issue, as well as the output.
example.csv:
script.js:
Output with papaparse 5.4.1:
Let me know if you need any additional information to better capture the issue.
Thanks a bunch.