obsius closed this issue 1 year ago
This also breaks the buffered partial line when chunking, because the input is modified: lastIndex relies on a cursor position relative to the aggregate input, but when that input is modified, the cursor becomes misaligned and the next chunk will not receive the correct partial line.
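To make the misalignment concrete, here is a minimal sketch (illustrative only, not PapaParse's actual code) of how an inflated cursor drops characters from the carried-over partial line:

```javascript
// A chunk arrives containing one full row and the start of the next:
const input = 'abcd,abcd\nabcd,abcd';

// The parser reads the first row, renames its duplicate value to
// 'abcd,abcd_1', and advances its cursor past what it emitted --
// 12 characters -- instead of the 10 characters it actually consumed:
const cursor = 'abcd,abcd_1\n'.length; // 12 instead of 10

// The partial line carried into the next chunk is taken from the
// unmodified input with the inflated cursor, losing two characters:
const partial = input.substring(cursor);
console.log(partial); // 'cd,abcd' instead of 'abcd,abcd'
```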
Could you please explain how to reproduce your issue?
Yes, thanks for supporting this library. This issue will occur when the following conditions are met:
ChunkStreamer (through ParserHandler.parse) ultimately calls Parser.parse each time it receives the next chunk from a stream reader.
I'm guessing that whoever authored the dupe header commit did not realize that parse is called multiple times when parsing in chunk mode. This leads to the unintended consequence of altering data rows, and when that occurs, it also drops the first few characters of the next chunk's first row.
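For context, chunked parsing has roughly this shape (a simplified sketch, not PapaParse internals): the same parse routine runs once per chunk, with the trailing partial line buffered between calls.

```javascript
// Hypothetical chunk streamer: complete lines are parsed per chunk,
// the trailing partial line is carried forward to the next chunk.
function streamParse(chunks, parseChunk) {
  let partial = '';
  for (const chunk of chunks) {
    const aggregate = partial + chunk;
    const lastNewline = aggregate.lastIndexOf('\n');
    parseChunk(aggregate.slice(0, lastNewline + 1)); // complete lines only
    partial = aggregate.slice(lastNewline + 1);      // carried forward
  }
  if (partial) parseChunk(partial); // flush the final line
}

const calls = [];
streamParse(['h0,h1\nab', 'cd,abcd\n'], (text) => calls.push(text));
console.log(calls); // [ 'h0,h1\n', 'abcd,abcd\n' ]
```

Any per-parse side effect (like header deduping) therefore fires once per chunk, not once per file.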
Could you prepare a small CSV for testing such a scenario? I imagine a CSV file with 10 or fewer rows that should be parsed with a step of 1 or so. That will ease the creation of a test scenario which raises the issue and make it easier to produce the fix.
source
const fs = require('fs');
const Papa = require('papaparse');
let i = 0;
Papa.parse(fs.createReadStream('./data.csv', { highWaterMark: 32 }), {
  header: true,
  chunk: (result) => {
    console.log(`chunk[${i++}][0]:`, result.data[0]);
  }
});
data.csv
h0,h1
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd
output
chunk[0][0]: { h0: 'abcd', h1: 'abcd' }
chunk[1][0]: { h0: 'abcd', h1: 'abcd_1' }
chunk[2][0]: { h0: 'cd', h1: 'abcd' }
chunk[3][0]: { h0: 'abcd', h1: 'abcd_1' }
I have noticed this bug as well in v5.4.0. We didn't have this in 5.3.2.
Going back to 5.3.0 fixed it for us.
I'm wondering if we should disable header renaming when running in chunks
Going back to 5.3.0 fixed it for us.
same
Wow - this is a bit alarming. We must always have confidence that the data integrity is preserved with this tool as it always has been - how soon can a fix be implemented?
@pokoli I think the quickest and easiest way to fix this is to remove the deduping header functionality. This feature has been reported as not working correctly (#982), so I don't see any reason to keep this feature in its current form.
I cannot reproduce your issue. I'm testing with this minimal test, which works for me:
{
  description: "Data is correctly parsed with chunks and duplicated headers",
  expected: [{h0: 'a', h1: 'b'}, {h0: 'a', h1: 'b'}],
  run: function(callback) {
    var data = [];
    Papa.parse('h0,h1\na,b\na,b\na,b', {
      header: true,
      chunkSize: 10,
      chunk: function(results) {
        data.push(results.data[0]);
      },
      complete: function() {
        callback(data);
      }
    });
  }
},
Adding it to the custom tests and running the test suite works for me. Could you please provide a minimal test case so I can fix it? Thanks
Wow - this is a bit alarming. We must always have confidence that the data integrity is preserved with this tool as it always has been - how soon can a fix be implemented?
@tfrancois Have you ever considered contributing to the maintenance of the library? If not, please do not complain.
@pokoli I included a working example of this issue above, but here is another example using a string input instead of a read stream:
source
const Papa = require('papaparse');
let source = 'h0,h1\nabcd,abcd\nabcd,abcd\nabcd,abcd\nabcd,abcd';
let output = [];
Papa.parse(source, {
  header: true,
  chunkSize: 16,
  chunk: (result) => {
    output.push(...result.data);
  }
});
console.log(source);
console.log('----------');
console.log(Papa.unparse(output));
output
h0,h1
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd
----------
h0,h1
abcd,abcd
abcd,abcd_1
cd,abcd
abcd,abcd
You can read my comments above for a more detailed explanation, but the principal cause of this bug is duplicate column data within a row.
@pokoli another way to reproduce this bug is to use the normal.csv file from the papaparse demo as input and set a step function like this:
let source = input.files[0]; // set normal.csv using <input type="file" />
Papa.parse(source, {
  header: true,
  step: (results, parser) => { console.log(results.data); }
});
With 5.3.0 it will work perfectly fine, while with 5.4.0 it will append _1, _2, _3,...
to duplicate values in every row (not just headers).
@pokoli also your test data in the comment above is incorrect for this case; it should be 'h0,h1\na,a\nb,b\nc,c' with expected values [{h0: 'a', h1: 'a'}, {h0: 'b', h1: 'b'}, {h0: 'c', h1: 'c'}], while the actual values in 5.4.0 will be [{h0: 'a', h1: 'a_1'}, {h0: 'b', h1: 'b_1'}, {h0: 'c', h1: 'c_1'}], which is not what a user would expect to get.
So... it looks like this issue should be reopened 😉
@speqtr thanks to your valuable input I managed to reproduce the issue and I uploaded a fix.
Can somebody test the master branch with a real file to ensure everything works, so I can publish a new release containing the fix? Thanks in advance
@pokoli That fix will not work for the other effect of this bug (see my second comment in this issue). By appending to the input string (in the case of duplicate headers), the partial chunk string index becomes misaligned:
source
const Papa = require('papaparse');
let source = 'h0,h1,h1\nabcd,abcd,abcd';
let output = [];
Papa.parse(source, {
  header: true,
  chunkSize: 8,
  chunk: (result) => {
    output.push(...result.data);
  }
});
console.log(source);
console.log('----------');
console.log(Papa.unparse(output));
output
h0,h1,h1
abcd,abcd,abcd
----------
h0,h1,h1_1
cd,abcd,abcd
Notice the first two characters of the line after the header have been lost.
@pokoli please provide a working example. Otherwise I cannot reproduce nor fix it. Thanks
@pokoli A full working example of this issue is in my previous comment, and an explanation of why this is occurring (with relevant source) is at the top of this thread.
When reading a stream, the header-deduping code is executed on every chunk. It takes the first data row of each chunk (when in header mode) and dedupes it as if it were a header row. This corrupts the data.
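As an illustration (a hypothetical dedupe pass, simplified from what 5.4.0 appears to do), a rename that is correct for a header row corrupts any data row it is applied to:

```javascript
// Hypothetical header-dedupe pass: appends _1, _2, ... to repeated
// values in a row, which is the right behaviour for header names only.
function dedupeRow(row) {
  const seen = {};
  return row.map((value) => {
    if (seen[value] === undefined) {
      seen[value] = 0;
      return value;
    }
    seen[value] += 1;
    return `${value}_${seen[value]}`;
  });
}

// Correct on a real header row:
console.log(dedupeRow(['h0', 'h1', 'h1'])); // ['h0', 'h1', 'h1_1']

// But because parse() runs once per chunk, the first *data* row of every
// chunk after the first is also passed through this pass:
console.log(dedupeRow(['abcd', 'abcd'])); // ['abcd', 'abcd_1']
```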