mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License
12.47k stars 1.14k forks

5.4.0 dupe header renaming breaks chunking #980

Closed obsius closed 1 year ago

obsius commented 1 year ago

When reading a stream, the following code is executed on each chunk:

[screenshot of the relevant source code]

This takes the first data row of each chunk (when in header mode) and dedupes it as if it were a header row, corrupting the data.
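The mechanism can be sketched in isolation (the function below is illustrative, not PapaParse's actual internals; it just mimics the suffix-renaming behavior that 5.4.0 applies to duplicate header fields):

```javascript
// Hypothetical sketch of 5.4.0-style header deduplication: any repeated
// field gets a numeric suffix appended.
function renameDuplicates(fields) {
    const seen = Object.create(null);
    return fields.map((f) => {
        if (!(f in seen)) {
            seen[f] = 0;
            return f;
        }
        seen[f] += 1;
        return `${f}_${seen[f]}`;
    });
}

// Applied once to the real header row, this is the intended behavior:
console.log(renameDuplicates(['h0', 'h1', 'h1'])); // [ 'h0', 'h1', 'h1_1' ]

// But because parse() runs once per chunk, the first data row of every
// subsequent chunk is treated as a header row and mangled the same way:
console.log(renameDuplicates(['abcd', 'abcd'])); // [ 'abcd', 'abcd_1' ]
```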

obsius commented 1 year ago

Because the dedupe code modifies input, it also breaks the buffered partial line when chunking:

[screenshot of the relevant source code]

lastIndex relies on a cursor position relative to aggregate, but when aggregate (the input) is modified, the cursor becomes misaligned and the next chunk does not receive the correct partial line:

[screenshot of the relevant source code]
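The misalignment can be sketched minimally (illustrative variables, not PapaParse's actual source):

```javascript
// The streamer buffers raw chunks in `aggregate` and uses the parser's
// reported cursor to retain the trailing partial row for the next chunk.
let aggregate = 'h0,h1,h1\nabcd,ab'; // raw data buffered so far

// The dedupe step hands the parser a *modified* copy ('h1' -> 'h1_1'),
// so the cursor it reports is measured against a string two chars longer:
let cursor = 'h0,h1,h1_1\n'.length; // 11

// Slicing the unmodified aggregate with that cursor drops two characters
// ('ab') from the start of the retained partial row:
console.log(aggregate.substring(cursor)); // 'cd,ab' instead of 'abcd,ab'
```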

pokoli commented 1 year ago

Could you please explain how to reproduce your issue?

obsius commented 1 year ago

Yes, thanks for supporting this library. This issue will occur when the following conditions are met:

ChunkStreamer (through ParserHandle.parse) ultimately calls Parser.parse each time it receives the next chunk from a stream reader:

[screenshot of the relevant source code]

I'm guessing that whoever authored the dupe-header commit did not realize that parse is called multiple times when parsing in chunk mode. This has the unintended consequence of altering data rows and, when it occurs, also omits some characters from the first row of the next chunk.

pokoli commented 1 year ago

Could you prepare a small CSV for testing such a scenario? I imagine a CSV file with 10 or fewer rows that should be parsed with step as 1 or so. That will ease the creation of a test scenario which raises the issue and make it easier to produce the fix.

obsius commented 1 year ago

source

const fs = require('fs');
const Papa = require('papaparse');

let i = 0;

// a small highWaterMark forces the stream to deliver many small chunks
Papa.parse(fs.createReadStream('./data.csv', { highWaterMark: 32 }), {
    header: true,
    chunk: (result) => {
        console.log(`chunk[${i++}][0]:`, result.data[0]);
    }
});

data.csv

h0,h1
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd

output

chunk[0][0]: { h0: 'abcd', h1: 'abcd' }
chunk[1][0]: { h0: 'abcd', h1: 'abcd_1' }
chunk[2][0]: { h0: 'cd', h1: 'abcd' }
chunk[3][0]: { h0: 'abcd', h1: 'abcd_1' }
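For reference, a quick sketch of where the chunk boundaries land (assuming, for illustration, that the stream delivers exact 32-byte chunks; the real stream may slice differently, but any boundary here falls mid-row):

```javascript
// Reconstruct data.csv and print where 32-byte chunk boundaries fall.
// Every boundary lands in the middle of a row, so each later chunk starts
// with the completion of a buffered partial data row, which the 5.4.0
// dedupe logic then renames as if it were a header.
const data = 'h0,h1\n' + 'abcd,abcd\n'.repeat(7);
for (let i = 0; i < data.length; i += 32) {
    console.log(JSON.stringify(data.slice(i, i + 32)));
}
```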
dashmug commented 1 year ago

I have noticed this bug as well in v5.4.0. We didn't have it in 5.3.2.

sangamk commented 1 year ago

Going back to 5.3.0 fixed it for us.

pokoli commented 1 year ago

I'm wondering if we should disable header renaming when running in chunks

yuriirud1 commented 1 year ago

Going back to 5.3.0 fixed it for us.

same

tfrancois commented 1 year ago

Wow - this is a bit alarming. We must always have confidence that the data integrity is preserved with this tool as it always has been - how soon can a fix be implemented?

obsius commented 1 year ago

@pokoli I think the quickest and easiest way to fix this is to remove the deduping header functionality. This feature has been reported as not working correctly (#982), so I don't see any reason to keep this feature in its current form.

pokoli commented 1 year ago

I cannot reproduce your issue. I'm testing with this minimal test, which passes for me:

    {                                                                          
        description: "Data is correctly parsed with chunks and duplicated headers",
        expected: [{h0: 'a', h1: 'b'}, {h0: 'a', h1: 'b'}],                    
        run: function(callback) {                                              
            var data = [];                                                     
            Papa.parse('h0,h1\na,b\na,b\na,b', {                               
                header: true,                                                  
                chunkSize: 10,                                                 
                chunk: function(results) {                                     
                    data.push(results.data[0]);                                
                },                                                             
                complete: function() {                                         
                    callback(data);                                            
                }                                                              
            });                                                                
        }                                                                      
    },  

Adding it to the custom tests and running the test suite works for me. Could you please provide a minimal test case so I can fix it? Thanks

pokoli commented 1 year ago

Wow - this is a bit alarming. We must always have confidence that the data integrity is preserved with this tool as it always has been - how soon can a fix be implemented?

@tfrancois Have you ever considered contributing to the maintenance of the library? If not, please do not complain.

obsius commented 1 year ago

@pokoli I included a working example of this issue above, but here is another example using a string input instead of a read stream:

source

const Papa = require('papaparse');

let source = 'h0,h1\nabcd,abcd\nabcd,abcd\nabcd,abcd\nabcd,abcd';
let output = [];

Papa.parse(source, {
    header: true,
    chunkSize: 16,
    chunk: (result) => {
        output.push(...result.data);
    }
});

console.log(source);
console.log('----------');
console.log(Papa.unparse(output));

output

h0,h1
abcd,abcd
abcd,abcd
abcd,abcd
abcd,abcd
----------
h0,h1
abcd,abcd
abcd,abcd_1
cd,abcd
abcd,abcd

You can read my comments above for a more detailed explanation, but the principal cause of this bug is duplicate column data within a row.

speqtr commented 1 year ago

@pokoli another way to reproduce this bug is to use normal.csv file from papaparse demo as an input and set a step function like this:

let source = input.files[0] // set normal.csv using <input type="file" />

Papa.parse(source, {
    header: true,
    step: (results, parser) => { console.log(results.data) }
});

With 5.3.0 it will work perfectly fine, while with 5.4.0 it will append _1, _2, _3,... to duplicate values in every row (not just headers).

@pokoli also, your test data in the comment above is incorrect for this case: it should be 'h0,h1\na,a\nb,b\nc,c' with expected values [{h0: 'a', h1: 'a'}, {h0: 'b', h1: 'b'}, {h0: 'c', h1: 'c'}], while the actual values in 5.4.0 are [{h0: 'a', h1: 'a_1'}, {h0: 'b', h1: 'b_1'}, {h0: 'c', h1: 'c_1'}], which is not what a user would expect.
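In the repo's custom-test format, a corrected case might look like this (a sketch only; the description text is mine, and it assumes the fix is in place):

```javascript
    {
        description: "Duplicate values in data rows survive chunked parsing",
        expected: [{h0: 'a', h1: 'a'}, {h0: 'b', h1: 'b'}, {h0: 'c', h1: 'c'}],
        run: function(callback) {
            var data = [];
            Papa.parse('h0,h1\na,a\nb,b\nc,c', {
                header: true,
                chunkSize: 10,
                chunk: function(results) {
                    // collect every row, not just the first of each chunk
                    data.push.apply(data, results.data);
                },
                complete: function() {
                    callback(data);
                }
            });
        }
    },
```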

speqtr commented 1 year ago

So... it looks like this issue should be reopened 😉

pokoli commented 1 year ago

@speqtr thanks to your valuable input I managed to reproduce the issue and I uploaded a fix.

Can somebody test on master branch with a named file to ensure everything works so I can publish a new release containing the fix? Thanks in advance

obsius commented 1 year ago

@pokoli That fix will not work for the other effect of this bug (see my second comment in this issue). By appending to the input string (in the case of duplicate headers), the partial chunk string index becomes misaligned:

source

const Papa = require('papaparse');

let source = 'h0,h1,h1\nabcd,abcd,abcd';
let output = [];

Papa.parse(source, {
    header: true,
    chunkSize: 8,
    chunk: (result) => {
        output.push(...result.data);
    }
});

console.log(source);
console.log('----------');
console.log(Papa.unparse(output));

output

h0,h1,h1
abcd,abcd,abcd
----------
h0,h1,h1_1
cd,abcd,abcd

Notice the first two characters of the line after the header have been lost.

pokoli commented 1 year ago

@obsius please provide a working example. Otherwise I can not reproduce nor fix it. Thanks

obsius commented 1 year ago

@pokoli A full working example of this issue is in my previous comment, and an explanation of why this is occurring (with relevant source) is at the top of this thread.