mafintosh / csv-parser

Streaming csv parser inspired by binary-csv that aims to be faster than everyone else
MIT License

csv-parser does not parse a big CSV file correctly; after ~95K rows it begins merging all rows into a single JSON object. #207

Open iliaivanov2016 opened 2 years ago

iliaivanov2016 commented 2 years ago

Expected Behavior

167K rows parsed

Actual Behavior

95K rows parsed

How Do We Reproduce?

https://edbq.xyz/test/Freight3.csv
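
A minimal sketch of one way to reproduce, counting the rows csv-parser emits from a local copy of the file (the local file name and the counting code are assumptions, not part of the original report):

// Sketch: count rows emitted by csv-parser from a local copy of Freight3.csv
// (local file name assumed); ~167K rows are expected, but only ~95K are reported
const fs = require('fs')
const csv = require('csv-parser')

let rows = 0
fs.createReadStream('Freight3.csv')
    .pipe(csv())
    .on('data', () => { rows++ })
    .on('end', () => console.log(`parsed ${rows} rows`))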

danneu commented 2 years ago

I'm seeing something like this too with the authors dump on https://openlibrary.org/developers/dumps.

Replacing csv-parser with csv-stream, with no changes to the data or options, fixes the issue.
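
Roughly, the swap looks like this (a sketch only, assuming csv-stream's createStream API; its option names differ from csv-parser's, and the exact options used are not reproduced here):

// Sketch of the csv-stream swap (assumes csv-stream's createStream() API;
// note its option names differ from csv-parser's: delimiter/columns
// instead of separator/headers)
const fs = require('fs')
const zlib = require('zlib')
const csvStream = require('csv-stream')

const parser = csvStream.createStream({
    delimiter: '\t',
    columns: ['type', 'key', 'revision', 'last_modified', 'json'],
})

fs.createReadStream('ol_dump_authors_latest.txt.gz')
    .pipe(zlib.createGunzip())
    .pipe(parser)
    .on('data', (row) => {
        // row is an object keyed by the column names above
    })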

However, I don't think it's failing after N rows. Rather, there seems to be a bug in quote/end-of-line detection: the parser will produce a row that contains hundreds of concatenated rows in its final column, go back to parsing rows correctly, then produce another long concatenated row, back and forth.

This code will demonstrate the issue on https://openlibrary.org/data/ol_dump_authors_latest.txt.gz (0.4GB):

// stream.pipeline() wires the streams together and returns the last one
// (the csv-parser transform), so parsed rows can be read from `pipe` below
const pipe = require('stream').pipeline(
    require('fs').createReadStream('ol_dump_authors_latest.txt.gz'),
    require('zlib').createGunzip(),
    require('csv-parser')({
        headers: ['type', 'key', 'revision', 'last_modified', 'json'],
        separator: '\t',
    }),
    (err) => err ? console.error(err) : console.log('done')
)

let seen = 0

pipe.on('data', (row) => {
    seen++
    // detect long row
    if (row.json.length > 10000) {
        console.log(seen, row)
    }
})

This reveals many problem rows where the following rows have been concatenated into the final column, for example:

[40430] {
  type: '/type/author',
  key: '/authors/OL5247858A',
  revision: '1',
  last_modified: '2008-09-28T05:16:27.104438',
  json: '{"name": "Kommunisticheskaya partiya Armenii. S\\"ezd", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:16:27.104438"}, "key": "/a/OL5247858A", "type": {"key": "/type/author"}, "id": 26329826, "revision": 1}\n' +
    '/type/author\t/authors/OL5247929A\t1\t2008-09-28T05:17:19.811748\t{"name": "Archibald Gray", "personal_name": "Archibald Gray", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:17:19.811748"}, "key": "/a/OL5247929A", "type": {"key": "/type/author"}, "id": 26330110, "revision": 1}\n' +
    '/type/author\t/authors/OL5248963A\t1\t2008-09-28T05:39:41.512087\t{"name": "GREAT BRITAIN.  ROYAL COMMISSION ON LABOUR IN INDIA", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:39:41.512087"}, "key": "/a/OL5248963A", "type": {"key": "/type/author"}, "id": 26336569, "revision": 1}\n' +
  '/type/au'... 710973 more characters

I notice that it happens on any row that has an escaped quote \" like in the example above. It looks like the parser starts concatenating rows when it sees the first \" and finishes concatenating at the next row that contains a \".

Perhaps { escape: '\\' } just needs to be passed to the parser, but I would have thought that the default of escape: '"' would handle backslash escapes between quotes.
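
For reference, passing that option would look roughly like this (a sketch only; the thread doesn't confirm it resolves the issue):

// Sketch: pass an explicit escape character to csv-parser
// (not confirmed in this thread to fix the concatenation)
const csv = require('csv-parser')

const parser = csv({
    headers: ['type', 'key', 'revision', 'last_modified', 'json'],
    separator: '\t',
    escape: '\\', // treat backslash as the escape character instead of the default '"'
})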

mjpowersjr commented 11 months ago

I also hit this bug, somewhere around line 2.7M in the following data set:

https://ridb.recreation.gov/downloads/reservations2022.zip

Switching to papaparse worked on the same file.
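
A sketch of what that swap looks like with Papa Parse's Node stream input (the local file name and options are assumptions; the comment doesn't say which were used):

// Sketch of the Papa Parse swap using its Node stream input
// (local file name and options assumed, not taken from the comment)
const fs = require('fs')
const Papa = require('papaparse')

let rows = 0
fs.createReadStream('reservations2022.csv')
    .pipe(Papa.parse(Papa.NODE_STREAM_INPUT, { header: true }))
    .on('data', (row) => { rows++ })
    .on('end', () => console.log(`parsed ${rows} rows`))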