iliaivanov2016 opened this issue 2 years ago
I'm seeing something like this too with the authors dump on https://openlibrary.org/developers/dumps.
Replacing csv-parser with csv-stream, with no changes to the data or options, fixes the issue (a sketch of the swap follows this paragraph).
However, I don't think it's failing after N rows. Rather, there seems to be a bug in quote/end-of-line detection: the parser emits a row whose final column contains hundreds of concatenated rows, goes back to parsing rows correctly for a while, then emits another long concatenated row, back and forth.
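For reference, the csv-stream version of the pipeline looks roughly like this. This is a minimal sketch, assuming csv-stream's documented createStream options (columns, delimiter); the file name matches the repro below:
const fs = require('fs')
const zlib = require('zlib')
const csv = require('csv-stream')

// Same tab-separated columns as the csv-parser repro below
const parser = csv.createStream({
  delimiter: '\t',
  columns: ['type', 'key', 'revision', 'last_modified', 'json'],
})

fs.createReadStream('ol_dump_authors_latest.txt.gz')
  .pipe(zlib.createGunzip())
  .pipe(parser)
  .on('data', (row) => { /* each row arrives with the five named columns */ })
  .on('end', () => console.log('done'))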
This code will demonstrate the issue on https://openlibrary.org/data/ol_dump_authors_latest.txt.gz (0.4GB):
// stream.pipeline() wires the streams together and returns the last one
// (the csv parser), so row events can be consumed below
const pipe = require('stream').pipeline(
  require('fs').createReadStream('ol_dump_authors_latest.txt.gz'),
  require('zlib').createGunzip(),
  require('csv-parser')({
    headers: ['type', 'key', 'revision', 'last_modified', 'json'],
    separator: '\t',
  }),
  (err) => err ? console.error(err) : console.log('done')
)
let seen = 0
pipe.on('data', (row) => {
seen++
// detect long row
if (row.json.length > 10000) {
console.log(seen, row)
}
})
This code reveals many problem rows where following rows have been accidentally concatenated into the final column:
[40430] {
type: '/type/author',
key: '/authors/OL5247858A',
revision: '1',
last_modified: '2008-09-28T05:16:27.104438',
json: '{"name": "Kommunisticheskaya partiya Armenii. S\\"ezd", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:16:27.104438"}, "key": "/a/OL5247858A", "type": {"key": "/type/author"}, "id": 26329826, "revision": 1}\n' +
'/type/author\t/authors/OL5247929A\t1\t2008-09-28T05:17:19.811748\t{"name": "Archibald Gray", "personal_name": "Archibald Gray", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:17:19.811748"}, "key": "/a/OL5247929A", "type": {"key": "/type/author"}, "id": 26330110, "revision": 1}\n' +
'/type/author\t/authors/OL5248963A\t1\t2008-09-28T05:39:41.512087\t{"name": "GREAT BRITAIN. ROYAL COMMISSION ON LABOUR IN INDIA", "last_modified": {"type": "/type/datetime", "value": "2008-09-28 05:39:41.512087"}, "key": "/a/OL5248963A", "type": {"key": "/type/author"}, "id": 26336569, "revision": 1}\n' +
'/type/au'... 710973 more characters
I notice that it happens on any row that has an escaped quote \" like in the example above. It looks like the parser starts concatenating rows when it sees the first \" and finishes concatenating at the next row that contains a \".
Perhaps { escape: '\\' } just needs to be passed to the parser, but I would have thought that the default of escape: '"' would handle backslash escapes between quotes.
I also hit this bug, somewhere around line 2.7M in the following data set:
https://ridb.recreation.gov/downloads/reservations2022.zip
Switching to papaparse worked on the same file.
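For comparison, this is roughly the papaparse equivalent. A minimal sketch, assuming Papa Parse's support for Node readable streams with step/complete callbacks; 'reservations2022.csv' is a placeholder name for the CSV extracted from the zip above:
const fs = require('fs')
const Papa = require('papaparse')

Papa.parse(fs.createReadStream('reservations2022.csv'), {
  header: true,
  step: (result) => {
    // result.data is one parsed row object
  },
  complete: () => console.log('done'),
})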
Expected Behavior
167K rows parsed
Actual Behavior
95K rows parsed
How Do We Reproduce?
https://edbq.xyz/test/Freight3.csv
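A minimal counting sketch against that file (csv-parser with default options; 'Freight3.csv' is the download from the link above):
const fs = require('fs')
const csv = require('csv-parser')

let rows = 0
fs.createReadStream('Freight3.csv')
  .pipe(csv())
  .on('data', () => rows++)
  // expected 167K rows; only ~95K are emitted before parsing goes wrong
  .on('end', () => console.log(rows, 'rows parsed'))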