mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License

How to handle values starting with double quotes in papaparse #1057

Closed 4integration closed 1 month ago

4integration commented 3 months ago

I'm using the latest Papaparse to parse large CSV files. It handles double quotes inside a value, but not when the value starts with a double quote.

Using this code:

const parsePromise = new Promise<void>((resolve, reject) => {
    Papa.parse<Equipment>(fileStream, {
        header: true,
        delimiter: "\t",
        dynamicTyping: true,
        skipEmptyLines: true,
        step: (result) => {
            const rowData = {
                vehicle_id: result.data.vehicle_id,
                schema_id: result.data.schema_id,
                option_id: result.data.option_id,
                record_id: result.data.record_id,
                location: result.data.location,
                data_value: result.data.data_value,
                condition: result.data.condition,
            };
            entities.push(rowData);
            console.log(rowData)
        },
        complete: () => resolve(),
        error: (error) => reject(error),
    });
});

If I have the following csv data:

vehicle_id  schema_id   option_id   record_id   location    data_value  condition
425972620240523 15102   1266    7700    W   "Första hjälpen"- förbandslåda med varningstriangel, 2 varselvästar 
425972620240523 15104   1266    7700    W   W   
425972620240523 15101   1266    7800    INT S   
425972620240523 15102   1266    7800    INT medical kit, warning triangle, 2 safety vests   
425972620240523 15104   1266    7800    INT INT 
425972620240523 15101   1267    7900    W   S   
425972620240523 15102   1267    7900    W   Papperskorg (borttagbar)    

It outputs

{
  vehicle_id: 425972620240523,
  schema_id: 15102,
  option_id: 1266,
  record_id: 7700,
  location: 'W',
  data_value: 'Första hjälpen"- förbandslåda med varningstriangel, 2 varselvästar\t\r\n' +
    '425972620240523\t15104\t1266\t7700\tW\tW\t\r\n' +
    '425972620240523\t15101\t1266\t7800\tINT\tS\t\r\n' +
    '425972620240523\t15102\t1266\t7800\tINT\tmedical kit, warning triangle, 2 safety vests\t\r\n' +
    '425972620240523\t15104\t1266\t7800\tINT\tINT\t\r\n' +
    '425972620240523\t15101\t1267\t7900\tW\tS\t\r\n' +
    '425972620240523\t15102\t1267\t7900\tW\tPapperskorg (borttagbar)\t\r\n',
  condition: undefined
}

If I move the first double quote as in:

vehicle_id  schema_id   option_id   record_id   location    data_value  condition
425972620240523 15102   1266    7700    W   Första "hjälpen"- förbandslåda med varningstriangel, 2 varselvästar 
425972620240523 15104   1266    7700    W   W   
425972620240523 15101   1266    7800    INT S   
425972620240523 15102   1266    7800    INT medical kit, warning triangle, 2 safety vests   
425972620240523 15104   1266    7800    INT INT 
425972620240523 15101   1267    7900    W   S   
425972620240523 15102   1267    7900    W   Papperskorg (borttagbar)    

The result is correct:


{
  vehicle_id: 425972620240523,
  schema_id: 15102,
  option_id: 1266,
  record_id: 7700,
  location: 'W',
  data_value: 'Första "hjälpen"- förbandslåda med varningstriangel, 2 varselvästar',
  condition: null
}
{
  vehicle_id: 425972620240523,
  schema_id: 15104,
  option_id: 1266,
  record_id: 7700,
  location: 'W',
  data_value: 'W',
  condition: null
}
....

How can Papaparse handle values starting with a double quote?

janisdd commented 1 month ago

There is a little-known setting, "quoteChar", whose default value is ". It is normally used to quote fields that contain the separator (here, a tab).

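So papaparse treats the field as quoted because it starts with the quoteChar, and takes "Första hjälpen" to be the whole field. The closing quote is followed by - rather than the delimiter, so the field is malformed: papaparse reports an error and keeps reading in search of a valid closing quote, which is why the remaining rows end up inside data_value.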

Let m be your data, and pick a quoteChar that does not occur in it (here #):

let m = `...`
Papa.parse(m, { delimiter: '\t', quoteChar: '#' })

Tested on https://www.papaparse.com/demo today
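
The workaround only holds if the replacement quoteChar never appears in the data. As a rough sketch of how that could be checked before parsing: pickUnusedQuoteChar and buildTsvOptions below are hypothetical helpers, not part of Papaparse, and the candidate characters are an assumption. The remaining options are copied from the question.

```typescript
// Hypothetical helper (not part of Papaparse): return the first candidate
// character that never occurs in the input, so it is safe to use as quoteChar.
function pickUnusedQuoteChar(
    text: string,
    candidates: string[] = ["#", "\u0001", "\u0002"],
): string {
    for (const candidate of candidates) {
        if (!text.includes(candidate)) {
            return candidate;
        }
    }
    throw new Error("Every candidate quote character occurs in the data");
}

// Hypothetical helper: build parse options for a tab-separated file.
// Only delimiter and quoteChar matter for this issue.
function buildTsvOptions(text: string) {
    return {
        header: true,
        delimiter: "\t",
        quoteChar: pickUnusedQuoteChar(text),
        dynamicTyping: true,
        skipEmptyLines: true,
    };
}

const sample = 'location\tdata_value\nW\t"Första hjälpen"- förbandslåda\n';
const options = buildTsvOptions(sample);
// With the papaparse package loaded as Papa, this would be passed as:
// Papa.parse(sample, options)
```

When parsing from a stream, as in the question, the full text is not available up front, so the check would have to run on a sample of the file or rely on knowing that the chosen character cannot occur in the export.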