mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License

Quoted header values containing commas and comprised of the same string aren't able to be parsed. #1052

Open alliefitter opened 2 months ago

alliefitter commented 2 months ago

This is a pretty weird corner case, so let me know if y'all need more detail. Given a sheet with headers named "Bar, Baz" and "Spam, Baz", after splitting the header row on commas, Papa will treat Baz" as a duplicate header and append _1 to the second instance of it in headerMap. Then, while seemingly attempting to remediate duplicates, the second header value becomes "Spam, Baz"_1, which seems to break parsing of the fields later on. The following script...

import Papa from 'papaparse'

console.log(Papa.parse('Foo,"Bar, Baz","Spam, Baz",Some,Other,Headers\n1,2,3,4,5,6', { header: true }))

... will print...

{
  data: [],
  errors: [
    {
      type: 'Quotes',
      code: 'InvalidQuotes',
      message: 'Trailing quote on quoted field is malformed',
      row: 0,
      index: 16
    },
    {
      type: 'Quotes',
      code: 'MissingQuotes',
      message: 'Quoted field unterminated',
      row: 0,
      index: 16
    }
  ],
  meta: {
    delimiter: ',',
    linebreak: '\n',
    aborted: false,
    truncated: false,
    cursor: 57,
    fields: [
      'Foo',
      'Bar, Baz',
      'Spam, Baz"_1,Some,Other,Headers\n1,2,3,4,5,6'
    ]
  }
}
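
The behaviour described in the report can be reproduced without the library. What follows is only a minimal sketch of that description, not Papa Parse's actual source: the helper name naiveRenameDuplicateHeaders is hypothetical, and it assumes the duplicate check splits the header row on the delimiter without regard to quotes, suffixes repeated fragments with _N, and writes the result back into the input.

```javascript
// Hypothetical helper illustrating the reported behaviour.
// This is NOT Papa Parse's implementation.
function naiveRenameDuplicateHeaders(input, delimiter = ',', newline = '\n') {
  const lines = input.split(newline);
  // Quote-unaware split: '"Bar, Baz"' becomes the fragments '"Bar' and ' Baz"'.
  const fragments = lines[0].split(delimiter);
  const seen = new Map();
  const renamed = fragments.map((fragment) => {
    const count = seen.get(fragment) || 0;
    seen.set(fragment, count + 1);
    // The first occurrence is kept; later ones get a _N suffix.
    return count === 0 ? fragment : `${fragment}_${count}`;
  });
  // The edited header row is written back, so the suffix lands
  // after the closing quote of the second quoted header.
  lines[0] = renamed.join(delimiter);
  return lines.join(newline);
}

const csv = 'Foo,"Bar, Baz","Spam, Baz",Some,Other,Headers\n1,2,3,4,5,6';
const rewritten = naiveRenameDuplicateHeaders(csv);
console.log(rewritten.split('\n')[0]);
// Foo,"Bar, Baz","Spam, Baz"_1,Some,Other,Headers
```

A quote-aware parser reading that rewritten row sees text after the closing quote of "Spam, Baz", which is consistent with the InvalidQuotes and MissingQuotes errors and the runaway third field shown above.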

I was going to submit a PR, but the code is a bit difficult to follow. If this will take some time for y'all to get to, just comment here, and I can spend some time on a PR.

tony-cocco commented 1 month ago

This describes my issue as well. I can reproduce it on the demo site. Note that header: true must be set.

My test snippet:

"Enum 1 (A, B, Other)","Enum 2 (C, D, Other)"
A,Other

Adjusting the second instance of Other to anything else resolves the errors listed.

FilippoSalvarani21 commented 1 month ago

I think I have the same issue https://github.com/mholt/PapaParse/issues/1055

any way to fix?

tony-cocco commented 1 month ago

We control the template we're parsing, so we just removed the duplicate strings in the headers. Alternatively, you could set header: false, but that might change the logic for the rest of your processing.
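
If header: false is used, the rows come back as arrays and the objects have to be built by hand. Here is a rough sketch of that step, under some assumptions: rowsToObjects is a hypothetical helper, not part of Papa Parse, and the rows literal is written out by hand in the shape that Papa.parse(csv, { header: false }).data is expected to have for the test snippet above, so the example runs without the library.

```javascript
// Hypothetical helper, not part of Papa Parse: treats the first row as
// header names and turns every following row into an object.
function rowsToObjects(rows) {
  const [headers, ...records] = rows;
  return records.map((record) =>
    Object.fromEntries(headers.map((name, i) => [name, record[i]]))
  );
}

// Assumed shape of Papa.parse(csv, { header: false }).data for the
// two-line snippet in this thread (hand-written here, not parser output).
const rows = [
  ['Enum 1 (A, B, Other)', 'Enum 2 (C, D, Other)'],
  ['A', 'Other'],
];

const objects = rowsToObjects(rows);
console.log(objects);
```

Note that this sketch does nothing about genuinely duplicated header names: if two columns share a name, the later column overwrites the earlier one in each object.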

FilippoSalvarani21 commented 1 month ago

I don't understand why this is happening; it should take the value from pipe to pipe regardless of mismatched quotation marks. I need headers, so I cannot disable them, but this looks like a basic feature. Maybe I can escape the " somehow?