vincentlaucsb / csv-parser

A high-performance, fully-featured CSV parser and serializer for modern C++.
MIT License
901 stars 150 forks

trim() does not play nicely with quoted fields #166

Closed: Jibbow closed this issue 3 years ago

Jibbow commented 3 years ago

When a quoted field has leading whitespace and the trim() option is enabled for whitespace characters, the opening quote of the quoted field is included in the field's content. This is easiest to demonstrate with an example:

Assume we have the following format config:

CSVFormat format;
format.delimiter(',').quote('"').header_row(0).trim({' ', '\t'});

And we want to parse the following CSV file:

column1, column2
"value1",         "value2"

The first value in column1 is parsed correctly as value1. However, the value in column2 is parsed as "value2 (note the opening quote).

vincentlaucsb commented 3 years ago

"value1", "value2" is not valid CSV per RFC 4180, which states:

Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:

   "aaa","bbb","ccc" CRLF
   zzz,yyy,xxx

The second field in the CSV you posted is not a valid quoted field. Because it begins with whitespace, it is not enclosed in double quotes, yet it still contains double quotes.
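The rule above can be sketched as a small standalone check. This is only an illustration of a strict reading of RFC 4180, not code from csv-parser; `FieldKind`, `classify_field`, and `check_classify_examples` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not part of csv-parser: classifies one raw field
// (the text between two delimiters) under a strict reading of RFC 4180.
enum class FieldKind { Unquoted, Quoted, Invalid };

FieldKind classify_field(std::string_view raw) {
    // A field counts as quoted only if its very first character is the quote.
    if (!raw.empty() && raw.front() == '"') {
        // The closing quote must be the last character.
        if (raw.size() < 2 || raw.back() != '"') return FieldKind::Invalid;
        std::string_view body = raw.substr(1, raw.size() - 2);
        for (std::size_t i = 0; i < body.size(); ++i) {
            if (body[i] == '"') {
                // An embedded quote must be escaped by a second quote.
                if (i + 1 >= body.size() || body[i + 1] != '"') {
                    return FieldKind::Invalid;
                }
                ++i;  // skip the second quote of the escaped pair
            }
        }
        return FieldKind::Quoted;
    }
    // An unquoted field may not contain any double quotes.
    return raw.find('"') == std::string_view::npos ? FieldKind::Unquoted
                                                   : FieldKind::Invalid;
}

// Usage: the two fields from the data row in the report.
void check_classify_examples() {
    assert(classify_field(R"("value1")") == FieldKind::Quoted);
    // Leading whitespace means the field is unquoted, so its quotes are invalid.
    assert(classify_field(R"(     "value2")") == FieldKind::Invalid);
}
```

Under this reading, the leading whitespace is what makes the second field invalid: the quote is no longer the first character, so the field is treated as unquoted text that happens to contain quotes.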

Technically there is no such thing as valid CSV, since RFC 4180 is merely a suggestion, but this parser mainly targets RFC 4180-ish CSV files. I don't see a trivial way to modify the parser to accommodate this deviation from RFC 4180.

As a general rule, I do not expect the parser to handle significant deviations from RFC 4180.
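For input like the file above, one possible workaround (not proposed in this thread, and not something csv-parser does itself) is to normalize each line before handing it to the parser, dropping spaces and tabs that sit between a delimiter and an opening quote. A minimal sketch, assuming fields do not span multiple lines; `strip_space_before_quotes` and `check_strip_examples` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical pre-processing step, not part of csv-parser: removes runs of
// spaces/tabs at the start of a field when they are followed by an opening
// quote. Whitespace inside quoted fields is left untouched.
std::string strip_space_before_quotes(std::string_view line,
                                      char delim = ',', char quote = '"') {
    std::string out;
    out.reserve(line.size());
    bool in_quotes = false;       // currently inside a quoted field
    bool at_field_start = true;   // no field content emitted yet
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            out.push_back(c);
            if (c == quote) {
                if (i + 1 < line.size() && line[i + 1] == quote) {
                    out.push_back(quote);  // escaped quote: copy both
                    ++i;
                } else {
                    in_quotes = false;     // closing quote
                }
            }
            continue;
        }
        if (at_field_start && (c == ' ' || c == '\t')) {
            // Look ahead past the whole run of whitespace.
            std::size_t j = i;
            while (j < line.size() && (line[j] == ' ' || line[j] == '\t')) ++j;
            if (j < line.size() && line[j] == quote) {
                i = j - 1;  // drop the run; the quote is handled next
                continue;
            }
            // Not followed by a quote: keep it and leave it to trim().
            at_field_start = false;
            out.push_back(c);
            continue;
        }
        out.push_back(c);
        if (c == quote && at_field_start) in_quotes = true;
        at_field_start = (c == delim);
    }
    return out;
}

// Usage: the data row from the report becomes RFC 4180-style input.
void check_strip_examples() {
    assert(strip_space_before_quotes(R"("value1",     "value2")") ==
           R"("value1","value2")");
}
```

The sketch only handles leading whitespace; whitespace between a closing quote and the next delimiter would need a similar pass.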

Jibbow commented 3 years ago

Makes sense, thanks! :)