ruby / csv

CSV Reading and Writing
https://ruby.github.io/csv/
BSD 2-Clause "Simplified" License
182 stars 114 forks source link

liberal parsing does not split column as expected with quote at end of column #291

Closed mreinsch closed 1 year ago

mreinsch commented 1 year ago

Thanks for all the effort with making parsing even broken CSVs quite reliable.

I've now encountered a specific case when using liberal parsing. The original source is actually using tabs to separate the columns (which aren't allowed in the data), and for some reason decided to quote each value. Anyway, the issue also happens with comma separated values.

Here's some examples to illustrate the issue:

First, this works as expected, the parser splits this into 3 values:

CSV.parse_line(%Q{"size: 2" ",'time: 3'',"test"}, liberal_parsing: true)
=> ["\"size: 2\" \"", "'time: 3''", "test"]

Now if we remove the trailing space in the first value, this becomes a single value:

CSV.parse_line(%Q{"size: 2"",'time: 3'',"test"}, liberal_parsing: true)
=> ["\"size: 2\",'time: 3'',\"test\""]

I've tried debugging this a bit further, and I can see different behaviour in the parse_quoted_column_value method, but I'm not very familiar with the code base, so things are slow. Any pointers you might have are welcome. Also whether you'd consider this a case that should be handled of if there's any concerns here?

kou commented 1 year ago

"" is processed as escaped " like "\"" in Ruby's String. This is a common convention in CSV. See also RFC 4180: https://datatracker.ietf.org/doc/html/rfc4180#section-2

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"
mreinsch commented 1 year ago

@kou thanks for looking into this. Makes sense.

I suppose a way forward could be to provide an option to tell the parser that no escaping is being used in the data. I could look into that if that's something you'd be open to add.

Though I'm actually thinking that in my case it'd be easier to implement a TSV parser as tabs aren't allowed in values... but it'd be nice to provide a common way to deal with such cases.

kou commented 1 year ago

Can we use col_sep: "\t"?

mreinsch commented 1 year ago

@kou not sure what you mean, the problem is the same with col_sep: "\t". In that case the \t becomes part of the string.

kou commented 1 year ago

I thought that your original data uses \t as a separator and all columns don't include \t:

The original source is actually using tabs to separate the columns (which aren't allowed in the data)

mreinsch commented 1 year ago

Yes, that's correct. But as CSV parser is converting "" to ", the same issue happens:

CSV.parse_line(%Q{"size: 2" "\t'time: 3''\t"test"}, col_sep: "\t", liberal_parsing: true)
=> ["\"size: 2\" \"", "'time: 3''", "test"]

CSV.parse_line(%Q{"size: 2""\t'time: 3''\t"test"}, col_sep: "\t", liberal_parsing: true)
=> ["\"size: 2\"\t'time: 3''\t\"test\""]

Are you suggesting to adjust the CSV parser to behave differently if tab is used as col_sep?

As I mentioned, I could probably just split the string on "\t" and hence not use the CSV parser at all / implement a simplified TSV parser. I was more wondering if you'd consider adjusting the CSV parser to handle this.

kou commented 1 year ago

Ah, you just used CSV not TSV as an example to show your use case, right?

for some reason decided to quote each value

Is it your choice for parsing? Or is it decided by a person who created the original source? (It seems that the first and third column are quoted by " and the second column is quoted ' not ". Is it intentional?)

If it's your choice, how about just stopping it and using quote_char: nil?

pp CSV.parse_line(%Q{size: 2" \ttime: 3'\ttest}, col_sep: "\t", quote_char: nil)
# ["size: 2\" ", "time: 3'", "test"]
pp CSV.parse_line(%Q{size: 2"\ttime: 3'\ttest}, col_sep: "\t", quote_char: nil)
# ["size: 2\"", "time: 3'", "test"]
mreinsch commented 1 year ago

@kou right, I suppose I used the quote_char so I don't need to strip the quotation marks afterwards, but that should be easy to do. Thanks for your help!