Closed mreinsch closed 1 year ago
""
is processed as escaped "
like "\""
in Ruby's String
.
This is a common convention in CSV. See also RFC 4180: https://datatracker.ietf.org/doc/html/rfc4180#section-2
7. If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
@kou thanks for looking into this. Makes sense.
I suppose a way forward could be to provide an option to tell the parser that no escaping is being used in the data. I could look into that if that's something you'd be open to add.
Though I'm actually thinking that in my case it'd be easier to implement a TSV parser as tabs aren't allowed in values... but it'd be nice to provide a common way to deal with such cases.
Can we use col_sep: "\t"
?
@kou not sure what you mean, the problem is the same with col_sep: "\t"
. In that case the \t becomes part of the string.
I thought that your original data uses \t
as a separator and all columns don't include \t
:
The original source is actually using tabs to separate the columns (which aren't allowed in the data)
Yes, that's correct. But as CSV parser is converting "" to ", the same issue happens:
CSV.parse_line(%Q{"size: 2" "\t'time: 3''\t"test"}, col_sep: "\t", liberal_parsing: true)
=> ["\"size: 2\" \"", "'time: 3''", "test"]
CSV.parse_line(%Q{"size: 2""\t'time: 3''\t"test"}, col_sep: "\t", liberal_parsing: true)
=> ["\"size: 2\"\t'time: 3''\t\"test\""]
Are you suggesting to adjust the CSV parser to behave differently if tab is used as col_sep?
As I mentioned, I could probably just split the string on "\t" and hence not use the CSV parser at all / implement a simplified TSV parser. I was more wondering if you'd consider adjusting the CSV parser to handle this.
Ah, you just used CSV not TSV as an example to show your use case, right?
for some reason decided to quote each value
Is it your choice for parsing? Or is it decided by a person who created the original source?
(It seems that the first and third column are quoted by "
and the second column is quoted '
not "
. Is it intentional?)
If it's your choice, how about just stopping it and using quote_char: nil
?
pp CSV.parse_line(%Q{size: 2" \ttime: 3'\ttest}, col_sep: "\t", quote_char: nil)
# ["size: 2\" ", "time: 3'", "test"]
pp CSV.parse_line(%Q{size: 2"\ttime: 3'\ttest}, col_sep: "\t", quote_char: nil)
# ["size: 2\"", "time: 3'", "test"]
@kou right, I suppose I used the quote_char so I don't need to strip the quotation marks afterwards, but that should be easy to do. Thanks for your help!
Thanks for all the effort with making parsing even broken CSVs quite reliable.
I've now encountered a specific case when using liberal parsing. The original source is actually using tabs to separate the columns (which aren't allowed in the data), and for some reason decided to quote each value. Anyway, the issue also happens with comma separated values.
Here's some examples to illustrate the issue:
First, this works as expected, the parser splits this into 3 values:
Now if we remove the trailing space in the first value, this becomes a single value:
I've tried debugging this a bit further, and I can see different behaviour in the
parse_quoted_column_value
method, but I'm not very familiar with the code base, so things are slow. Any pointers you might have are welcome. Also whether you'd consider this a case that should be handled of if there's any concerns here?