tilo / smarter_csv

Ruby Gem for convenient reading and writing of CSV files. It has intelligent defaults, and auto-discovery of column and row separators. It imports CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, kicking-off batch jobs with Sidekiq, parallel processing, or oploading data to S3. Writing CSV Files is equally easy.
MIT License
1.46k stars 190 forks source link

v1.8.5 introduces a behavior change when handling trailing backslashes in quoted columns #252

Closed averycrespi-moz closed 11 months ago

averycrespi-moz commented 11 months ago

In version 1.8.5, SmarterCSV introduced logic to parse \" like an escaped double quote.

The justification for this change was that "some tools will generate an escaped value of \" instead of """.

However, this change breaks RFC4180-compliant CSV files which have trailing backslashes in quoted columns.

For example, trying to parse this CSV file:

first,second
"foo \",bar

with this code:

require 'smarter_csv'

SmarterCSV.process('test.csv') do |chunk|
  chunk.map do |row|
    puts row
  end
end

raises the following error:

/Users/averycrespi/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/smarter_csv-1.9.0/lib/smarter_csv.rb:74:in `readline': end of file reached (EOFError)
        from /Users/averycrespi/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/smarter_csv-1.9.0/lib/smarter_csv.rb:74:in `process'
        from repro.rb:3:in `<main>'

Can you please consider either reverting this change, or adding a new option to control this behaviour?

Let me know if you have any questions, or if I've overlooked something in the docs.

Thanks!


System information:

tilo commented 11 months ago

@averycrespi-moz can you please provide a small sample file for testing?

averycrespi-moz commented 11 months ago

Sure, here's a simple file adapted from one of your existing test fixtures: quoted_trailing_backlash.csv

tilo commented 11 months ago

RFC4180 does not specifically talk about the use of escaped quote characters\", and also does not talk about if escaped quotes can be un-balanced like in your test file. ChatGPT agrees.

When doing the change in version 1.8.5 it was a conscious decision to only allow balanced quotes - in particular: balanced escaped quotes.

Data line 1 and line 2 do not have balanced quotes, and are not considered valid.

However the line 3,"Billy \\",Bob should be valid, and read as first name Billy \

averycrespi-moz commented 11 months ago

RFC4180 does not specifically talk about the use of escaped quote characters\", and also does not talk about if escaped quotes can be un-balanced like in your test file. ChatGPT agrees.

You're right, RFC4180 doesn't talk about using backslashes to escape quotes. The issue is that those backslashes should be interpreted literally (NOT as an escape character) according to RFC4180, but some CSV parsers choose to extend RFC4180 by redefining a backslash as an escape character.

When doing the change in version 1.8.5 it was a conscious decision to only allow balanced quotes - in particular: balanced escaped quotes.

Allowing only balanced escaped quotes makes sense. However, the issue here is whether or not a backslash should be treated as an escape character when :quote_char is set to " in the SmarterCSV options.

Data line 1 and line 2 do not have balanced quotes, and are not considered valid.

This is only true if the backslash is treated an as escape character. If the backslash is a literal, then 1,"John\",Doe is balanced.

TL;DR: RFC4180 only defines the double quote as an escape character, but some CSV parsers also defines the backslash as an escape character. As a user, it would be nice to be able to configure whether or not a backslash is treated as an escape character.

E.g. if there was a :escape_backslash option:

Thanks for your time, and hopefully that makes sense!

tilo commented 11 months ago

the issue here is whether or not a backslash should be treated as an escape character when :quote_char is set to " in the SmarterCSV options.

@averycrespi-moz The escape character \ is the standard escpape character used on UNIX systems, like LINUX, MacOS, and others, as well as on Windows systems. It is also used by many programming languages. It is built-in to those systems and languages.

I made the decision to keep with requiring balanced quotes, and will fix the issue with \\ at the end of a line. This will be fixed in version 1.9.2. The README will also call out these assumptions / limitations.

tilo commented 11 months ago

Fix for fields ending in // was released in version 1.9.2