tilo / smarter_csv

Ruby Gem for convenient reading and writing of CSV files. It has intelligent defaults, and auto-discovery of column and row separators. It imports CSV Files as Array(s) of Hashes, suitable for direct processing with ActiveRecord, kicking-off batch jobs with Sidekiq, parallel processing, or oploading data to S3. Writing CSV Files is equally easy.
MIT License
1.46k stars 189 forks source link

"Wrong" parsing #4

Closed heldersilva closed 11 years ago

heldersilva commented 11 years ago

I think the line 72 of smarter_csv.rb is wrong. If you have a separator inside a quoted text it will split the text incorrectly.

Maybe I'm seeing this wrong, but check the example below and see if what I'm saying makes any sense.

value, text, other_value 1,"sffdfd, sdfdfsf",3

the resulting Hash is going to be:

value => 1, text =>"\"sffdfd\"", other_value => "\"sdfdfsf\""

You can use this test using the default options.

francof commented 11 years ago

I encountered the same issue

frankhinek commented 11 years ago

Same issue here as well. smarter_csv breaks on any fields that contain commas that are within quotes in a given field.

tilo commented 11 years ago

yes, that is a bug! thank you for letting me know!

bradly-swart commented 11 years ago

Just ran into this problem now, tried various combinations of formatting the csv file and changing the field separator for example :col_sep => "\t"

Any fix or patch available ?

tilo commented 11 years ago

I'm looking into it.

tilo commented 11 years ago

Note to self:

quote_char = '"' => """ get_quoted_strings = %r{#{quotechar}((?:.|.)?)#{quotechar}} => "((?:.|.)?)"

quoted_strings = line.scan(get_quoted_strings).flatten

heldersilva commented 11 years ago

Tilo, I don't know if it helps, but I think looking to the LibreOffice source code might help you come up with a regular expression to deal with this.

I'll also try to look into it, but I have limited time at the moment =/

activestylus commented 11 years ago

I was having so much fun until I hit this wall. Pretty nasty bug! Any progress?

tilo commented 11 years ago

yes,i made some progress ... I'll update this weekend.

http://tools.ietf.org/html/rfc4180

digerata commented 11 years ago

Hate to rush you :)

digerata commented 11 years ago

Sent you a PR on that commit...

tilo commented 11 years ago

@digerata unfortunately the fix does only work if all fields are quoted - but as per CSV specification, fields can be mixed - some quoted, some not quoted - this makes it a bit more difficult

digerata commented 11 years ago

Maybe that is useful but I'm not smart enough to figure out why. How old is that spec? What modern app does that these days? I can tell you right now that I'm seeing commas in quoted fields 100% more than I'm seeing mixed quoted fields.

You might as well include the patch until you figure out a solution to the spec because right now SmarterCSV does not work for a very common use case.

On May 17, 2013, at 4:04 AM, Tilo notifications@github.com wrote:

@digerata unfortunately the fix does only work if all fields are quoted - but as per CSV specification, fields can be mixed - some quoted, some not quoted - this makes it a bit more difficult

— Reply to this email directly or view it on GitHub.

tilo commented 11 years ago

apparently some versions of Excel quote only fields which need to be quoted -- and leaves other fields unquoted.

I agree with you. Let's just add your fix for now, and deal with the other case later.

tilo commented 11 years ago

fixed in 1.0.6

heldersilva commented 11 years ago

Thanks Tilo =D