haskell-hvr / cassava

A CSV parsing and encoding library optimized for ease of use and high performance
http://hackage.haskell.org/package/cassava
BSD 3-Clause "New" or "Revised" License

Fixes handling of double quotes for unescaped fields with tests #182

Open robwithhair opened 4 years ago

robwithhair commented 4 years ago

Related to issue #98.

This fixes an oddity seen in some Oracle / MS Excel CSV exports, where a field is 'unescaped' (not wrapped in double quotes) but still contains double quote characters. Such files violate RFC 4180, but there is no reason why we cannot parse them correctly. I would argue that it is more correct to parse the field without error than to terminate the field early, which splits it in two and produces a confusing error message.

Example of such a field, unescaped: BSV "Frohsinn" Mehr-Ork-Gest e
The same field, escaped: "BSV ""Frohsinn"" Mehr-Ork-Gest e"

Both forms can appear in CSV files. Because the first character of the first form is not a double quote, the field is treated as unquoted, but it can still be parsed. Previously, parsing failed with a confusing error message because the field was split in two at the double quote character.
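To make the rule concrete, here is a minimal sketch in plain Haskell of the lenient behaviour described above. It is not the code from this pull request and does not use cassava's internals; the names `splitRecord`, `field` and `escaped` are made up for illustration. It assumes a single comma-delimited record with no trailing newline.

```haskell
-- Hypothetical sketch, not cassava's parser: split one CSV record into fields.
-- A field is "escaped" only if its FIRST character is a double quote;
-- otherwise any double quotes inside it are kept as ordinary characters.
splitRecord :: String -> [String]
splitRecord s = case field s of
  (f, ',' : rest) -> f : splitRecord rest  -- more fields follow the comma
  (f, _)          -> [f]                   -- end of record

-- Parse one field and return it together with the unconsumed input.
field :: String -> (String, String)
field ('"' : rest) = escaped rest
field s            = break (== ',') s      -- unescaped: read up to the next comma

-- Body of an escaped field (the opening quote is already consumed).
escaped :: String -> (String, String)
escaped ('"' : '"' : rest) = let (f, r) = escaped rest in ('"' : f, r)  -- "" is a literal quote
escaped ('"' : rest)       = ("", dropWhile (/= ',') rest)  -- closing quote; this sketch drops anything before the next comma
escaped (c : rest)         = let (f, r) = escaped rest in (c : f, r)
escaped []                 = ("", [])      -- unterminated quote: keep what was read
```

Under these assumptions, the unescaped and the escaped form of the example field both yield the same field value, and neither is split at the inner double quotes.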

I have run the tests and all appear to pass. I have also added a new test for this functionality.

Although all tests pass, existing code that depends on the error could be affected, though I believe this is unlikely. It would be wise to test with a variety of known-good CSV files.

johannesgerer commented 4 years ago

Yes, this would be helpful! Currently, I have to maintain a cassava fork because of this.

andreasabel commented 2 years ago

Frankly, I am a bit uneasy with the change request, because the package prominently states that it implements RFC 4180: https://github.com/haskell-hvr/cassava/blob/5a410c9b423da4e51e591cca6571bed536aa9ca5/cassava.cabal#L7-L9

If we deviate from RFC 4180, it should only be under a flag. (Alternatively, the faithful implementation could be under a flag rfc4180 or strict-rfc-4180, with the more liberal/practical implementation as the default.)

@robwithhair @johannesgerer What do you think?

j-rockel commented 1 year ago

What is the status here? I'd also really appreciate having this as an option!