haskell-hvr / cassava

A CSV parsing and encoding library optimized for ease of use and high performance
http://hackage.haskell.org/package/cassava
BSD 3-Clause "New" or "Revised" License

Isn't csv a text-based rather than binary(bytestring)-based format? #202

Open tysonzero opened 2 years ago

tysonzero commented 2 years ago

The spec appears to only mention text, and the specific binary encoding / charset of that text seems out of scope.

Accordingly, it seems to me that cassava should generally be dealing with Text instead of ByteString, perhaps with a Data.Csv.Utf8 module for directly treating ByteString values as UTF-8-encoded text.

jchia commented 1 year ago

An ASCII delimiter in the undecoded ByteString corresponds to a delimiter in the corresponding UTF-8-decoded Text, because UTF-8 never uses ASCII byte values inside a multi-byte sequence. So under UTF-8 encoding there is no risk of mistaking part of a character for a delimiter.
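A minimal sketch of that point, using only the text and bytestring packages (not cassava itself): splitting the raw bytes on the comma byte and then decoding each piece should give the same fields as decoding first and then splitting on the comma character. The function names `splitThenDecode` and `decodeThenSplit` are made up for illustration, and the example deliberately ignores CSV quoting.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Split the raw bytes on the ASCII comma (0x2C), then decode each piece as UTF-8.
splitThenDecode :: B.ByteString -> [T.Text]
splitThenDecode = map TE.decodeUtf8 . B.split 0x2C

-- Decode the whole input as UTF-8 first, then split on the ',' character.
decodeThenSplit :: B.ByteString -> [T.Text]
decodeThenSplit = T.splitOn "," . TE.decodeUtf8

main :: IO ()
main = do
  -- A row with multi-byte characters in every field.
  let row = TE.encodeUtf8 "naïve,日本語,€5"
  -- Both orders of operation agree for valid UTF-8 input.
  print (splitThenDecode row == decodeThenSplit row)
```

The same comparison would not be safe for encodings such as UTF-16, where the byte 0x2C can occur as half of an unrelated code unit.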

However, the user is forced to use UTF-8 if there are Text/ShortText/Char fields (cassava assumes UTF-8). If he wants to use another text encoding, he needs to use ByteString fields and do the ByteString-Text conversion separately. Alternatively, he can perform transcoding between UTF-8 and the other text encoding, using UTF-8-encoded ByteStrings when interfacing with cassava.
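The transcoding alternative might look roughly like the sketch below, assuming the input is Latin-1 (ISO-8859-1). It uses `decodeLatin1` and `encodeUtf8` from the text package and cassava's `decode`; the helper names `latin1ToUtf8` and `decodeLatin1Csv` are hypothetical, and other encodings would need a separate library for the decoding step.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Vector as V
import Data.Csv (HasHeader (NoHeader), decode)

-- Hypothetical helper: re-encode Latin-1 bytes as UTF-8 so cassava's
-- Text instances see the encoding they assume.
latin1ToUtf8 :: B.ByteString -> BL.ByteString
latin1ToUtf8 = BL.fromStrict . TE.encodeUtf8 . TE.decodeLatin1

-- Hypothetical helper: parse a two-column, header-less Latin-1 CSV.
decodeLatin1Csv :: B.ByteString -> Either String (V.Vector (T.Text, T.Text))
decodeLatin1Csv = decode NoHeader . latin1ToUtf8

main :: IO ()
main = do
  -- "café,naïve\r\n" encoded as Latin-1 (é = 0xE9, ï = 0xEF).
  let input = B.pack [0x63, 0x61, 0x66, 0xE9, 0x2C, 0x6E, 0x61, 0xEF, 0x76, 0x65, 0x0D, 0x0A]
  print (decodeLatin1Csv input)
```

This makes a full extra pass over the input before parsing, which is one place where the performance question comes in.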

I have no idea about the performance characteristics of each alternative, though, including the proposed Data.Csv.Utf8.

tysonzero commented 1 year ago

To be clear, Data.Csv.Utf8 would just be the current implementation. The module name should make it clear that passing UTF-8 for the ByteString arguments is safe, and that non-UTF-8 arguments can hit edge cases and require additional care.
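For illustration only, a Text-facing entry point can already be approximated today as a thin wrapper over the existing ByteString API. The name `decodeText` is hypothetical and not part of cassava, and this is not necessarily how a real Text-based module would be implemented, since it pays for an encode step before parsing.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import qualified Data.Vector as V
import Data.Csv (FromRecord, HasHeader (NoHeader), decode)

-- Hypothetical wrapper (not in cassava): accept lazy Text, encode it as
-- UTF-8, and hand the bytes to the existing ByteString-based decoder.
decodeText :: FromRecord a => HasHeader -> TL.Text -> Either String (V.Vector a)
decodeText hasHeader = decode hasHeader . TLE.encodeUtf8

main :: IO ()
main = do
  let input = TL.pack "name,city\r\nAda,London\r\n"
  print (decodeText NoHeader input :: Either String (V.Vector (T.Text, T.Text)))
```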

I am also unsure how a Text-based alternative (Data.Csv, Data.Csv.Text, or whatever it ends up being called) would change performance.