Open tysonzero opened 2 years ago
An ASCII delimiter byte in the undecoded ByteString corresponds to a delimiter in the UTF-8-decoded Text, because UTF-8 never uses a byte below 0x80 inside a multi-byte sequence. So under UTF-8 encoding there is no risk of mistaking part of a character for a delimiter.
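As a rough illustration of that point (a sketch using only the bytestring and text packages, not cassava's actual parser; the function names are made up for the example), splitting on the comma byte before decoding gives the same fields as decoding first and splitting on the comma character:

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Split on the raw comma byte (0x2C), then decode each chunk as UTF-8.
splitBytesThenDecode :: B.ByteString -> [T.Text]
splitBytesThenDecode = map TE.decodeUtf8 . B.split 0x2C

-- Decode the whole input as UTF-8 first, then split on the comma character.
decodeThenSplit :: B.ByteString -> [T.Text]
decodeThenSplit = T.splitOn (T.pack ",") . TE.decodeUtf8

main :: IO ()
main = do
  -- Non-ASCII fields encode to multi-byte sequences, none of which contain 0x2C.
  let raw = TE.encodeUtf8 (T.pack "naïve,日本語,ok")
  print (splitBytesThenDecode raw == decodeThenSplit raw)
```

This only covers non-empty input with a plain comma delimiter; quoting and escaping are deliberately left out.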
However, the user is forced to use UTF-8 if there are Text/ShortText/Char fields (cassava assumes UTF-8). To use another text encoding, they need to use ByteString fields and do the ByteString-to-Text conversion separately. Alternatively, they can transcode between UTF-8 and the other text encoding, passing UTF-8-encoded ByteStrings when interfacing with cassava.
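A minimal sketch of the transcoding route, assuming the input is Latin-1 and every row has exactly two fields. The helper names latin1ToUtf8 and decodeLatin1Csv are invented for this example; decode, NoHeader, decodeLatin1 and encodeUtf8 are the existing cassava and text functions:

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Data.Csv (HasHeader (NoHeader), decode)
import Data.Text (Text)
import qualified Data.Text.Encoding as TE
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Hypothetical helper: re-encode Latin-1 bytes as UTF-8 before cassava sees them.
latin1ToUtf8 :: B.ByteString -> B.ByteString
latin1ToUtf8 = TE.encodeUtf8 . TE.decodeLatin1

-- Hypothetical helper: decode a Latin-1 CSV document into rows of two Text fields.
decodeLatin1Csv :: B.ByteString -> Either String (Vector (Text, Text))
decodeLatin1Csv = decode NoHeader . BL.fromStrict . latin1ToUtf8

main :: IO ()
main = do
  -- "café,1\r\n" in Latin-1; the lone byte 0xE9 is not valid UTF-8 on its own.
  let input = B.pack [0x63, 0x61, 0x66, 0xE9, 0x2C, 0x31, 0x0D, 0x0A]
  either putStrLn (mapM_ print . V.toList) (decodeLatin1Csv input)
```

The extra decode/encode pass over the whole input is the cost of this approach; for other source encodings a transcoding library would take the place of decodeLatin1.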
I have no idea about the performance characteristics of each alternative, though, including the proposed Data.Csv.Utf8.
To be clear, Data.Csv.Utf8 would just be the current implementation. The module name should make it clear that using UTF-8 for the ByteString arguments is safe, and that non-UTF-8 arguments should expect edge cases and require additional care.
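One concrete edge case, sketched with the text package only (UTF-16LE is picked purely as an example of a non-UTF-8 encoding): the single character U+012C encodes to the bytes 2C 01 in UTF-16LE, so a byte-level scan for the comma byte 0x2C finds a "delimiter" inside a field that contains no comma at all.

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- The ASCII comma as a raw byte.
commaByte :: Word8
commaByte = 0x2C

-- A one-character field: U+012C (LATIN CAPITAL LETTER I WITH BREVE).
field :: T.Text
field = T.pack "\x012C"

main :: IO ()
main = do
  -- UTF-16LE bytes are 2C 01, so the comma byte appears inside the character.
  print (B.elem commaByte (TE.encodeUtf16LE field))
  -- UTF-8 bytes are C4 AC, so no comma byte appears.
  print (B.elem commaByte (TE.encodeUtf8 field))
```

This prints True and then False.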
I am also unsure of how the Data.Csv or Data.Csv.Text or whatever Text-based alternative would change performance.
The spec appears to only mention text, and the specific binary encoding / charset of that text seems out of scope.
Accordingly, it seems to me that cassava should generally be dealing with Text instead of ByteString, perhaps with a Data.Csv.Utf8 module for directly treating ByteString values as encoded UTF-8 text.
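For a sense of what a Text-facing entry point could look like, here is a sketch layered on the existing ByteString API. decodeText is a hypothetical name and not something cassava exports; a real Text-based module would presumably parse Text directly instead of re-encoding it.

```haskell
module Main where

import qualified Data.ByteString.Lazy as BL
import Data.Csv (FromRecord, HasHeader (NoHeader), decode)
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Hypothetical wrapper (not part of cassava): accept Text, encode it as UTF-8,
-- and hand the bytes to the existing ByteString-based decoder.
decodeText :: FromRecord a => HasHeader -> Text -> Either String (Vector a)
decodeText hasHeader = decode hasHeader . BL.fromStrict . TE.encodeUtf8

main :: IO ()
main = do
  let doc = T.pack "Ada,Zürich\r\nKurt,Brno\r\n"
      rows = decodeText NoHeader doc :: Either String (Vector (Text, Text))
  either putStrLn (mapM_ print . V.toList) rows
```

The wrapper makes the charset question disappear at the type level, but it pays for an extra encoding pass, which is one of the open performance questions.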