Closed mgajda closed 12 years ago
Char8
is bad. Use a UTF8 decode routine (maybe the one from utf8-strings
) at the very least, but even that is not safe unless you happen to know the bytes are UTF8 encoded text. Probably should not be an instance, but an explicit function you have to call (which already exist in Data.Text.Encoding
)
I believe that it is correct when you have 100% certainty that you deal with pure ASCII, like when reading/writing legacy databases.
I added the instances since I noticed that many people are still using ASCII strings around for efficiency reasons. They also allow to sidestep unpleasant issue when UTF8 codec crashes entire program, or "forgets" characters because they were input by user in a wrong encoding.
Disclosure: 2 out of 3 languages I use every day are much more readable with non-ASCII encoding. I have been badly bitten by encoding mixtures in many data files I processed with a default UTF-8 encoding.
Char8
is actually latin1, and yes if you know your encoding it's safe, but in that case call a function specific to your encoding. An instance is dangerous because if someone uses it on bytes not in latin1, it will corrupt the data.
Yes, using a wrong tool may lead to data corruption. I believe that one should always use Data.ByteString instead of Data.ByteString.Char8 if one doesn't have 8-bit characters inside.
And that's why I suggest instances only for *.Char8 types, not for those in Data.ByteString module itself.
If there are other people sharing this reservation, I may factor the *.Bytestring.Char8 instances into a separate module that requires explicit imports. Still it could be beneficial to share this code in text-format, where it belongs.
I've applied 02c15bf80709e4246d3833c578fb3bf9b989489d, thanks.
I won't take the other patch, as it is not safe for the reasons that @singpolyma mentions.
Hi,
I have added Params class for up to 20 parameters (useful if generating old FORTRAN-style column data formats), and added Builder instance for both strict and lazy ByteString. The latter may be useful when dealing with legacy code.
Any hints about efficiency improvements in the future? (My parsing code is actually much faster than formatting with text-format.)