haskell / text-format

A Haskell text formatting library optimized for ease of use and high performance.
BSD 2-Clause "Simplified" License
32 stars 27 forks source link

Convenience: up to 20 parameters, Builders for ByteStrings #3

Closed mgajda closed 12 years ago

mgajda commented 12 years ago

Hi,

I have added Params class for up to 20 parameters (useful if generating old FORTRAN-style column data formats), and added Builder instance for both strict and lazy ByteString. The latter may be useful when dealing with legacy code.

Any hints about efficiency improvements in the future? (My parsing code is actually much faster than formatting with text-format.)

singpolyma commented 12 years ago

Char8 is bad. Use a UTF8 decode routine (maybe the one from utf8-strings) at the very least, but even that is not safe unless you happen to know the bytes are UTF8 encoded text. Probably should not be an instance, but an explicit function you have to call (which already exist in Data.Text.Encoding)

mgajda commented 12 years ago

I believe that it is correct when you have 100% certainty that you deal with pure ASCII, like when reading/writing legacy databases.

I added the instances since I noticed that many people are still using ASCII strings around for efficiency reasons. They also allow to sidestep unpleasant issue when UTF8 codec crashes entire program, or "forgets" characters because they were input by user in a wrong encoding.

Disclosure: 2 out of 3 languages I use every day are much more readable with non-ASCII encoding. I have been badly bitten by encoding mixtures in many data files I processed with a default UTF-8 encoding.

singpolyma commented 12 years ago

Char8 is actually latin1, and yes if you know your encoding it's safe, but in that case call a function specific to your encoding. An instance is dangerous because if someone uses it on bytes not in latin1, it will corrupt the data.

mgajda commented 12 years ago

Yes, using a wrong tool may lead to data corruption. I believe that one should always use Data.ByteString instead of Data.ByteString.Char8 if one doesn't have 8-bit characters inside.

And that's why I suggest instances only for *.Char8 types, not for those in Data.ByteString module itself.

mgajda commented 12 years ago

If there are other people sharing this reservation, I may factor the *.Bytestring.Char8 instances into a separate module that requires explicit imports. Still it could be beneficial to share this code in text-format, where it belongs.

bos commented 12 years ago

I've applied 02c15bf80709e4246d3833c578fb3bf9b989489d, thanks.

I won't take the other patch, as it is not safe for the reasons that @singpolyma mentions.