Fail to parse UTF-8 file with BOM

abailly commented 8 years ago

When trying to decode a CSV file that starts with a BOM (U+FEFF, i.e. the bytes EF BB BF in UTF-8, at the beginning of the file), decoding fails with the following error:

*** Exception: parse error (Failed reading: satisfy) at " D1 "," Account Number "," Value Date "," Date "," Time "," Description "," Your Reference "," Our  (truncated)

A possible solution would be to allow the user to pass a Text instead of a ByteString?

3noch commented 7 years ago

Just hit this as well. Very annoying!

chshersh commented 6 years ago

I also hit this issue :( and spent a lot of time discovering that the problem was actually the BOM... A better error message would be appreciated!

peti commented 6 years ago

I fell into the same trap: https://github.com/haskell-hvr/cassava/issues/160. The error message one gets in that case is really not particularly helpful. I guess that's the downside of using attoparsec for performance.

hvr commented 6 years ago

A possible solution would be to allow the user to pass a Text instead of a ByteString?

That would be an interesting idea (it would be interesting to see if this can be done w/o duplicating cassava's API), but even then you'd have to deal with a BOM somewhere (as the BOM code-point would still be part of the Text).

What we can do in any case is improve the documentation to warn about this and give a simple recipe for filtering out a BOM if the user expects one. (Personally I consider a BOM in a UTF-8 encoding a sign that something's wrong, as UTF-8 BOMs cause a lot of interoperability issues all over the place; that's why I wouldn't want cassava to silently strip them out by default.)

As for the recipe, bytestring offers, for example, the following function:

stripPrefix :: ByteString -> ByteString -> Maybe ByteString

So the recipe would simply be something like

stripUtf8Bom :: BS.ByteString -> BS.ByteString
stripUtf8Bom bs = fromMaybe bs (BS.stripPrefix "\239\187\191" bs)

EDIT: fixed stripUtf8Bom as pointed out by https://github.com/haskell-hvr/cassava/issues/106#issuecomment-379228397
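
For anyone who wants to paste this into a file, here is a minimal self-contained sketch of the same recipe. It assumes a lazy ByteString (the type cassava's `decode` takes) and a bytestring version that provides `stripPrefix`; the name `utf8Bom` is introduced here for illustration. Writing the BOM as explicit bytes sidesteps both OverloadedStrings and string-literal escapes.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Maybe (fromMaybe)

-- The UTF-8 encoding of U+FEFF, spelled out as raw bytes.
-- (utf8Bom is an illustrative name, not part of bytestring or cassava.)
utf8Bom :: BL.ByteString
utf8Bom = BL.pack [0xEF, 0xBB, 0xBF]

-- Drop one leading UTF-8 BOM if present; otherwise return the input unchanged.
stripUtf8Bom :: BL.ByteString -> BL.ByteString
stripUtf8Bom bs = fromMaybe bs (BL.stripPrefix utf8Bom bs)

main :: IO ()
main = do
  let body    = BL.pack [0x61, 0x2C, 0x62]   -- the bytes of "a,b"
      withBom = BL.append utf8Bom body
  print (stripUtf8Bom withBom == body)       -- BOM removed
  print (stripUtf8Bom body == body)          -- input without BOM is untouched
```

The stripped result can then be passed on to the decoder, for instance something like `decode HasHeader (stripUtf8Bom contents)`.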

ghost commented 6 years ago

I had to hack around a BOM code point in Text, as alluded to above.
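
The comment does not show the actual workaround, so the following is only a sketch of what stripping the BOM on the Text side might look like, assuming the input has already been decoded and the BOM survives as a leading U+FEFF. The name `stripBomText` is hypothetical.

```haskell
import qualified Data.Text as T
import Data.Maybe (fromMaybe)

-- Drop one leading U+FEFF code point if present; otherwise return the input unchanged.
-- (stripBomText is a hypothetical helper, not part of the text package.)
stripBomText :: T.Text -> T.Text
stripBomText t = fromMaybe t (T.stripPrefix (T.singleton '\xFEFF') t)

main :: IO ()
main = do
  let body = T.pack "a,b"
  -- T.cons avoids writing "\xFEFFa,b", where the 'a' would be read as
  -- another hex digit of the escape.
  print (stripBomText (T.cons '\xFEFF' body) == body)
  print (stripBomText body == body)
```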

chshersh commented 6 years ago

I asked for a solution to this on Stack Overflow when I first encountered the problem:

https://stackoverflow.com/questions/47367728/simplest-way-to-remove-bom-from-haskell-bytestring

Maybe the solution from SO will be faster than stripPrefix, since I expect the take and drop functions not to do any copying at all (just slicing).

hvr commented 6 years ago

@ChShersh stripPrefix doesn't do any copying either
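
For comparison, here is a rough sketch of a take/drop variant next to the stripPrefix one, on strict ByteStrings. It is written from the description above rather than copied from the Stack Overflow answer, and the names `stripBomSlice` and `stripBomPrefix` are made up for this example. On the sample inputs both give the same result.

```haskell
import qualified Data.ByteString as BS
import Data.Maybe (fromMaybe)

-- The UTF-8 BOM as raw bytes.
utf8Bom :: BS.ByteString
utf8Bom = BS.pack [0xEF, 0xBB, 0xBF]

-- take/drop variant: compare the first three bytes, then slice them off.
stripBomSlice :: BS.ByteString -> BS.ByteString
stripBomSlice bs
  | BS.take 3 bs == utf8Bom = BS.drop 3 bs
  | otherwise               = bs

-- stripPrefix variant, as in the recipe earlier in this thread.
stripBomPrefix :: BS.ByteString -> BS.ByteString
stripBomPrefix bs = fromMaybe bs (BS.stripPrefix utf8Bom bs)

main :: IO ()
main = do
  let samples =
        [ BS.append utf8Bom (BS.pack [0x61, 0x2C, 0x62])  -- BOM + "a,b"
        , BS.pack [0x61, 0x2C, 0x62]                      -- "a,b"
        , BS.pack [0xEF, 0xBB]                            -- truncated BOM
        , BS.empty
        ]
  -- Both variants should agree on every sample.
  print (all (\s -> stripBomSlice s == stripBomPrefix s) samples)
```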

adfretlink commented 6 years ago

First of all: thanks a bunch to @hvr for this elegant and simple solution.

However, unless I'm mistaken, \357\273\277 is the UTF-8 BOM expressed in octal, and trying to strip with this string doesn't work on my setup - I suspect it wouldn't typically. It works if you express it in decimal though, using \239\187\191, or if you make the use of octal explicit: \o357\o273\o277.
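
A small check with plain String literals (no OverloadedStrings) illustrates the difference: a numeric escape without a prefix is read as decimal, so octal needs `\o` and hexadecimal needs `\x`. The binding names are only for this example.

```haskell
-- Unprefixed numeric escapes are decimal: these are code points 357, 273, 277.
decimalEscapes :: String
decimalEscapes = "\357\273\277"

-- The same digits with an explicit octal prefix.
octalEscapes :: String
octalEscapes = "\o357\o273\o277"

-- The BOM bytes written in hexadecimal.
hexEscapes :: String
hexEscapes = "\xEF\xBB\xBF"

main :: IO ()
main = do
  print (map fromEnum decimalEscapes)        -- [357,273,277]
  print (map fromEnum octalEscapes)          -- [239,187,191]
  print (map fromEnum hexEscapes)            -- [239,187,191]
  print (octalEscapes == "\239\187\191")     -- True
```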

hvr commented 6 years ago

@adfretlink good catch!

tysonzero commented 3 years ago

I understand the decision to not directly address this, as it isn't strictly in scope for CSV, even if I personally think it's easier to fix directly on pareto-y grounds, particularly since it seems like plenty of extant programs continue to put out BOMs.

With that said, better error messages would be very helpful here, as this was unnecessarily painful to debug.
