Open abailly opened 8 years ago
Just hit this as well. Very annoying!
I also faced this issue :( And I spent a lot of time to discover that problem was actually in BOM... Better error message would be appreciated!
I fell into the same trap: https://github.com/haskell-hvr/cassava/issues/160. The error message one gets in that case is really not particularly helpful. I guess that's the downside of using attoparsec for performance.
A possible solution would be to allow the user to pass a Text instead of a ByteString?
That would be an interesting idea (it would be interesting to see if this can be done w/o duplicating cassava's API), but even then you'd have to deal with a BOM somewhere (as the BOM code-point would still be part of the Text).
What we can do in any case is improve the documentation to warn about this, and give a simple recipe for filtering a BOM if the user expects it (personally I consider a BOM in UTF-8 encodings a sign that something's wrong, as UTF-8 BOMs cause a lot of interoperability issues all over the place; that's why I wouldn't want cassava to silently strip them out by default).
As for the recipe, bytestring e.g. offers the following verb:
stripPrefix :: ByteString -> ByteString -> Maybe ByteString
So the recipe would simply be something like
stripUtf8Bom :: BS.ByteString -> BS.ByteString
stripUtf8Bom bs = fromMaybe bs (BS.stripPrefix "\239\187\191" bs)
EDIT: fixed stripUtf8Bom as pointed out by https://github.com/haskell-hvr/cassava/issues/106#issuecomment-379228397
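For anyone who wants to paste the recipe into a file, here is a self-contained sketch of it. It spells the BOM as explicit bytes, so it does not need OverloadedStrings and avoids any question of escape syntax. The names utf8Bom and stripUtf8BomLazy are my own additions for illustration; the lazy variant is there only because cassava's decode takes a lazy ByteString.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Maybe (fromMaybe)

-- The UTF-8 encoding of U+FEFF: bytes 0xEF 0xBB 0xBF (239 187 191 in decimal).
-- `utf8Bom` is a helper name introduced for this sketch.
utf8Bom :: BS.ByteString
utf8Bom = BS.pack [0xEF, 0xBB, 0xBF]

-- Strip a leading UTF-8 BOM if present; otherwise return the input unchanged.
stripUtf8Bom :: BS.ByteString -> BS.ByteString
stripUtf8Bom bs = fromMaybe bs (BS.stripPrefix utf8Bom bs)

-- Same idea for lazy ByteStrings (hypothetical helper, not part of cassava).
stripUtf8BomLazy :: BL.ByteString -> BL.ByteString
stripUtf8BomLazy bs = fromMaybe bs (BL.stripPrefix (BL.fromStrict utf8Bom) bs)
```

The intended use would be to apply one of these to the raw input right before handing it to the decoder.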
I had to hack around a bom codepoint in Text, as alluded to above.
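In case it helps someone else going the Text route, a minimal sketch of what such a workaround could look like, assuming the input has already been decoded as UTF-8 so that the BOM shows up as a leading U+FEFF code point. The name stripBomText is made up for this example and is not the actual code I used.

```haskell
import Data.Maybe (fromMaybe)
import qualified Data.Text as T

-- Drop a single leading U+FEFF (the BOM code point) from decoded text.
-- `stripBomText` is a hypothetical helper name.
stripBomText :: T.Text -> T.Text
stripBomText t = fromMaybe t (T.stripPrefix (T.singleton '\xFEFF') t)
```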
I asked for the solution of this problem on SO when I first encountered this problem:
Maybe the solution from SO will be faster than stripPrefix, since I expect the take and drop functions to not do any copying at all (just slicing).
@ChShersh stripPrefix doesn't do any copying either
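To make the comparison concrete, here is a sketch of a slicing-based variant next to which stripPrefix can be judged. It is my own reconstruction of the general approach, not the SO answer itself, and dropUtf8Bom and utf8Bom are names invented for the example. For strict ByteStrings, drop returns a slice that shares the original buffer, and stripPrefix is essentially a prefix check followed by a drop, so both should behave the same with respect to copying.

```haskell
import qualified Data.ByteString as BS

-- UTF-8 BOM bytes (hypothetical helper name).
utf8Bom :: BS.ByteString
utf8Bom = BS.pack [0xEF, 0xBB, 0xBF]

-- Slicing-based variant: check for the prefix, then drop it.
-- `BS.drop` on a strict ByteString shares the underlying buffer.
dropUtf8Bom :: BS.ByteString -> BS.ByteString
dropUtf8Bom bs
  | utf8Bom `BS.isPrefixOf` bs = BS.drop (BS.length utf8Bom) bs
  | otherwise = bs
```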
First of all: thanks a bunch to @hvr for this elegant and simple solution.
However, unless I'm mistaken, \357\273\277 is the UTF-8 BOM expressed in octal, and trying to strip with this string doesn't work on my setup - I suspect it wouldn't typically. It works if you express it in decimal though, using \239\187\191, or if you make the use of octal explicit: \o357\o273\o277.
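A small sketch of why this happens: Haskell string escapes of the form \NNN are decimal, so \357 is code point 357, not byte 0xEF. When packed into a ByteString each character is truncated to 8 bits, which yields unrelated bytes. The binding names below are made up for illustration.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC

decimalBom, octalBom, mistakenBom :: BS.ByteString
-- Decimal escapes: bytes 239 187 191, i.e. 0xEF 0xBB 0xBF.
decimalBom = BC.pack "\239\187\191"
-- Explicit octal escapes: the same three bytes.
octalBom = BC.pack "\o357\o273\o277"
-- Octal digits read as decimal: code points 357, 273, 277,
-- truncated to 8 bits when packed, so not the BOM.
mistakenBom = BC.pack "\357\273\277"
```

Comparing BS.unpack on the three values shows that the first two agree and the third does not.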
@adfretlink good catch!
I understand the decision to not directly address this, as it isn't strictly in scope for CSV, even if I personally think it's easier to fix directly on pareto-y grounds, particularly since it seems like plenty of extant programs continue to put out BOMs.
With that said better error messages would be very helpful here, as this was unnecessarily painful to debug.
When trying to decode a CSV file with a BOM (U+FEFF at the beginning of the file), it fails with the following error: