parsonsmatt opened this issue 7 years ago
The behavior you're experiencing is documented at http://hackage.haskell.org/package/bytestring-0.10.8.2/docs/Data-ByteString-Char8.html
However, it may make sense to attach a docstring to the IsString instance since the IsString instance transcends module barriers.
It seems unusual to me that `packChars` is the basis for `fromString` yet is not part of the API. I think `packChars` ought to be exported, the `IsString` instance haddock can just link to the `packChars` doc, and the `packChars` doc should explain what it does.
I don't understand the rationale for having to export `packChars` if we already have access to it via other means (i.e. `fromString` and `pack`) which alias it.
Ah, sorry, I didn't notice the `pack` function in the `Char8` module. Yeah, the important thing is for the `IsString` instance doc to have a link to `Char8.pack`.
This issue is still confusing users today: https://stackoverflow.com/questions/58777439/haskell-data-yaml-utf-8-decoding
This is a landmine for anyone using RawFilePath. Consider:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import System.Posix.Directory.ByteString

main = removeDirectory "bla語"
```
I agree that this seems to assume the source encoding will be used.
Since the `IsString` instance is still confusing both beginners and experienced users, how about removing it?
@sjakobi Since this instance has been in place for as long as I can remember, it'd be interesting, in order to inform a decision, to know how many packages on Hackage would be affected if this instance was removed in a future release of `bytestring` (and also whether those packages were in full compliance with the PVP by having the mandated upper bounds in place).
@hvr I'm sure that a removal would cause tons of upgrade hassle.
Maybe it would be better to change the instance to UTF-8-encode the input – that should limit any "breakage" to code that intentionally, or more likely unintentionally, triggered the current truncating behaviour.
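For concreteness, here's a sketch of what such a UTF-8-encoding conversion could look like using only `base` and `bytestring` (`utf8FromString` and `encodeCharUtf8` are invented names for illustration, and a real implementation would also have to decide what to do with surrogate code points):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import qualified Data.ByteString as B
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Char as its UTF-8 byte sequence.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | o < 0x80    = [fromIntegral o]
  | o < 0x800   = [0xC0 .|. hi 6, cont 0]
  | o < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    o = ord c
    hi n = fromIntegral (o `shiftR` n)                        -- leading byte payload
    cont n = 0x80 .|. (fromIntegral (o `shiftR` n) .&. 0x3F)  -- continuation byte

-- What a UTF-8 fromString would amount to.
utf8FromString :: String -> B.ByteString
utf8FromString = B.pack . concatMap encodeCharUtf8
```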
I think the instance should be removed. It isn't well defined. Carrying such pitfalls around for backwards compatibility reasons isn't what the haskell ecosystem should be about in my opinion. It should be about correctness.
We could mark it deprecated and remove it in 1 year.
Yet another alternative to removal: raise an error if any `Char` in the input would be truncated. IMO that's better than truncating silently, but runtime errors aren't that great either…
> Yet another alternative to removal: raise an error if any `Char` in the input would be truncated. IMO that's better than truncating silently, but runtime errors aren't that great either…
As much as I'd like for `"bla語" :: ByteString` to be a compile-time error, I'd rather have a runtime error telling me why something is wrong than have this fact stick around in the back of my head, and I'd certainly rather have received the error instead of investigating and discovering this :smile:
I'm in favor of removing the instance over a long-enough timeline. With `qq-literals` and the statically checked overloaded strings trick, the fix could be as simple as inserting the relevant quasiquoter (`[bs| ... |]`) or quotation `$$(...)` (or even `$$"..."` in 8.12!). This is an annoyance but would dramatically improve safety.
In stages:

1. Replace the `fromString` function with one that throws a runtime error in the next major version bump.
2. Remove the `IsString` instance in the next major version bump with migration instructions.
3. Add a `TypeError` to the `IsString` instance in the next major version bump with instructions on how to migrate away.

@parsonsmatt I really like the idea of offering quasiquoters as an alternative to literals as a first step. The `$$(...)` and `$$"..."` quotations seem slightly less readable to me.
`bs` and `lbs` seem like good names to me. Does everyone agree or have better suggestions?
bs "foo"
is syntactically lighter-weight than quasiquotation, and
trivial to implement:
bs :: Text -> ByteString
bs = Data.Text.Encoding.encodeUtf8
-- see shy jo
@joeyh `text` depends on `bytestring`, so `bytestring` can't have a dependency on `text`.
I guess it would be convenient if `bytestring` could UTF8-encode `String`s itself – I would be surprised if that hadn't been discussed before!
To clarify: the `bs` and `lbs` quasiquoters would only validate the input, rejecting any input containing non-Latin-1 characters, i.e. characters that don't fit into a single byte. The documentation could direct users who want UTF-8 encoding to the `utf8-string` package.
> I guess it would be convenient if `bytestring` could UTF8-encode `String`s itself – I would be surprised if that hadn't been discussed before!
Turns out that there is some UTF-8 functionality in `bytestring` – for `Builder`:
In fact the `IsString` instance for `Builder` supports UTF-8 via `stringUtf8`:
That's a big surprise to me, and I think it changes the picture.
I think we should aim for consistency with this (superior) instance then, and change the instances for strict and lazy `ByteString` to support UTF-8. That should be much more convenient than having to resort to `Builder`, `text` or `utf8-string` to get a UTF-8-encoded `ByteString`.
Thoughts on this?
> Thoughts on this?
My problem with this is that it's not a good API. ByteString has less structure than String, and `fromString` is just misleading here. Once you've turned a string into a bytestring, you've lost explicit information about the encoding and must simply know that it is UTF-8 and not something else.
I'd rather have people use those UTF-8 functions explicitly, so they write fewer bugs. The semantics between IsString instances are just not clear enough, IMO. If you need additional documentation for an instance, that's a good sign you shouldn't have that instance in the first place. Maybe someone will expect it to truncate... and someone else will expect it to be UTF-8.
> As much as I'd like for `"bla語" :: ByteString` to be a compile-time error, I'd rather have a runtime error telling me why something is wrong than have this fact stick around in the back of my head, and I'd certainly rather have received the error instead of investigating and discovering this
@parsonsmatt @sjakobi Fortunately for you guys I had this epiphany years ago and already implemented the required typed TH in a library (https://hackage.haskell.org/package/validated-literals) that makes it very simple to turn these problems into compile time errors :p
In fact, a compile time checked ByteString is literally one of the examples in that library ;)
One issue with changing the `ByteString` `IsString` instances to UTF-8 is that users would need to be careful to have proper lower bounds on `bytestring` to avoid truncation with older versions.
@hasufell

> Maybe someone will expect it to truncate...

I really really don't believe that anyone would intentionally write non-Latin-1 literals knowing that they are truncated.
I would be very interested in knowing what @hvr and @dcoutts think about the idea of changing the instances to UTF-8.
A lot of people I know (including me) have shot themselves in the foot with this instance. As a user of the Japanese writing system, I strongly support changing the instance to use UTF-8 (also to be consistent with the instance for Builder).
> I really really don't believe that anyone would intentionally write non-Latin-1 literals knowing that they are truncated.
The point is, you are proposing to change behavior without renaming the function, and you have no idea about the bugs in other people's codebases, which may depend on the unexpected behavior. PVP doesn't work; people don't read changelogs of 200+ packages to figure out such things (especially when they are not even aware of what they are relying on).
Removing the instance is safe. Old code will stop compiling, people will be forced to change it and think about its semantics.
> PVP doesn't work; people don't read changelogs of 200+ packages to figure out such things (especially when they are not even aware of what they are relying on).
The PVP only doesn't work if you're dealing with irresponsible programmers/maintainers who can't be bothered to honor it. Quite frankly, I recommend staying away from those people's packages, as they don't instill me with great confidence, and obviously it makes little sense for a PVP-compliant package to depend upon packages which don't themselves make any effort to uphold the PVP contract.
That being said, luckily almost all packages I rely on are maintained by people who care to write correct software and therefore also appreciate the goals and benefits of the PVP formal framework.
To summarise, the basic options for `instance IsString ByteString` are:

1. Keep the current ISO-8859-1 semantics, truncating each code point to its low 8 bits.
2. Switch to UTF-8 encoding.
3. Remove the instance (possibly replacing it with a `TypeError` one).

The nice property of 1. is that you can roundtrip the ByteString literals serialized by `Show`, especially those that aren't valid UTF-8 encodings -- after all, `ByteString` is about binary data, which is a superset of valid UTF-8 encodings. There's also code out there which relies on the `IsString` instance to efficiently embed binary blobs into Haskell code (still O(n), but with a relatively small factor; way better than using `pack` on `[Word8]` -- but see also @phadej and @andrewthad's https://github.com/ghc-proposals/ghc-proposals/pull/292). (NB: to make things more fun to reason about, GHC currently uses a modified UTF-8 encoding to represent string literals as CStrings; and there's been a long-time desire to have the data length be known statically at compile time... but I've seen @andrewthad make some progress in this department recently.)
The original complaint about silent failure could be addressed by turning it into a non-silent runtime exception. This probably wouldn't add any significant overhead, since currently string literals have an O(n) overhead anyway as they need to be transcoded; we'd just need to signal errors instead of truncating code points during this transcoding that needs to occur anyway.
Variant 2. would make it consistent with `instance IsString Builder`, which currently uses UTF-8 encoding (however, it could be argued that `IsString Builder` should be using ISO-8859-1 as well!). But other than that, it's still a weird choice to pick UTF-8 for a binary ByteString type. Also note that you still have to deal with transcoding GHC's modified UTF-8 serialization into proper UTF-8, as well as with improper UTF-8 strings (either silently or via runtime exceptions). And you lose the ability to represent all `ByteString` values as string literals.
The appeal of variant 3. is that it turns the silent failures into compile-time failures, but it also breaks all currently legitimate uses of bytestring literals, thereby throwing out the baby with the bathwater.
Note however that TH or QQ is not a proper replacement for the lack of string literals. Not all GHC platforms support TH (and even for those that do, TH adds quite a bit of compile-time overhead), so I'd hate to see packages start depending on TH support for something as common as bytestring literals, which IMO should be provided by GHC without the need for the quite heavy TH machinery.
And if we go one step further by making the instance illegal (i.e. so you can't define your own local instance), it'd effectively represent a regression to me, as I can't rely on TH for libraries (which I strive to keep portable, which means avoiding TH when possible). Consequently, I'd strongly object to any TypeError which would primarily recommend the use of TH/QQ as a replacement, as I'd consider this harmful for the ecosystem.
In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio; after all, string literals are constant values; if you manage to evaluate them at least once (hint hint, test coverage via `hpc`) as part of your CI, you'll be able to force those runtime errors. And typically you don't write that many ByteString literals anyway -- and if you do, I'd expect you to be mindful about why you're using `ByteString` literals in the first place -- and if you're abusing `ByteString` literals for encoding Unicode text instead of using `text`/`utf8-text`/`short-text`/..., I'd have little sympathy if you're not being careful. ;-)
I could also see `hlint` being able to detect silent code-point truncation when the type is trivially inferrable -- if it doesn't already support this; i.e. code like

```haskell
s :: ByteString
s = "€uro"
```

could be something that `hlint` could heuristically detect, IMO.
> The PVP only doesn't work if you're dealing with irresponsible programmers/maintainers who can't be bothered to honor it.
Well, I think it's fair to raise this issue here. PVP says:
> Breaking change. If any entity was removed, or the types of any entities or the definitions of datatypes or classes were changed, or orphan instances were added or any instances were removed, then the new A.B MUST be greater than the previous A.B. Note that modifying imports or depending on a newer version of another package may cause extra orphan instances to be exported and thus force a major version change.
This doesn't say anything about changing semantics (not types) of existing functions. There is no way for a maintainer to know when this has happened. In fact, even a minor version bump may introduce changed semantics of a function as part of a "bugfix" (since it's even debatable what a bug is). And if that "bug" has been "fixed" after a few years, your chances of knowing the impact are very low.
In this instance, the PVP doesn't give us any guarantees whatsoever. The only proper way to handle it is:

1. Deprecate the instance.
2. Remove it in a later major release.

This is the only way that makes a significant change in semantics visible to your users. They will see the deprecation warning in their build logs, especially with `-Werror`, and after a year or two they will experience a compilation failure and will have to look up the documentation/changelog.
> In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio

I think this is the most dangerous one. An error may crash someone's backend. We don't know that. Just imagine `IsString` used in the context of foreign input (from a user, another HTTP server). Now you have a DoS? :fearful:
@hasufell You're right about the incomplete wording in the PVP! I.e. the current wording of the PVP could be interpreted that way, even though most people have interpreted the intent behind the PVP to mean that observable semantic changes are obviously considered breaking changes. And we've already been discussing a clarification (see haskell/pvp#30) of the wording to state this detail more explicitly, so there's really no doubt about what the intent is (the enumerated rules merely encode the general principle from the POV of a consumer, i.e. they follow from that principle, not the other way around -- but that's a different topic). In any case, I'd suggest taking the PVP-specific discussion to haskell/pvp#30, where we're trying to address this minor oversight.
> > In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio
>
> I think this is the most dangerous one. An error may crash someone's backend. We don't know that. Just imagine `IsString` used in the context of foreign input (from a user, another HTTP server). Now you have a DoS?
How realistic is this scenario? For one, you typically have proper exception handling in place for critical components which take untrusted input -- and we're specifically talking about `IsString`, which is primarily intended for string literals; so you'd have to assume that somebody had been using code that was incorrect to begin with and silently truncated code points of a string literal. Or you might be considering the unintended use of `IsString` for non-static input from untrusted sources; but then again, you'd have to assume that code performs no input validation, given that `IsString` has no way to signal failure. I feel like you're trying to come up with pathological scenarios which may be technically correct but are not very common in practice. At least I can't think of any code of mine written over the years whose components would crash and burn without deserving it if non-ISO-8859-1 ByteString literals started triggering runtime exceptions. That being said, I'd find ways to cope if the IsString instance went missing (and without an annoying TypeError-fake-instance in its place), resulting in some busywork finding ways to express the previously concise ByteString literals in more verbose forms; I just fail to see whether this is worth the cost (especially since I personally can't remember the last time I encountered this class of error).
PS: There's still the option of leaving the IsString instance as-is and accepting the truncating behaviour as legitimate, for the sake of having a total function (at the cost of it being non-injective). It's not like this is a "bug" of `IsString ByteString`; it's a documented behaviour that's been in place for a very long time. Yes, it may bite people; yes, it's not aligned with the `Builder` instance (but you can also argue that the Builder instance, having been added later, is at fault here for diverging); but you can just as well justify the current semantics as a legitimate tradeoff.
The whole point of the discussion at hand is to gauge the cost/benefit ratio of a change from the current status quo, to inform whether a change is worth doing. Most importantly, all choices are merely tradeoffs; there doesn't appear to be one variant which is absolutely superior to the others, as far as I can tell.
Otoh, I partly blame GHC and the core Haskell standard for not anticipating the need for richer support for annotating the purpose/subtype of string literals (but then again, the Haskell Report considered a "String" merely an alias for `[Char]`); other languages I've worked with had some kind of charset annotation (and, most importantly, without requiring the heavy machinery of something like QQ/TH for something so basic); even Python, whose type system is in a totally different corner of the type-system space, has modifiers to annotate and distinguish regex, bytestring and text literals...
I've been bitten by changing instances before. The last major release of `time` changed the `instance Read UTCTime` silently (accidentally?) to require the trailing time zone. This broke a decent amount of code across the internet with `*** Prelude.read: no parse` exceptions, which are a bunch of fun to debug.
Personally, I really want the `instance IsString ByteString` to do UTF-8 encoding. That's what I initially expected, and it seems to be the expectation of many other people as well. It is possible that people are relying on the current behavior somehow (accidentally?), but I'd suspect that more people have silent/hidden bugs based on this truncation. That belief is based on the fact that the bug that inspired this issue was in production for quite some time before it was eventually discovered.

A runtime exception with an informative error message would be surprising, but at least it should point to exactly where you need to fix your code. I usually expect to find some weird behavior when I do a major version upgrade (typically a new `stackage` resolver), and a runtime exception like this would be satisfactory -- especially if that runtime exception noted that a future version of the library would change the behavior to use UTF-8 encoding instead of truncation, and provided a pointer to functions that could either a) have the old truncation behavior or b) have the future encoding behavior.
Removing the instance would be costly, and I wouldn't want to do it without a `TypeError` that gave you a note saying "Please just use `this function` instead to fix this error." Unfortunately, we can't attach a warning to an instance - this might be a good GHC feature to add.
I wish there was a good way of reaching out and polling the community about their preferences on this - it seems pretty important!
Builder's UTF-8 instance is probably blaze-builder/markup legacy.

I'd hope that if we do something, it will be a change to

```haskell
TypeError ... => IsString ByteString/Builder
```

Herbert's argument that the Latin-1 instance roundtrips the Show instance is one in favor of it. UTF-8 is somewhat arbitrary: why not UTF-16/32 LE/BE?

I'd also prefer not introducing any pure exceptions.

Fwiw, the http://hackage.haskell.org/package/overloaded-0.2/docs/Overloaded-Symbols.html instance for ByteStrings accepts only ASCII (7-bit) literals. That's somewhat accidental, by implementation; but it's proof that if we really want, we can make GHC fail at compile time on "invalid" literals.
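To make that idea concrete, here's a sketch of the `TypeError` trick on a stand-in newtype (the `StandIn` type and the message text are invented for illustration; the real change would target the actual `ByteString` instance):

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE UndecidableInstances #-}

import qualified Data.ByteString as B
import Data.String (IsString (..))
import GHC.TypeLits (ErrorMessage (..), TypeError)

-- A stand-in type, to avoid clashing with the real (still existing) instance.
newtype StandIn = StandIn B.ByteString

-- Any use of a string literal at this type is rejected with a custom
-- compile-time error instead of silently truncating.
instance TypeError ('Text "ByteString literals are ambiguous about encoding;"
                    ':$$: 'Text "use Data.ByteString.Char8.pack explicitly")
      => IsString StandIn where
  fromString = error "unreachable: rejected by the type checker"
```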
Just to throw out another bikeshedding idea: there is precedent for silently remapping invalid code points in text literals. Specifically, `text`/`utf8-text`/`text-short` remap invalid code points (specifically surrogate pairs, as documented in the `IsString` instance documentation) to the replacement character U+FFFD.
So instead of truncating code points beyond the ISO-8859-1 plane to whatever low 8 bits the code-point value has, pick a specific octet to map invalid code points to. (There are multiple potential choices here; one could e.g. pick `0xFF`, which is an octet that would be easy to spot and would also make sure that the resulting bytestring literal can't be a valid UTF-8 encoding anymore, as such octets are not allowed to occur in a well-formed UTF-8 string.)
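A sketch of that remapping variant (the function name is invented; this is not proposed API):

```haskell
import qualified Data.ByteString as B
import Data.Char (ord)
import Data.Word (Word8)

-- Map every code point outside ISO-8859-1 to 0xFF instead of silently
-- truncating it; 0xFF never occurs in well-formed UTF-8, so a mangled
-- literal can't masquerade as valid UTF-8.
remapFromString :: String -> B.ByteString
remapFromString = B.pack . map c2w
  where
    c2w :: Char -> Word8
    c2w c
      | o <= 0xFF = fromIntegral o
      | otherwise = 0xFF
      where
        o = ord c
```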
Btw, whatever we do, we're on a relatively long time-scale: bytestring was on major API version 0.9 between 2007 and 2012, and for the 7 years since 2012 we've been on major version 0.10. And I'd point out that this level of API stability is something I've grown to greatly appreciate for fundamental libraries such as `bytestring`, which almost every package depends upon. So any migration scheme involving multiple major versions is not something I'd expect to play out in a reasonable timeframe. Consequently, if (say) we decide to drop `IsString ByteString` (with or without `TypeError =>`), that's gonna happen in, say, `bytestring-0.11.0`, end of story. There's really no point in any long-winded multi-stage process IMO. Also, we should try to bundle up any other breaking changes that have accumulated since 2012, to make good use of this costly major version increment.
> after all, string literals are constant values; if you manage to evaluate them at least once (hint hint, test coverage via hpc) as part of your CI, you'll be able to force those runtime errors. And typically you don't write that many ByteString literals anyway
Just wanted to respond to this statement as someone who was bitten by this issue in a non-string-literal situation. The `IsString` instance is not necessarily only used for literals: in fact it is (ab)used by the `Data.String.Interpolate.IsString` module from `interpolate` to provide string interpolation. So, after obtaining a Unicode-containing ByteString through other means, I wrote

```haskell
interpolatedText :: T.Text = [i|Interpolated bytes: #{bytes}|]
```

And this produced a garbled result by truncating the Unicode characters. A runtime exception would be really nasty here because I wouldn't expect CI/hpc to catch it. (https://github.com/sol/interpolate/issues/14)
FWIW, I agree with @parsonsmatt that the best/least astonishing option is just to change the instance to do UTF-8 encoding. I think most of the other considerations in this thread (such as nice roundtrip properties with `Show`) are minor compared to this one -- language designers and core library authors need to help the rest of us avoid shooting ourselves in the foot with Unicode :). Consider for instance how Python 3 did so by adopting Unicode strings as the native string type.
Edit: another reason the status quo behavior is perplexing: I look up the documentation for `Data.Char` and it says "The character type `Char` is an enumeration whose values represent Unicode" (emphasis mine). Then I look at the definition of `IsString`, whose sole method is `fromString :: String -> a`, aka `fromString :: [Char] -> a`. So it's easy to see why someone would assume that the `fromString` instance for `ByteString` will convert my list of Unicode characters to some kind of valid Unicode-encoded `ByteString` (maybe not necessarily UTF-8, but something...).
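For reference, the class in question, as defined in `Data.String`, is just:

```haskell
class IsString a where
  fromString :: String -> a
```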
> in fact it is (ab)used by the Data.String.Interpolate.IsString module from interpolate
Unfortunately it is abused basically everywhere. Another example is `optparse-applicative`, which happily mangles your command-line options (optparse-applicative/issues/368, wontfix is a nice touch).
- `{-# LANGUAGE OverloadedStrings #-}` MUST continue to work for plain ASCII literals. Therefore, the `IsString` instance cannot just go away.
- For a literal like `"\xEF\xBB\xBF"`, each character literal encoding an octet in the string MUST produce exactly one octet in the `ByteString`:

```
Prelude> :set -XOverloadedStrings
Prelude> import Data.ByteString
Prelude Data.ByteString> Data.ByteString.unpack "\xEF\xBB\xBF"
[239,187,191]
```

- `'\x00'` and `'\xFF'` are semantically valid in ByteString literals.
```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.ByteString as B

utf8bom :: B.ByteString
utf8bom = "\xEF\xBB\xBF"

main :: IO ()
main = do
  putStr "Valid: "
  print $ B.unpack utf8bom
  putStr "Wrong: "
  print $ B.unpack $ T.encodeUtf8 "\xEF\xBB\xBF"
```

Output:

```
Valid: [239,187,191]
Wrong: [195,175,194,187,194,191]
```
Users who expect bytestring's `IsString` to perform UTF-8 encoding are misguided. A `ByteString` is an octet string! The only change that might be reasonable is to not truncate input characters, and instead raise an error when a literal contains characters outside the valid range:
```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ViewPatterns #-}

module Main (main) where

import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.ByteString as B
import Data.String (IsString, fromString)
import Data.Char (ord)
import Data.Coerce (coerce)
import Data.Word (Word8)
import Numeric (showHex)

newtype MyBS = MyBS B.ByteString

instance IsString MyBS where
  fromString s = coerce B.pack $ map c2w s
    where
      c2w :: Char -> Word8
      c2w (ord -> o)
        | o <= 0xff = toEnum o
        | otherwise = error $ "Illegal character in ByteString literal: "
            ++ show s ++ ": U+" ++ showHex o ""

utf8bom :: MyBS
utf8bom = "\xEF\xBB\xBF"

main :: IO ()
main = do
  putStr "Valid: "
  print $ B.unpack $ coerce utf8bom
  putStr "Wrong: "
  print $ B.unpack $ T.encodeUtf8 "\xEF\xBB\xBF"
```
This version will raise an exception when the input string contains characters outside the supported range.
Bottom line, I take a firm stance that encoding strings to UTF-8 to produce `ByteString` literals would not be correct. Any other conversions should be wrapped in suitable newtypes that represent a flavour of `ByteString` for which UTF-8 encoding of literals is expected/supported.
People use `IsString` for various purposes, from pretty-printing to parsing, so I would not say that there are any invariants to be violated. At least, `Data.String` does not mention any.
Bringing in `template-haskell` machinery does not look appealing to me. Besides introducing a fragile dependency, literal bytestrings tend to be small, and even a short quasiquoter is too much noise (one would probably prefer plain old `pack` instead).
From bytestring's point of view there is no reason to prefer UTF-8 to UTF-16/32 LE/BE. I would rather put `Builder` in line with `ByteString` than vice versa.
On the other side, I feel it is important to address @joeyh's remark about `removeDirectory "bla語"`. This is a potential security vulnerability for Haskell applications. IMHO it is worth adding validation, throwing errors for non-octet characters (or - to be on the safe side - for all non-ASCII chars).
@vdukhovni what you're saying is inconsistent with the way Haskell `String` actually works. When you write a string literal in double quotes, it is first understood at the language/compiler level as Unicode. This goes all the way back to the Haskell Report.
In other words, when you write a string like `"\xEF\xBB\xBF"`, it is understood as a `String` consisting of 3 Unicode characters. If you want to get a BOM, the right way to write it is with the actual Unicode code point, `"\xFEFF"`. The fact that your example above works is because you're first creating a `String` containing an invalid value (i.e. not a BOM at all) and then relying on the current (broken IMO) `fromString` instance behavior to truncate that into the desired bytes.
This kind of intricate weirdness is not what users of `bytestring` should have to rely upon :P. As I said above, I think encoding as UTF-8 (or at least some valid UTF encoding) would bring this library into consistency with the `String` type and Haskell itself. (Code below illustrates this.)
```haskell
{-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}

module Main (main) where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

main :: IO ()
main = do
  -- A string written this way is understood by Haskell to be 3 Unicode characters
  let s1 :: String = "\xEF\xBB\xBF"
  putStrLn ("Length of s1: " ++ (show $ length s1))
  putStr "Wrong (because it's not a single UTF BOM character!): "
  print $ B.unpack $ T.encodeUtf8 "\xEF\xBB\xBF"

  -- If you actually want a BOM, you should write it using the actual Unicode value U+FEFF
  let s2 :: String = "\xFEFF"
  putStrLn ("Length of s2: " ++ (show $ length s2))
  putStr "Right: "
  print $ B.unpack $ T.encodeUtf8 "\xFEFF"
```
> @vdukhovni what you're saying is inconsistent with the way Haskell `String` actually works. When you write a string literal in double quotes, it is first understood at the language/compiler level as Unicode. This goes all the way back to the Haskell Report.
It is a list of (Unicode) characters. We're discussing a way to recover an octet string (a list of Unicode code points in the range 0x00 - 0xff) from a convenient-to-keyboard literal character string. The text string to enter is the one you get from `show . Data.ByteString.Char8.unpack`. The expected invariant (for strings not containing out-of-range code points) is:
```
Prelude> :set -XOverloadedStrings
Prelude> import qualified Data.ByteString.Char8 as B
Prelude B> let s = "\xEF\xBB\xBF" :: String
Prelude B> s == Data.ByteString.Char8.unpack "\xEF\xBB\xBF"
True
```
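That invariant can also be written down as a property test (a sketch, assuming the QuickCheck package; this is not part of the thread's own code):

```haskell
import Data.String (fromString)
import qualified Data.ByteString.Char8 as B8
import Test.QuickCheck

-- Roundtrip for Latin-1-only strings: unpack (fromString s) == s.
prop_latin1Roundtrip :: Property
prop_latin1Roundtrip =
  forAll (listOf (choose ('\x00', '\xFF'))) $ \s ->
    B8.unpack (fromString s :: B8.ByteString) === s
```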
> In other words, when you write a string like `"\xEF\xBB\xBF"`, it is understood as a `String` consisting of 3 Unicode characters.
Yes, of course, this is exactly what I intended when I wrote that string. I want to write 3 literal octets.
I could have written `Data.ByteString.pack [0xef, 0xbb, 0xbf]` (and might have in this case), but in other cases I may have other surrounding octets that are more naturally represented as character data.
> If you want to get a BOM, the right way to write it is with the actual Unicode code point, `"\xFEFF"`.
But I'm not trying to represent the BOM as a Unicode String, I'm trying to represent its UTF-8 octet-string encoding! I'm creating a literal ByteString, not a literal String. That's why I'm calling it "utf8bom", not "bom".
> The fact that your example above works is because you're first creating a `String` containing an invalid value.
The string is perfectly valid, it contains three character literals, all below 0xff.
> (i.e. not a BOM at all)
See above, the intended "string" is a 3 character sequence with the three characters equal to the 3 octets of the UTF-8 octets of the encoded BOM. Like Horton, I meant what I said, and I said what I meant... :-)
> and then relying on the current (broken IMO) `fromString` instance behavior to truncate that into the desired bytes.
There's no truncation in my example. All the octets are <= 255.
Bottom line is that I really don't expect lists of Chars in a ByteString context to be magically UTF-8-encoded. I expect to get their numeric octets, allowing correct handling of:

```haskell
foo :: ByteString
foo = "A common hex string \xDE\xAD\xBE\xEF"
```
. Indeed just show
on a Data.ByteString.Char8 will do exactly that, and surely IsString
should be able to read back what show
produced.
> On the other side, I feel it is important to address @joeyh's remark about `removeDirectory "bla語"`. This is a potential security vulnerability for Haskell applications. IMHO it is worth adding validation, throwing errors for non-octet characters (or - to be on the safe side - for all non-ASCII chars).
Well, at least for literals in the code, this is not a "security issue"; the bad data is not from an attacker. It is nevertheless a correctness issue, and a potential WTF surprise.
The reason it is a surprise is that there exist interfaces for manipulating system resources that take raw bytestrings (in some unspecified encoding) and use them as identifiers.
What file a user has in mind when typing some string in some specific encoding is a rather complex question, particularly when dealing with network filesystems. It brings in complex questions about canonical forms (NFC, NFD, ...), case-sensitivity, ... It is a quagmire, and many questions in this space simply have no correct answers. What is the underlying codepage of the filesystem? Is it ISO-8859-1 or UTF-8?
About all one might hope for is that if you created a file from a ByteString literal of "это не английский алфавит.txt", using the same name again will refer to the same file (in the same directory). Otherwise, system programming with non-ASCII filenames requires considerable care. Naive expectations that automatic UTF-8 encoding in `IsString` will save the day are just that; the problem is harder than it looks.
> The expected invariant (for strings not containing out-of-range code points) is...
You can't just assert what the "expected invariant" is (or what this library "MUST" do, or what is "correct," in your previous comments), because that is precisely what we're debating here. There's clearly no solution here that is universally "correct," just those that are more or less confusing to new users, likely to break existing code, etc. We're trying to find the lesser evil here.
> Yes, of course, this is exactly what I intended when I wrote that string. I want to write 3 literal octets.
Yes, but at the `String` level that's not what you're getting! When you write `"\xEF"` you get a `String` containing the Unicode character U+00EF, which can be encoded any number of ways. Yes, the current behavior of `fromString` means you get to pretend this is an octet. But my argument is that this is overall not the best behavior, due to inconsistency with the rest of the language.
> There's no truncation in my example.
You're right, my mistake. The current implementation is just encoding U+00EF as the byte 239.
> Bottom line is that I really don't expect lists of Chars in a ByteString context to be magically UTF-8-encoded.
I get that, and I can see the value of being able to construct strings of octets this way.
I hope you can also see the value of someone being able to write `hello = "你好" :: ByteString` and getting a sensible result. And also the value of consistency, where a function named `fromString` in a language that explicitly uses Unicode as its native string type would produce a valid Unicode result. It's 2020 after all; Unicode and I18N aren't supposed to be an afterthought in language design.

Which of these considerations is more important is of course up to the community to decide, but let's not pretend one is clearly better than the other.
Edit: doing your octet building instead with `bom = B.pack [0xEF, 0xBB, 0xBF]` as you mentioned is advantageous in other ways too, because GHC will catch invalid octets (even overflowed ones, via `-Woverflowed-literals`).
> You can't just assert what the "expected invariant" is (or what this library "MUST" do, or what is "correct," in your previous comments), because that is precisely what we're debating here. There's clearly no solution here that is universally "correct," just those that are more or less confusing to new users, likely to break existing code, etc. We're trying to find the lesser evil here.
Sorry, you're right about the tone. That should be "the invariant I would expect", or something similar. There's no way to please everyone here. Any change from the status quo will break something. One consideration is whether the breakage is silent or highly visible.
If we stop truncating, and error out on Char values above 255, the breakage will likely be quite visible.
If we instead 'encodeUtf8', I am less confident that the problem will be found before it affects users.
Of course, for resource identifiers that are typed as ByteStrings, but are generally UTF-8 encoded strings, the more friendly behaviour is encoding. And not surprisingly that would, for example, break a couple of doctests in the DNS library.
We could of course provide a related type `LiteralByteString` that's a `newtype` around `ByteString` with the legacy `IsString` interface. One could then use literal strings to initialize these, and then extract the underlying ByteString, along the lines of the `MyBS` example above.
Sure, I would just add that (as suggested earlier in this thread) there might be silent bugs out there already due to people interpreting the `IsString` instance the way I did. One of those bugs in my own code is what brought me to this thread in the first place. So it's possible a switch to encoding would lead to silent fixes as well as silent breakages :).

But yeah, clearly a breaking change of this magnitude involves a lot of considerations and I'm not sure what is actually best -- I just wanted to get some arguments out there for the pro-encoding camp.
After reading the thread again, I am definitely against removal of the instance. Which leaves imperfect choices. I can't think of any situation where truncating Unicode code-points above 255 to 8-bits produces better results than latching the value at 255 for all invalid inputs, garbage-in/garbage-out. This option is backwards-compatible for ISO-Latin (really 8-bit octets), and is perhaps more likely to be safe in practice (less rope for attackers to exploit the truncation to yield chosen "useful" outputs).
So my "vote" is either status-quo, or truncate to 255, rather than low 8-bits. Throwing runtime errors for out-of-range inputs, where none were raised before seems much more risky, and would be my least preferred choice (other than removal, which I'm not considering as an option).
@merijn Thanks for the pointer to `ValidLiterals` (I'd seen it once before, but it never really registered). I may have some uses for this in code that will not be a core library for the ecosystem, and so a TH dependency is not a deal-breaker.
I made a patch to GHC (https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3290) which would make GHC warn about "overflowing" ByteString literals.
(Please don't comment there if it's not about implementation, let's keep the discussion here instead)
> What file a user has in mind when typing some string in some specific encoding is a rather complex question, particularly when dealing with network filesystems. It brings in complex questions about canonical forms (NFC, NFD, ...), case-sensitivity, ... It is a quagmire, and many questions in this space simply have no correct answers. What is the underlying codepage of the filesystem? Is it ISO-8859-1 or UTF-8?
The point is, once you're already at the ByteString level, all those questions are irrelevant. Nothing is to be interpreted automatically. IsString violates this.

I like @phadej's solution, but I would additionally make it a deprecated instance and remove it after 1 year. The amount of discussion here is proof that nothing about this instance is intuitive or correct.
The GHC patch by @phadej sounds like a step in the right direction, but I'd just point out it doesn't solve the runtime uses of `fromString` in libraries like `interpolate` and `optparse-applicative`, discussed above.
Also, responding to a comment on that merge request:
> The truncation of characters to octets might be surprising, yet using a blessed (UTF8 or any other) encoding is not optimal. Then one couldn't represent all ByteStrings with literals.
For a point of comparison, I'm pretty sure this is exactly the situation that exists in Python (a language well-known for its pragmatism). To quote Python's documentation, UTF-8 "is one of the most commonly used encodings, and Python often defaults to using it."

For example, the prototype of the string encode function in Python is `str.encode(encoding="utf-8", errors="strict")`. Python also seems to recommend writing out arbitrary bytestrings using lists of numbers, not literals. So I don't see why a blessed encoding is so bad.
> Bringing in `template-haskell` machinery does not look appealing to me. Besides introducing a fragile dependency, literal bytestrings tend to be small, and even a short quasiquoter is too much noise (one would probably prefer plain old `pack` instead).
So, I actually brought up the issue of "partial" instances of the various `OverloadedX` classes -- and how it would be nice to have the conversions be `String -> Either Error a` or similar, with GHC checking those conversions at compile time -- probably somewhere around 2013. At the time the idea was shot down as "unnecessary and complex to implement"; there seemed to be no demand for it, and people should just assume no one misuses those classes.
I wrote a proof-of-concept TH library at the time that does exactly this (and has an example of enforcing ASCII in bytestrings!). I'm not quite happy with the current API (it requires too much manual annotation), so maybe I'll revisit it sometime Soon (TM).
I still think this functionality should just be inside GHC, but my last attempt to convince people failed. So if anyone wants to take up this crusade feel free to point at the POC implementation :)
@thomasjm `fromInteger 123132 :: Word8` also truncates. Whether this is desirable or not depends. Concretely: should it be truncated to `maxBound` or truncated by bytes? Both behaviours are useful.
@phadej I'm not sure how to interpret this analogy. I already acknowledged above that the ability to leverage the current behavior to write octet strings is useful, ignoring other considerations. Of course, truncation on integer overflow feels like a rather different beast from silently dropping Unicode bytes since it can happen at the hardware level and programmers are more likely to expect it. The point is that as long as the instance exists we have to make a choice. Also:
- GHC warns about `foo = 123132 :: Word8` (which is why I think your patch is a good idea, if we keep the status quo).
- GHC rejects `foo = "\1114112" :: String` (you get `error: numeric escape sequence out of range`, because GHC understands that `'\1114111'` is the largest valid Unicode value).

> I wrote a proof-of-concept TH library at the time that does exactly this (and has an example of enforcing ASCII in bytestrings!). I'm not quite happy with the current API (it requires too much manual annotation), so maybe I'll revisit it sometime Soon (TM).
FWIW, I found it reasonably appealing as-is. For a new DNS library I'm working on, I have prototyped the below:

```haskell
newtype Domain = Domain { getDomain :: ByteString } deriving Eq

instance Validate String Domain where
    fromLiteral = safePack >=> parseDomain
      where
        safePack :: String -> Maybe ByteString
        safePack = fmap B.pack . traverse c2w
        c2w = toIntegralSized . ord

    liftResult _ (Domain v) = TH.unsafeTExpCoerce $
        TH.AppE <$> [| Domain . B.pack |] <*> TH.lift (B.unpack v)

mkDomain :: String -> TH.Q (TH.TExp Domain)
mkDomain = valid
```
This makes `$$(mkDomain "foo.example.org")` available for writing literal DNS names that are validated at compile time to match the syntax of a DNS name in presentation form. The input string literal is first mapped to a ByteString via a validating `c2w` that rejects non-octet values, and this is then run through my `ByteString -> Domain` parser that decodes valid DNS presentation forms (decimal escapes, non-decimal escapes, label length limits, ...). Yes, the user-visible syntax is more heavyweight than just `"foo.example.org"`, but compile-time safety, avoiding `fromJust`, etc., is likely worth the price.
[ I cargo-culted the definition of `liftResult` without really knowing why I need `unsafeTExpCoerce`, but it looks reasonably plausible, even if I couldn't presently come up with it myself... ]
@vdukhovni that's because `[| .. |]` is a regular (untyped) TH quote, so you have to unsafely coerce the untyped TH expression into a typed one. You probably want to use a typed TH quotation `[|| .. ||]` (note the double bar), which should eliminate the need to coerce it.
> I made a patch to GHC (https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3290) which would make GHC warn about "overflowing" ByteString literals.
>
> (Please don't comment there if it's not about implementation, let's keep the discussion here instead)

This is brilliant! I love it. Thank you!
At work, we discovered a somewhat surprising behavior of ByteString's `IsString` instance and its interaction with `OverloadedStrings`. The following REPL session demonstrates the issue:
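(The original session isn't preserved here; a sketch of a session to this effect, where `語` is U+8A9E and its low byte is 158:)

```
ghci> :set -XOverloadedStrings
ghci> import Data.ByteString (ByteString)
ghci> "bla語" :: ByteString
"bla\158"
```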
The `IsString` instance calls `packChars`, which calls `c2w`, which silently truncates the bytes.

I'd be happy to put together a PR to document the behavior of the `IsString` instance.

I think I expected it to encode the string using the source encoding. I don't know whether or not that's a feasible or desirable change.