haskell / bytestring

An efficient, compact, immutable byte string type (both strict and lazy) suitable for binary or 8-bit character data.
http://hackage.haskell.org/package/bytestring

Surprising behavior of ByteString literals via IsString #140

Open parsonsmatt opened 7 years ago

parsonsmatt commented 7 years ago

At work, we discovered a somewhat surprising behavior of ByteString's IsString instance and interaction with OverloadedStrings.

The following REPL session demonstrates the issue:

λ> BS.unpack $ T.encodeUtf8 ("bla語" :: Text)
[98,108,97,232,170,158]
λ> BS.unpack $ ("bla語" :: BS.ByteString)
[98,108,97,158]
λ> T.decodeUtf8 $ ("bla語" :: BS.ByteString)
*** Exception: Cannot decode byte '\x9e': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8

The IsString instance calls packChars, which calls c2w, which silently truncates each character's code point to its low byte.
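
A minimal sketch of that truncation (c2w in Data.ByteString.Internal is essentially `fromIntegral . ord`), reproducing the REPL output above:

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Truncating conversion: keeps only the low 8 bits of the code point.
-- U+8A9E ('語') becomes 0x9E = 158, matching the REPL session above.
c2w :: Char -> Word8
c2w = fromIntegral . ord

main :: IO ()
main = print (map c2w "bla語")  -- [98,108,97,158]
```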

I'd be happy to put together a PR to document the behavior of the IsString instance.

I think I expected it to encode the string using the source encoding. I don't know whether or not that's a feasible or desirable change.

hvr commented 7 years ago

The behavior you're experiencing is documented at http://hackage.haskell.org/package/bytestring-0.10.8.2/docs/Data-ByteString-Char8.html

However, it may make sense to attach a docstring to the IsString instance, since instances transcend module boundaries.

chris-martin commented 6 years ago

It seems unusual to me that packChars is the basis for fromString yet is not part of the API. I think packChars ought to be exported, and the IsString instance haddock can just link to the packChars doc, and the packChars doc should explain what it does.

hvr commented 6 years ago

I don't understand the rationale of having to export packChars if we already have access to it via other means (i.e. fromString & pack) which alias it.

chris-martin commented 6 years ago

Ah, sorry, I didn't notice the pack function in the Char8 module. Yeah, the important thing is for the IsString instance doc to have a link to Char8.pack.

sjakobi commented 5 years ago

This issue is still confusing users today: https://stackoverflow.com/questions/58777439/haskell-data-yaml-utf-8-decoding

joeyh commented 4 years ago

This is a landmine to anyone using RawFilePath. Consider:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import System.Posix.Directory.ByteString

main = removeDirectory "bla語"
```

I agree that this seems to assume the source encoding will be used.

sjakobi commented 4 years ago

Since the IsString instance is still confusing both beginners and experienced users, how about removing it?

hvr commented 4 years ago

@sjakobi Since this instance has been in place for as long as I can remember, it'd be interesting, in order to inform a decision, to know how many packages on Hackage would be affected if this instance were removed in a future release of bytestring (and also whether those packages were in full compliance with the PVP by having the mandated upper bounds in place).

sjakobi commented 4 years ago

@hvr I'm sure that a removal would cause tons of upgrade hassle.

Maybe it would be better to change the instance to UTF8-encode the input – that should limit any "breakage" to code that intentionally or more likely unintentionally triggered the current truncating behaviour.
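
For comparison, a sketch of what a UTF-8-encoding fromString would produce for the literal from the issue description. The helper name is hypothetical, and text is used purely for illustration:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical UTF-8-encoding fromString, via text for illustration only.
fromStringUtf8 :: String -> B.ByteString
fromStringUtf8 = TE.encodeUtf8 . T.pack

main :: IO ()
main = print (B.unpack (fromStringUtf8 "bla語"))  -- [98,108,97,232,170,158]
```

This matches the `BS.unpack $ T.encodeUtf8 ("bla語" :: Text)` result from the original report, rather than the truncated bytes.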

hasufell commented 4 years ago

I think the instance should be removed. It isn't well defined. Carrying such pitfalls around for backwards-compatibility reasons isn't what the Haskell ecosystem should be about, in my opinion. It should be about correctness.

We could mark it deprecated and remove it in 1 year.

sjakobi commented 4 years ago

Yet another alternative to removal: Raise an error if any Char in the input would be truncated. IMO that's better than truncating silently, but runtime errors aren't that great either…

parsonsmatt commented 4 years ago

Yet another alternative to removal: Raise an error if any Char in the input would be truncated. IMO that's better than truncating silently, but runtime errors aren't that great either…

As much as I'd like for "bla語" :: ByteString to be a compile-time error, I'd rather have a runtime error telling me why something is wrong rather than having this fact stick around in the back of my head, and I'd certainly rather have received the error instead of investigating and discovering this :smile:

I'm in favor of removing the instance over a long-enough timeline. With qq-literals and the statically checked overloaded strings trick, the fix could be as simple as inserting the relevant quasiquoter ([bs| ... |]) or quotation $$(...) (or even $$"..." in 8.12!). This is an annoyance but would dramatically improve safety.

In stages:

  1. Replace the fromString function with one that throws a runtime error in the next major version bump.
  2. Add a warning to the IsString instance in the next major version bump with migration instructions.
  3. Add a TypeError to the IsString instance in the next major version bump with instructions on how to migrate away.

sjakobi commented 4 years ago

@parsonsmatt I really like the idea of offering quasiquoters as an alternative to literals as a first step. The $$(...) and $$"..." quotations seem slightly less readable to me.

bs and lbs seem like good names to me. Does everyone agree or have better suggestions?

joeyh commented 4 years ago

bs "foo" is syntactically lighter-weight than quasiquotation, and trivial to implement:

bs :: Text -> ByteString
bs = Data.Text.Encoding.encodeUtf8


sjakobi commented 4 years ago

@joeyh text depends on bytestring, so bytestring can't have a dependency on text.

I guess it would be convenient if bytestring could UTF8-encode Strings itself – I would be surprised if that hadn't been discussed before!
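
A dependency-free UTF-8 encoder is indeed straightforward to sketch. This is a hypothetical helper, not part of bytestring's API; it encodes one Char to its UTF-8 bytes (ignoring the surrogate range for simplicity) and agrees with encodeUtf8 on the example from the issue description:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode a single Char to its UTF-8 byte sequence.
-- Note: a real implementation would also reject surrogate code points.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | o < 0x80    = [fi o]
  | o < 0x800   = [0xc0 .|. fi (o `shiftR` 6), cont 0]
  | o < 0x10000 = [0xe0 .|. fi (o `shiftR` 12), cont 6, cont 0]
  | otherwise   = [0xf0 .|. fi (o `shiftR` 18), cont 12, cont 6, cont 0]
  where
    o      = ord c
    fi     = fromIntegral
    -- continuation byte carrying bits [n+5 .. n] of the code point
    cont n = 0x80 .|. fi ((o `shiftR` n) .&. 0x3f)

main :: IO ()
main = print (concatMap encodeCharUtf8 "bla語")  -- [98,108,97,232,170,158]
```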

sjakobi commented 4 years ago

To clarify: The bs and lbs quasiquoters would only validate the input, rejecting any input containing non-Latin-1 characters, i.e. characters that don't fit into a single byte.

The documentation could direct users who want UTF-8 encoding to the utf8-string package.
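
A rough sketch of what such a validating quasiquoter could look like. Everything here is hypothetical: the name `bs`, the module name, and the error wording are illustrative only:

```haskell
{-# LANGUAGE TemplateHaskellQuotes #-}

-- Hypothetical module sketching the proposed 'bs' quasiquoter.
module BSLiteral (bs) where

import Data.Char (ord)
import qualified Data.ByteString.Char8 as B8
import Language.Haskell.TH.Quote (QuasiQuoter (..))

-- Accept only Latin-1 (single-byte) characters, so [bs|bla|] compiles
-- while [bs|bla語|] is rejected at compile time.
bs :: QuasiQuoter
bs = QuasiQuoter
  { quoteExp = \s ->
      if all (\c -> ord c <= 0xff) s
        then [| B8.pack s |]
        else fail ("bs: non-Latin-1 character in literal: " ++ show s)
  , quotePat  = unsupported
  , quoteType = unsupported
  , quoteDec  = unsupported
  }
  where
    unsupported _ = fail "bs: only usable in expression context"
```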

sjakobi commented 4 years ago

I guess it would be convenient if bytestring could UTF8-encode Strings itself – I would be surprised if that hadn't been discussed before!

Turns out that there is some UTF-8 functionality in bytestring – for Builder:

In fact the IsString instance for Builder supports UTF-8 via stringUtf8:

https://github.com/haskell/bytestring/blob/95fe6bdf13c9cc86c1c880164f7844d61d989574/Data/ByteString/Builder.hs#L452-L453

That's a big surprise to me, and I think it changes the picture.

I think we should aim for consistency with this (superior) instance then, and change the instances for strict and lazy ByteString to support UTF-8. That should be much more convenient than having to resort to Builder, text or utf8-string to get a UTF-8-encoded ByteString.

Thoughts on this?

hasufell commented 4 years ago

Thoughts on this?

My problem with this is that it's not a good API. ByteString has less structure than String, and fromString is just misleading here. Once you've turned it into a ByteString, you've lost explicit information (about the encoding) and must be aware that it is UTF-8 and not something else.

I'd rather have people use those UTF-8 functions explicitly, so they write fewer bugs. The semantics between IsString instances are just not clear enough, IMO. If you need additional documentation for an instance, that's a good sign you shouldn't have that instance in the first place.

Maybe someone will expect it to truncate... and someone else will expect it to be UTF-8.

merijn commented 4 years ago

As much as I'd like for "bla語" :: ByteString to be a compile-time error, I'd rather have a runtime error telling me why something is wrong rather than having this fact stick around in the back of my head, and I'd certainly rather have received the error instead of investigating and discovering this

@parsonsmatt @sjakobi Fortunately for you guys I had this epiphany years ago and already implemented the required typed TH in a library (https://hackage.haskell.org/package/validated-literals) that makes it very simple to turn these problems into compile time errors :p

In fact, a compile time checked ByteString is literally one of the examples in that library ;)

sjakobi commented 4 years ago

One issue with changing the ByteString IsString instances to UTF-8 is that users would need to be careful to have proper lower bounds on bytestring to avoid truncation with older versions.

@hasufell

Maybe someone will expect it to truncate...

I really really don't believe that anyone would intentionally write non-Latin-1 literals knowing that they are truncated.

I would be very interested in knowing what @hvr and @dcoutts think about the idea of changing the instances to UTF-8.

fumieval commented 4 years ago

A lot of people I know (including me) have shot themselves in the foot with this instance. As a user of the Japanese writing system, I strongly support changing the instance to use UTF-8 (also to be consistent with the instance for Builder).

hasufell commented 4 years ago

I really really don't believe that anyone would intentionally write non-Latin-1 literals knowing that they are truncated.

The point is, you are proposing to change behavior without renaming the function and you have no idea about the bugs in other peoples codebases, which may depend on unexpected behavior. PVP doesn't work, people don't read ChangeLogs of 200+ packages to figure out such things (especially when they are not even aware of what they are relying on).

Removing the instance is safe. Old code will stop compiling, people will be forced to change it and think about its semantics.

hvr commented 4 years ago

PVP doesn't work, people don't read ChangeLogs of 200+ packages to figure out such things (especially when they are not even aware of what they are relying on).

The PVP only doesn't work if you're dealing with irresponsible programmers/maintainers who can't be bothered to honor it. Quite frankly, I recommend staying away from those people's packages, as they don't instill me with great confidence, and obviously it makes little sense for a PVP-compliant package to depend upon packages which don't themselves make any effort to uphold the PVP contract.

That being said, luckily almost all packages I rely on are maintained by people who care to write correct software and therefore also appreciate the goals and benefits of the PVP formal framework.

hvr commented 4 years ago

To summarise the basic options, instance IsString ByteString could be either:

  1. Decoding as ISO-8859-1 (status quo)
  2. Decoding as UTF-8
  3. not supported

The nice property of 1. is that you can roundtrip the ByteString literals serialized by Show, especially those that aren't valid UTF-8 encodings -- after all, ByteString is about binary data, which is a superset of valid UTF-8 encodings. There's also code out there which relies on the IsString instance to efficiently (still O(n), but with a relatively small factor, and way better than using pack on [Word8] -- but see also @phadej and @andrewthad's https://github.com/ghc-proposals/ghc-proposals/pull/292) embed binary blobs into Haskell code. (NB: to make things more fun to reason about, GHC currently uses a modified UTF-8 encoding to represent string literals as CStrings; and there's been a long-time desire to have the data length be known statically at compile time... but I've seen @andrewthad make some progress in this department recently.)

The original complaint about silent failure could be addressed by turning it into a non-silent runtime exception. This probably wouldn't add any significant overhead since currently string literals have a O(n) overhead as they need to be transcoded; we'd just need to signal errors instead of truncating code-points during this transcoding that needs to occur anyway.


Variant 2. would make it consistent with instance IsString Builder, which currently uses UTF-8 encoding (however, it could be argued that IsString Builder should be using ISO-8859-1 as well!). But other than that, it's still a weird choice to pick UTF-8 for a binary ByteString type. Also note that you still have to deal with transcoding GHC's modified UTF-8 serialization into proper UTF-8, as well as dealing with improper UTF-8 strings (either silently or via runtime exceptions). And you lose the ability to represent all ByteString values as string literals.


The appeal of variant 3. is that it turns the silent failures into compile-time failures -- but it also rejects all currently legitimate uses of bytestring literals, thereby throwing out the baby with the bathwater.

Note however that TH or QQ is not a proper replacement for the lack of string literals. Not all GHC platforms support TH (and even for those that do, TH adds quite a bit of compile-time overhead); so I'd hate to see packages start depending on TH support for something as common as bytestring literals, which should IMO be provided by GHC without the need for the quite heavy TH machinery.

And if we go one step further by making the instance illegal (i.e. so you can't define your own local instance), it'd effectively represent a regression to me, as I couldn't rely on TH for libraries (which I strive to keep portable, which means avoiding TH when possible). Consequently, I'd strongly object to any TypeError which would primarily recommend the use of TH/QQ as a replacement, as I'd consider this harmful for the ecosystem.


In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio; after all, string literals are constant values; if you manage to evaluate them at least once (hint hint, test-coverage via hpc) as part of your CI, you'll be able to force those runtime errors. And typically you don't write that many ByteString literals anyway -- and if you do, I'd expect you to be mindful about why you're using ByteString literals in the first place -- and if you're abusing ByteString literals for encoding Unicode text instead of using text/utf8-text/short-text/..., I'd have little sympathy if you're not being careful. ;-)

I could also see hlint be able to detect silent code-point truncation when the type is trivially inferrable -- if it doesn't already support this; i.e. code like

s :: ByteString
s = "€uro"

could be something that hlint could heuristically detect IMO.

hasufell commented 4 years ago

The PVP only doesn't work if you're dealing with irresponsible programmers/maintainers which can't be bothered to honor it.

Well, I think it's fair to raise this issue here. PVP says:

Breaking change. If any entity was removed, or the types of any entities or the definitions of datatypes or classes were changed, or orphan instances were added or any instances were removed, then the new A.B MUST be greater than the previous A.B. Note that modifying imports or depending on a newer version of another package may cause extra orphan instances to be exported and thus force a major version change.

This doesn't say anything about changing semantics (not types) of existing functions. There is no way for a maintainer to know when this has happened. In fact, even a minor version bump may introduce changed semantics of a function as part of a "bugfix" (since it's even debatable what a bug is). And if that "bug" has been "fixed" after a few years, your chances of knowing the impact are very low.

In this instance, PVP doesn't give us any guarantees whatsoever. The only proper way for that is:

  1. deprecation phase
  2. removing the function and providing a new one

This is the only way that makes a significant change in semantics visible to your users. They will see the deprecation warning in their build logs (especially with -Werror), and after a year or two they will experience a compilation failure and will have to look up the documentation/ChangeLog.


In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio

I think this is the most dangerous one. An error may crash someone's backend. We don't know that. Just imagine IsString used in the context of foreign input (from a user, or another HTTP server). Now you have a DoS? :fearful:
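
The concern is plausible because fromString is an ordinary function, not restricted to literals. A sketch (the handler name is hypothetical) of runtime data flowing through the instance:

```haskell
import Data.String (fromString)
import qualified Data.ByteString as B

-- Hypothetical handler: nothing stops application code from calling
-- fromString on runtime input. Under an error-on-truncation instance,
-- a non-Latin-1 input string would throw here instead of truncating.
toWireFormat :: String -> B.ByteString
toWireFormat userInput = fromString userInput

main :: IO ()
main = print (B.unpack (toWireFormat "bla語"))  -- truncates today: [98,108,97,158]
```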

hvr commented 4 years ago

@hasufell You're right about the incomplete wording in the PVP! I.e. the current wording of the PVP could be interpreted that way, even though most people have interpreted the intent behind the PVP to mean that obviously semantically observable changes are considered breaking changes. We've already been discussing a clarification of the wording (see haskell/pvp#30) to state this detail more explicitly, so there's really no doubt about what the intent is (the enumerated rules merely encode the general principle from the POV of a consumer; i.e. they follow from that principle, not the other way around -- but that's a different topic). In any case, I'd suggest taking the PVP-specific discussion to haskell/pvp#30, where we're trying to address this minor oversight.

hvr commented 4 years ago

In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio

I think this is the most dangerous one. An error may crash someones backend. We don't know that. Just imagine IsString used in the context of foreign input (from a user, another http server). Now you have a DoS?

How realistic is this scenario? For one, you typically have proper exception handling in place for critical components which take untrusted input -- and we're specifically talking about IsString, which is primarily intended for string literals; so you'd have to assume that somebody had been using code that was incorrect to begin with and silently truncated code points of a string literal. Or you might be considering the unintended use of IsString for non-static input from untrusted sources -- but then again, you'd have to assume that such code would be doing input validation anyway. I feel like you're trying to come up with pathological scenarios which may be technically correct but are not very common in practice. At least I can't think of code of mine written over the years whose components would crash and burn without deserving it if non-ISO-8859-1 ByteString literals started triggering runtime exceptions. That being said, I'd find ways to cope if the IsString instance went missing (and without an annoying TypeError-fake-instance in its place), resulting in some busywork finding ways to express the previously concise ByteString literals in less verbose forms; I just fail to see whether this is worth the cost (especially since I personally can't remember the last time I encountered this class of error).

PS: There's still the option of leaving the IsString instance as-is and accept the truncating behaviour legitimate if for the sake of having a total function (at the cost of being non-injective). It's not like this is a "bug" of IsString ByteString; it's a documented behaviour that's been in place for a very long-time; and yes, it may bite people; yes it's not aligned with the Builder instance (but you can also argue that the Builder instance having been added later is at fault here for diverging); but you can just as well justify the current semantics being legitimate for a specific tradeoff.

The whole point of the discussion at hand is to gauge the cost/benefit ratios of a change from the current status quo, to inform whether a change is worth doing; and most importantly, all choices are merely tradeoffs; there doesn't appear to me to be one variant which is absolutely superior to the others.

Otoh, I partly blame the fact that GHC and the core Haskell standard didn't anticipate the need for richer support for annotating the purpose/subtype of string literals (but then again, the Haskell Report considered a "String" merely an alias for [Char]); other languages I've worked with have some kind of charset annotation (and, most importantly, without requiring the heavy machinery of something like QQ/TH for something so basic); even Python, whose type system is in a totally different corner of the design space, has modifiers to annotate and distinguish regex, bytestring and text literals...

parsonsmatt commented 4 years ago

I've been bitten by changing instances before. The last major release of time changed the instance Read UTCTime silently (accidentally?) to require the trailing time zone. This broke a decent amount of code across the internet with *** Prelude.read: no parse exceptions, which are a bunch of fun to debug.

Personally, I really want the instance IsString ByteString to do UTF-8 decoding. That's what I initially expected, and it seems like the expectation of many other people as well. It is possible that people are relying on the current behavior somehow (accidentally?), but I'd suspect that more people have silent/hidden bugs based on this truncation. That belief is based on the fact that the bug that inspired this issue was in production for quite some time before it was eventually discovered.

A runtime exception with an informative error message would be surprising, but at least it should point to exactly where you need to fix your code. I usually expect to find some weird behavior when I do a major version upgrade (typically a new stackage resolver), and a runtime exception like this would be satisfactory. Especially if that runtime exception noted that a future version of the library would change the behavior to use UTF-8 encoding instead of truncation, and provided a pointer to functions that could either a) have the old truncation behavior or b) have the future encoding behavior.

Removing the instance would be costly, and I wouldn't want to do it without a TypeError that gave you a note saying "Please just use this function instead to fix this error." Unfortunately, we can't attach a warning to an instance - this might be a good GHC feature to add.

I wish there was a good way of reaching out and polling the community about their preferences on this - it seems pretty important!

phadej commented 4 years ago

Builder’s UTF8 instance is probably blaze-builder/markup legacy.

I’d hope that if we do something, it will be change to

TypeError ... => IsString ByteString/Builder

Herbert’s argument that the Latin-1 instance roundtrips the Show instance is a point in its favor. UTF-8 is somewhat arbitrary; why not UTF-16/32 LE/BE?

I’d also prefer not introducing any pure exceptions.

Fwiw, the http://hackage.haskell.org/package/overloaded-0.2/docs/Overloaded-Symbols.html instance for ByteStrings accepts only ASCII (7-bit) literals. That's somewhat accidental, a byproduct of the implementation; but it's proof that, if we really want, we can make GHC fail at compile time on "invalid" literals.


hvr commented 4 years ago

Just to throw out another bikeshedding idea: there is precedent for silently remapping invalid code points in text literals. Specifically, text/utf8-text/text-short remap invalid code points (surrogates, as documented in their IsString instance documentation) to the replacement character U+FFFD.

So instead of truncating code points beyond the ISO-8859-1 plane to whatever their low 8 bits happen to be, pick a specific byte to map invalid code points to (there are multiple potential choices here; one could e.g. pick 0xFF, which is an octet that is easy to spot and also ensures the resulting bytestring literal can't be a valid UTF-8 encoding, as such octets are not allowed to occur in a well-formed UTF-8 string).
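
The remapping idea can be sketched in a few lines (hypothetical helper name; compare the output with the truncating behaviour from the issue description):

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Sketch of the remapping idea: code points that don't fit in one byte
-- become 0xFF, an octet that can never occur in well-formed UTF-8.
c2wReplace :: Char -> Word8
c2wReplace c
  | ord c <= 0xff = fromIntegral (ord c)
  | otherwise     = 0xff

main :: IO ()
main = print (map c2wReplace "bla語")  -- [98,108,97,255]
```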

Btw, whatever we do, we're on a relatively long time-scale; bytestring has been on the major API version 0.9 between 2007 and 2012; and the last 7 years since 2012 we've been on major version 0.10; And I'd point out that this level of API stability is something I've grown to greatly appreciate for fundamental libraries such as bytestring which almost every package depends upon. So any migration scheme involving multiple major versions is not something I'd expect to play out in a reasonable timeframe. Consequently, if (say) we decide to drop IsString ByteString (with or without TypeError =>), that's gonna happen in say bytestring-0.11.0, end of story. There's really no point in any long-winded multi-stage process IMO. Also we should try to bundle up any other breaking changes that have been accumulated since 2012, to make good use of this costly major version increment.

thomasjm commented 4 years ago

after all, string literals are constant values; if you manage to evaluate them at least once (hint hint, test-coverage via hpc) as part of your CI, you'll be able to force those runtime errors. And typically you don't write that many ByteString literals anyway

Just wanted to respond to this statement as someone who was bitten by this issue in a non-string literal situation. The IsString instance is not necessarily only used for literals: in fact it is (ab)used by the Data.String.Interpolate.IsString module from interpolate to provide string interpolation. So, after obtaining a Unicode-containing ByteString through other means, I wrote

interpolatedText :: T.Text = [i|Interpolated bytes: #{bytes}|]

And this produced a garbled result by truncating the Unicode characters. A runtime exception would be really nasty here because I wouldn't expect CI/hpc to catch it. (https://github.com/sol/interpolate/issues/14)

FWIW, I agree with @parsonsmatt that the best/least-astonishing option is just to change the instance to do UTF-8 decoding. I think most of the other considerations in this thread (such as nice roundtrip properties with Show) are minor compared to this one -- language designers and core library authors need to help the rest of us avoid shooting ourselves in the foot with Unicode :). Consider for instance how Python 3 did so by adopting Unicode strings as the native string type.


Edit: another reason the status quo behavior is perplexing: I look up the documentation for Data.Char and it says "The character type Char is an enumeration whose values represent Unicode" (emphasis mine). Then I look at the definition of IsString, whose sole method is fromString :: String -> a, aka fromString :: [Char] -> a. So it's easy to see why someone would assume that the fromString instance for ByteString will convert their list of Unicode characters to some kind of valid Unicode-encoded ByteString (maybe not necessarily UTF-8, but something...).

llelf commented 4 years ago

in fact it is (ab)used by the Data.String.Interpolate.IsString module from interpolate

Unfortunately it is abused basically everywhere.

Another example is optparse-applicative which happily mangles your command-line options (optparse-applicative/issues/368, wontfix is a nice touch).

vdukhovni commented 4 years ago

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.ByteString as B

utf8bom :: B.ByteString
utf8bom = "\xEF\xBB\xBF"

main :: IO ()
main = do
    putStr "Valid: "
    print $ B.unpack utf8bom
    putStr "Wrong: "
    print $ B.unpack $ T.encodeUtf8 "\xEF\xBB\xBF"
```

Output:

```
Valid: [239,187,191]
Wrong: [195,175,194,187,194,191]
```

Users who expect bytestring's `IsString` to perform UTF-8 encoding are misguided.  A `ByteString` is an octet-string!  The only change that might be reasonable is to not truncate input characters, and instead raise an error when a literal contains characters outside the valid octet range:
```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ViewPatterns #-}

module Main (main) where

import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import qualified Data.ByteString as B
import Data.String (IsString, fromString)
import Data.Char (ord)
import Data.Coerce (coerce)
import Data.Word (Word8)
import Numeric (showHex)

newtype MyBS = MyBS B.ByteString

instance IsString MyBS where
    fromString s = coerce B.pack $ map c2w s
      where
        c2w :: Char -> Word8
        c2w (ord -> o)
              | o <= 0xff = toEnum o
              | otherwise = error $ "Illegal character in ByteString literal: "
                                    ++ show s ++ ": U+" ++ showHex o ""

utf8bom :: MyBS
utf8bom = "\xEF\xBB\xBF"

main :: IO ()
main = do
    putStr "Valid: "
    print $ B.unpack $ coerce utf8bom
    putStr "Wrong: "
    print $ B.unpack $ T.encodeUtf8 "\xEF\xBB\xBF"
```

This version will raise an exception when the input string contains characters outside the supported range.

Bottom line, I take a firm stance that encoding strings to UTF-8 to produce ByteString literals would not be correct. Any such conversions should be wrapped in suitable newtypes that represent a flavour of ByteString for which UTF-8 encoding of literals is expected/supported.

Bodigrim commented 4 years ago

People use IsString for various purposes, from pretty-printing to parsing, so I would not say that there are any invariants to be violated. At least, Data.String does not mention any.

Bringing in template-haskell machinery does not look appealing to me. Besides introducing a fragile dependency, literal bytestrings tend to be small, and even a short quasiquoter is too much noise (one would probably prefer plain old pack instead).

From bytestring point of view there is no reason to prefer UTF8 to UTF{16,32}{LE,BE}. I would rather put Builder in line with ByteString than vice versa.

On the other side, I feel it is important to address @joeyh's remark about removeDirectory "bla語". This is a potential security vulnerability for Haskell applications. IMHO it is worth adding validation, throwing errors for characters that don't fit into an octet (or, to be on the safe side, for all non-ASCII chars).

thomasjm commented 4 years ago

@vdukhovni what you're saying is inconsistent with the way Haskell String actually works. When you write a string literal in double quotes, it is first understood at the language/compiler level as unicode. This goes all the way back to the Haskell Report.

In other words, when you write a string like "\xEF\xBB\xBF", it is understood as a String consisting of 3 Unicode characters. If you want to get a BOM, the right way to write it is with the actual Unicode code point, "\xFEFF". The fact that your example above works is because you're first creating a String containing an invalid value (i.e. not a BOM at all) and then relying on the current (broken IMO) fromString instance behavior to truncate that into the desired bytes.

This kind of intricate weirdness is not what users of bytestring should have to rely upon :P. As I said above, I think encoding as UTF8 (or at least some valid UTF encoding) would bring this library into consistency with the String type and Haskell itself. (Code below illustrates this.)

{-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}

module Main (main) where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

main :: IO ()
main = do
  -- A string written this way is understood by Haskell to be 3 Unicode characters
  let s1 :: String = "\xEF\xBB\xBF"
  putStrLn ("Length of s1: " ++ (show $ length s1))
  putStr "Wrong (because it's not a single UTF BOM character!): "
  print $ B.unpack $ T.encodeUtf8 "\xEF\xBB\xBF"

  -- If you actually want a BOM, you should write it using the actual Unicode value U+FEFF
  let s2 :: String = "\xFEFF"
  putStrLn ("Length of s2: " ++ (show $ length s2))
  putStr "Right: "
  print $ B.unpack $ T.encodeUtf8 "\xFEFF"
vdukhovni commented 4 years ago

@vdukhovni what you're saying is inconsistent with the way Haskell String actually works. When you write a string literal in double quotes, it is first understood at the language/compiler level as unicode. This goes all the way back to the Haskell Report.

It is a list of (Unicode) Characters. We're discussing a way to recover an octet string (a list of Unicode code points in the range 0x00-0xFF) from a literal character string that is convenient to type. The text string to enter is the one you get from show . Data.ByteString.Char8.unpack. The expected invariant (for strings not containing out-of-range code points) is:

Prelude> :set -XOverloadedStrings
Prelude> import qualified Data.ByteString.Char8 as B
Prelude B> let s = "\xEF\xBB\xBF" :: String
Prelude B> s == Data.ByteString.Char8.unpack "\xEF\xBB\xBF"
True
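That round-trip invariant can be stated as a small executable property (a sketch; roundTrips is a name introduced here for illustration, not part of bytestring):

```haskell
import qualified Data.ByteString.Char8 as B8
import Data.String (fromString)

-- The round trip holds exactly when every character is in the 0x00-0xFF range.
roundTrips :: String -> Bool
roundTrips s = B8.unpack (fromString s) == s

main :: IO ()
main = do
  print (roundTrips "\xEF\xBB\xBF")  -- True: all code points fit in one octet
  print (roundTrips "bla\x8A9E")     -- False: U+8A9E is silently truncated
```

The second case is precisely the original report: "bla語" does not survive the trip, because the IsString instance keeps only the low byte of U+8A9E.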

In other words, when you write a string like "\xEF\xBB\xBF", it is understood as a String consisting of 3 Unicode characters.

Yes, of course, this is exactly what I intended when I wrote that string. I want to write 3 literal octets. I could have written: Data.ByteString.pack [0xef, 0xbb, 0xbf] (and might have in this case), but in other cases I may have other surrounding octets that are more naturally represented as character data.

If you want to get a BOM, the right way to write it is with the actual Unicode code point, "\xFEFF".

But I'm not trying to represent the BOM as a Unicode String, I'm trying to represent its UTF-8 octet-string encoding! I'm creating a literal ByteString, not a literal String. That's why I'm calling it "utf8bom", not "bom".

The fact that your example above works is because you're first creating a String containing an invalid value.

The string is perfectly valid, it contains three character literals, all below 0xff.

(i.e. not a BOM at all)

See above, the intended "string" is a 3 character sequence with the three characters equal to the 3 octets of the UTF-8 octets of the encoded BOM. Like Horton, I meant what I said, and I said what I meant... :-)

and then relying on the current (broken IMO) fromString instance behavior to truncate that into the desired bytes.

There's no truncation in my example. All the octets are <= 255.

Bottom line is that I really don't expect lists of Chars in a ByteString context to be magically utf8-encoded. I expect to get their numeric octets, allowing correct handling of:

foo :: ByteString
foo = "A common hex string \xDE\xAD\xBE\xEF"

and get the string literal back from show . unpack. Indeed just show on a Data.ByteString.Char8 will do exactly that, and surely IsString should be able to read back what show produced.

vdukhovni commented 4 years ago

On the other side, I feel it is important to address @joeyh's remark about removeDirectory "bla語". This is a potential security vulnerability for Haskell applications. IMHO it is worth adding validation, throwing errors for non-octet characters (or, to be on the safe side, for all non-ASCII chars).

Well, at least for literals in the code, this is not a "security issue": the bad data is not from an attacker. It is nevertheless a correctness issue, and a potential WTF surprise.

The reason it is a surprise, is that there exist interfaces for manipulating system resources that take raw bytestrings (in some unspecified encoding) and use them as identifiers.

What file a user has in mind when typing some string in some specific encoding is a rather complex question, particularly when dealing with network filesystems. It brings in complex questions about canonical forms (NFC, NFD, ...), case-sensitivity, ... It is a quagmire, and many questions in this space simply have no correct answers. What is the underlying codepage of the filesystem? Is it ISO-8859-1 or UTF-8?

About all one might hope for is that if you create a file name from a ByteString literal of "это не английский алфавит.txt", using the same name again will refer to the same file (in the same directory). Otherwise, system programming with non-ASCII filenames requires considerable care. Naive expectations that automatic UTF-8 encoding in IsString will save the day are just that: the problem is harder than it looks.

thomasjm commented 4 years ago

The expected invariant (for strings not containing out of range code points) is...

You can't just assert what the "expected invariant" is (or what this library "MUST" do, or what is "correct," in your previous comments), because that is precisely what we're debating here. There's clearly no solution here that is universally "correct," just those that are more or less confusing to new users, likely to break existing code, etc. We're trying to find the lesser evil here.

Yes, of course, this is exactly what I intended when I wrote that string. I want to write 3 literal octets.

Yes, but at the String level that's not what you're getting! When you write "\xEF" you get a String containing the Unicode character U+00EF, which can be encoded any number of ways. Yes, the current behavior of fromString means you get to pretend this is an octet. But my argument is that this is overall not the best behavior due to inconsistency with the rest of the language.

There's no truncation in my example.

You're right, my mistake. The current implementation is just mapping the code point U+00EF to the single byte 0xEF (239).
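The two interpretations being debated can be sketched side by side. The c2w below is the low-8-bits mapping that packChars performs; utf8 is a hand-rolled illustrative encoder for code points up to U+FFFF (not the text library's encoder):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- What the current IsString path does: keep only the low 8 bits.
c2w :: Char -> Word8
c2w = fromIntegral . ord

-- Hand-rolled UTF-8 encoding for code points up to U+FFFF (illustration only):
utf8 :: Char -> [Word8]
utf8 c
  | o < 0x80  = [fi o]
  | o < 0x800 = [0xC0 .|. fi (o `shiftR` 6), 0x80 .|. fi (o .&. 0x3F)]
  | otherwise = [ 0xE0 .|. fi (o `shiftR` 12)
                , 0x80 .|. fi ((o `shiftR` 6) .&. 0x3F)
                , 0x80 .|. fi (o .&. 0x3F) ]
  where
    o  = ord c
    fi = fromIntegral

main :: IO ()
main = do
  print (c2w '語')   -- 158: U+8A9E truncated to its low byte 0x9E
  print (utf8 '語')  -- [232,170,158]: the UTF-8 bytes from the original report
```

For Latin-1 characters the two agree on the single byte produced; they diverge only above U+00FF, which is exactly where the surprise in this issue arises.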

Bottom line is that I really don't expect lists of Chars in a ByteString context to be magically utf8-encoded.

I get that, and I can see the value of being able to construct strings of octets this way.

I hope you can also see the value of someone being able to write hello = "你好" :: ByteString and getting a sensible result. And also the value of consistency, where a function named fromString in a language that explicitly uses Unicode as its native string type would produce a valid Unicode result. It's 2020 after all; Unicode and I18N aren't supposed to be an afterthought in language design.

Which of these considerations is more important is of course up to the community to decide, but let's not pretend one is clearly better than the other.


Edit: doing your octet building instead with bom = B.pack [0xEF, 0xBB, 0xBF] as you mentioned is advantageous in other ways too, because GHC will catch invalid octets (even overflowed ones via -Woverflowed-literals).

vdukhovni commented 4 years ago

You can't just assert what the "expected invariant" is (or what this library "MUST" do, or what is "correct," in your previous comments), because that is precisely what we're debating here. There's clearly no solution here that is universally "correct," just those that are more or less confusing to new users, likely to break existing code, etc. We're trying to find the lesser evil here.

Sorry, you're right about the tone. That should be "the invariant I would expect", or something similar. There's no way to please everyone here. Any change from the status quo will break something. One consideration is whether the breakage is silent or highly visible.

If we stop truncating, and error out on Char values above 255, the breakage will likely be quite visible.

If we instead 'encodeUtf8', I am less confident that the problem will be found before it affects users.

Of course, for resource identifiers that are typed as ByteStrings, but are generally UTF-8 encoded strings, the more friendly behaviour is encoding. And not surprisingly that would, for example, break a couple of doctests in the DNS library.

We could of course provide a related type LiteralByteString, a newtype around ByteString which has the legacy IsString interface. One could then use literal strings to initialize these, and then extract the underlying ByteString, along the lines of the MyBS example above.
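A minimal sketch of such a wrapper (LiteralByteString and literalBytes are hypothetical names for this illustration; nothing like this exists in bytestring):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.String (IsString (..))

-- Hypothetical opt-in wrapper preserving the legacy low-8-bits literal behaviour.
newtype LiteralByteString = LiteralByteString { literalBytes :: B.ByteString }

instance IsString LiteralByteString where
    fromString = LiteralByteString . B8.pack  -- truncates each Char to an octet

main :: IO ()
main = print (B.unpack (literalBytes "\xEF\xBB\xBF"))  -- [239,187,191]
```

The point of the wrapper is that the truncating behaviour becomes an explicit choice at the type level, rather than something every ByteString literal inherits.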

thomasjm commented 4 years ago

Sure, I would just add that (as suggested earlier in this thread) there might be silent bugs out there already due to people interpreting the IsString instance the way I did. One of those bugs in my own code is what brought me to this thread in the first place. So it's possible a switch to encoding would lead to silent fixes as well as silent breakages :).

But yeah, clearly a breaking change of this magnitude involves a lot of considerations and I'm not sure what is actually best--I just wanted to get some arguments out there for the pro-encoding camp.

vdukhovni commented 4 years ago

After reading the thread again, I am definitely against removal of the instance. Which leaves imperfect choices. I can't think of any situation where truncating Unicode code-points above 255 to 8-bits produces better results than latching the value at 255 for all invalid inputs, garbage-in/garbage-out. This option is backwards-compatible for ISO-Latin (really 8-bit octets), and is perhaps more likely to be safe in practice (less rope for attackers to exploit the truncation to yield chosen "useful" outputs).

So my "vote" is either status-quo, or truncate to 255, rather than low 8-bits. Throwing runtime errors for out-of-range inputs, where none were raised before seems much more risky, and would be my least preferred choice (other than removal, which I'm not considering as an option).

@merijn Thanks for the pointer to ValidLiterals (I'd seen it once before, but it never really registered), I may have some uses for this in code that will not be a core library for the ecosystem, and so a TH dependency is not a deal-breaker.

phadej commented 4 years ago

I made a patch to GHC https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3290 which would make GHC warn about "overflowing" ByteString literals.

(Please don't comment there if it's not about implementation, let's keep the discussion here instead)

hasufell commented 4 years ago

What file a user has in mind when typing some string in some specific encoding is a rather complex question, particularly when dealing with network filesystems. It brings in complex questions about canonical forms (NFC, NFD, ...) case-sensitivity, ... It is a quagmire, and many questions in this space simply have no correct answers. What is the underlying codepage of the filesystem? Is it ISO-8859-1 or UTF-8?

The point is, when you're already on ByteString level, then all those questions are irrelevant. Nothing is to be interpreted automatically. IsString violates this.

I like @phadej's solution, but I would additionally deprecate the instance and remove it after a year. The amount of discussion here is proof that nothing about this instance is intuitive or correct.

thomasjm commented 4 years ago

The GHC patch by @phadej sounds like a step in the right direction, but I'd just point out it doesn't solve the runtime uses of fromString, in libraries like interpolate and optparse-applicative discussed above.

Also, responding to a comment on that merge request:

The truncation of characters to octets might be surprising, yet using a blessed (UTF8 or any other) encoding is not optimal. Then one couldn't represent all ByteStrings with literal.

For a point of comparison, I'm pretty sure this is exactly the situation that exists in Python (a language well-known for its pragmatism). To quote Python's documentation, UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. For example, the prototype of the string encode function in Python is str.encode(encoding="utf-8", errors="strict"). Python also seems to recommend writing out arbitrary bytestrings using lists of numbers, not literals. So I don't see why a blessed encoding is so bad.
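For what it's worth, the UTF-8-encoding behaviour being argued for is already available via text's encodeUtf8, which reproduces the byte sequence from the first line of the original report:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as T

-- What a hypothetical UTF-8-encoding fromString would produce for "bla語":
main :: IO ()
main = print (B.unpack (T.encodeUtf8 (T.pack "bla語")))  -- [98,108,97,232,170,158]
```

So the pro-encoding proposal amounts to making the IsString instance agree with encodeUtf8 . T.pack, at the cost of no longer being able to spell arbitrary octet strings as literals.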

merijn commented 4 years ago

Bringing in template-haskell machinery does not look appealing to me. Besides introducing a fragile dependency, literal bytestrings tend to be small, and even a short quasiquoter is too much noise (one would probably prefer old plain pack instead).

So, I actually brought up the issue of "partial" instances of the various OverloadedX classes and how it would be nice to have the conversions be String -> Either Error a or similar and having GHC check those conversions at compile time, probably somewhere in 2013. At the time this idea was shot down as "unnecessary and complex to implement" and that there seemed to be no demand for that and people should just assume no one misuses those classes.

I wrote a proof-of-concept TH library at the time that does exactly this (and has an example of enforcing ASCII in bytestrings!). I'm not quite happy with the current API (requires too much manual annotation), so maybe I'll revisit it sometime Soon (TM).

I still think this functionality should just be inside GHC, but my last attempt to convince people failed. So if anyone wants to take up this crusade feel free to point at the POC implementation :)

phadej commented 4 years ago

@thomasjm

fromInteger 123132 :: Word8 also truncates. Whether this is desirable or not depends.

Concretely: should an overflowing value be truncated to maxBound, or truncated to its low byte? Both behaviours are useful.
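The two numeric behaviours contrasted here can be sketched side by side (wrap and clamp are illustrative names introduced for this example):

```haskell
import Data.Word (Word8)

-- Wrapping: what fromInteger does today; keep the value modulo 256.
wrap :: Integer -> Word8
wrap = fromIntegral

-- Clamping: the alternative of saturating at maxBound.
clamp :: Integer -> Word8
clamp n = fromIntegral (min n (toInteger (maxBound :: Word8)))

main :: IO ()
main = do
  print (wrap 123132)   -- 252 (123132 mod 256)
  print (clamp 123132)  -- 255
```

Wrapping matches the current IsString truncation, while clamping corresponds to the "latch at 255" option proposed earlier in the thread.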

thomasjm commented 4 years ago

@phadej I'm not sure how to interpret this analogy. I already acknowledged above that the ability to leverage the current behavior to write octet strings is useful, ignoring other considerations. Of course, truncation on integer overflow feels like a rather different beast from silently dropping Unicode bytes, since it can happen at the hardware level and programmers are more likely to expect it. The point is that as long as the instance exists we have to make a choice.

vdukhovni commented 4 years ago

I wrote a proof-of-concept TH library at the time that does exactly this (and has an example of enforcing ASCII in bytestrings!). I'm not quite happy with the current API (requires too much manual annotation), so maybe I'll revisit it sometime Soon (TM).

FWIW, I found it reasonably appealing as-is, for a new DNS library I'm working on, I have prototyped the below:

import Control.Monad ((>=>))
import Data.Bits (toIntegralSized)
import Data.ByteString (ByteString)
import Data.Char (ord)
import qualified Data.ByteString as B
import qualified Language.Haskell.TH as TH
import qualified Language.Haskell.TH.Syntax as TH
import ValidLiterals (Validate (..), valid)

-- parseDomain :: ByteString -> Maybe Domain is defined elsewhere in the library.

newtype Domain = Domain
    { getDomain :: ByteString } deriving Eq

instance Validate String Domain where
    fromLiteral = safePack >=> parseDomain
      where
        safePack :: String -> Maybe ByteString
        safePack = fmap B.pack . traverse c2w
        c2w = toIntegralSized . ord
    liftResult _ (Domain v) = TH.unsafeTExpCoerce $
        TH.AppE <$> [| Domain . B.pack |] <*> TH.lift (B.unpack v)

mkDomain :: String -> TH.Q (TH.TExp Domain)
mkDomain = valid

Which makes available $$(mkDomain "foo.example.org") for writing literal DNS names, which are validated at compile time to match the syntax of a DNS name in presentation form. The input string literal is first mapped to a ByteString via a validating c2w that rejects non-octet values, and this is then run through my ByteString -> Domain parser that decodes valid DNS presentation forms (decimal escapes, non-decimal escapes, label length limits, ...). Yes, the user-visible syntax is more heavyweight than just "foo.example.org", but compile-time safety, avoiding fromJust, etc., is likely worth the price.

[ I cargo-culted the definition of liftResult, without really knowing why I need unsafeTExpCoerce, but it looks reasonably plausible, even if I couldn't presently come up with it myself... ]

merijn commented 4 years ago

@vdukhovni that's because [| .. |] is a regular (untyped) TH quote. So you have to unsafely coerce the untyped TH expression into a typed one. You probably want to use a typed TH quotation [|| .. ||] (note the double bar), which should eliminate the need for the coercion.

parsonsmatt commented 4 years ago

I made a patch to GHC https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3290 which would make GHC warn about "overflowing" ByteString literals.

(Please don't comment there if it's not about implementation, let's keep the discussion here instead)

This is brilliant! I love it. Thank you!