haskell / text

Haskell library for space- and time-efficient operations over Unicode text.
http://hackage.haskell.org/package/text
BSD 2-Clause "Simplified" License
406 stars 158 forks source link

Better conversion ergonomics #316

Open Kleidukos opened 3 years ago

Kleidukos commented 3 years ago

I'm opening this issue due to @Bodigrim's advice:

The people at Kowainik (cc @vrom911, @chshersh) have produced very good ergonomics when it comes to string types conversion. Functions like toText, toLText, toString, etc. Even risky conversion is taken care of, with errors (https://hackage.haskell.org/package/relude-0.7.0.0/docs/Relude-String-Conversion.html#t:ToText).

I think that this mechanism has its place in text, since it depends on base (for String), and bytestring.

emilypi commented 3 years ago

💯 from me. Historically, these types of functions were not considered for the codebase due to the author disagreeing with them, resulting in everyone implementing them on an ad-hoc basis. But, I happen to know that now, the author has absolved himself of responsibility, and we can push ahead with doing it. I haven't put too much thought into the string conversion ergonomics beyond the most convenient function we all implement:

 tshow :: Show a => a -> Text
 tshow = pack . show

I'm all ears on making this a feature.

Kleidukos commented 3 years ago

This is very true, we have seen an abundance of ad hoc implementations of ToText.

bgamari commented 3 years ago

While I agree that something of this flavour is useful, I'm not convinced that it should produce a Text. Afterall, almost all uses of tshow in my experience are as part of a concatenation of other textual values. Moreover, an efficiently-implemented tshow would almost certainly be constructing the its result as a concatenation of individual parts.

Bodigrim commented 3 years ago

Is this issue to discuss a potential design, witnessed by relude, or to bikeshed a (mostly-)new one?

@bgamari, relude provides show :: (Show a, IsString b) => a -> b, which happily produces a Builder if you wish one.

emilypi commented 3 years ago

@bgamari interesting point, and that's true. Grepping through my codebases, the most-used utility is sshow :: (Show a, IsString b) => a -> b, and always in the context of concatenation.

Kleidukos commented 3 years ago

@Bodigrim Having used the relude conversion functions in production systems, I am very satisfied of its ergonomics and features. My ticket was to implement it as-such in the Text library.

bgamari commented 3 years ago

@Bodigrim I'll admit that I'm not a fan of such doubly-polymorphic conversion functions as they suffer from very poor inference. Of course, this is just a matter of style; I am free not to use the function if text provided it. However, I do worry about the ambiguous-type errors that beginners will encounter. In my experience, such errors have been a common sticking-point for beginners working in code written with OverloadedStrings.

Bodigrim commented 3 years ago

@bgamari this is no worse than fromIntegral. I would argue that show :: (Show a, IsString b) => a -> b is not that ambiguous, even in presence of OverloadedStrings, unless its client (logging, console, etc.) is polymorphic, which usually is not the case.

bgamari commented 3 years ago

@Kleidukos consider the case of Data.Aeson.object [ "key" := "value" ] with OverloadedStrings enabled.

Kleidukos commented 3 years ago

@bgamari Okay thanks :)

Bodigrim commented 3 years ago

I like the design of string conversions in relude. I'm not quite sure about LazyStrict, which looks a bit ad-hoc, but ConvertUtf8, ToText and ToLText seem to be natural additions to text.

I would like to hear from @chshersh and @vrom911 how do they feel about integrating their design into text.

Kleidukos commented 3 years ago

@Bodigrim LazyStrict looks actually quite cool, with the fundeps. I like this design a lot.

kozross commented 3 years ago

As someone who (unwillingly) has to use an alternative prelude with exactly such a magic conversion function, I am strongly not in favour of it. It often turns type resolution to custard, and hides disgusting amounts of allocation and copying. Driving this via some kind of lawless type class is also not my idea of a good design.

Once again, 100% agree that the ergonomics of conversion are poor. However, I don't think this is a good solution.

phadej commented 3 years ago

re LazyStrict there is https://hackage.haskell.org/package/strict-0.4.0.1/docs/Data-Strict-Classes.html which e.g. lens build upon since lens-5. (I.e. it already exists, text doesn't need to invent it)

phadej commented 3 years ago

There is also https://hackage.haskell.org/package/string-conversions-0.4.0.1/docs/Data-String-Conversions.html if you want two way conversion. Thing like show :: (Show a, IsString b) => a -> b, why would it be in text, what make text the place for it, and not, I don't know, prettyprinter with its Pretty class!

emilypi commented 3 years ago

I agree with @phadej on both counts. If we do have nice conversions, I think I would prefer them to be text-valued, as opposed to generally convertible.

Kleidukos commented 3 years ago

As someone who (unwillingly) has to use an alternative prelude with exactly such a magic conversion function

@kozross Could you tell us a bit more about the prelude you're using and the conversion mechanism you're talking about?

Kleidukos commented 3 years ago

@phadej Yes I agree with you, and this was not part of my original proposal. :)

kozross commented 3 years ago

@Kleidukos My gripe was with the extremely polymorphic show (which as you clarified, isn't part of your original proposal). The one I have to deal with is from universum, which is even worse.

I am strongly against basing conversions (or indeed, anything) on lawless type classes. That is what relude appears to do based on what you linked. This is definitely not an idea I support.

Kleidukos commented 3 years ago

Okay, right now the main blocker to adoption seems to be that the typeclasses have no laws. Can't we make them?

kozross commented 3 years ago

Not really - the whole point I'm making is that type classes are a bad vehicle for this kind of idea. The way the Relude stuff is written is basically impossible to give any laws to, as it amounts to 'I can turn this into a Text... somehow'.

I think a much better example of what we should go for istshow. Everyone keeps writing it in their codebases, and providing it would be useful, minimal, and involve no lawless anything.

chshersh commented 3 years ago

I'm not entirely sure what input is required from us. We use the ToText typeclasses with other conversion functions from relude in various projects, and we find them highly convenient. We don't experience problems with interaction between OverloadedStrings and relude conversion functions. Moreover, this extension is enabled in our all projects by default in the .cabal file.

I also don't see any problems with typeclasses without laws. We already have Show, Read (which even fails in runtime with error), FromJSON/ToJSON, FromField/ToField from postgresql-simple, Arbitrary from QuickCheck, and many more. Even if they don't have laws, typeclasses are convenient, and they help a lot of people be productive and solve their problems efficiently, even if they don't have laws.

I don't think that having or not having laws should be the blocker to introduce typeclasses. There're different trade-offs for using typeclasses vs using explicit functions. But since it's possible to have only a single instance per type, global instance coherence plays a more crucial role. When introducing a typeclass, you can ask yourself a question, "How many instances of this typeclass could be for a type?". And for me, there's only one way to convert String to Text using the ToText typeclass, so typeclasses look suitable enough.

The way the Relude stuff is written is basically impossible to give any laws to, as it amounts to 'I can turn this into a Text... somehow'.

I wouldn't be so definitive in wording by saying that it's impossible. I see at least one law for the ToText typeclass:

And this law holds for all current instances. But even without this law, I think that the ToText typeclass if reasonable.

On the technical side, I have only one concern. If the ConvertUtf8 typeclass won't make it to text, but something like ToText would, then the wording of the custom error message for the ToText instances can prevent us from reexporting ToText from relude instead of using our own typeclass. Our custom error messages mention decodeUtf8 from ConvertUtf8 but if text would refer to decodeUtf8 from text, then we can't just reexport typeclasses, at least because the implementation of decodeUtf8 is different in text and relude.

Kleidukos commented 3 years ago

According to Bodigrim, ConvertUtf8 should make it into text, so I'm not excessively worried about that.

Kleidukos commented 3 years ago

And yes, we have many lawless typeclasses in Haskell. Now it may be because typeclasses are one polymorphic abstraction that is open to be extended, but it doesn't mean that it is untestable. One law that is not algebraic but that we can promote is that the ByteString/LByteString/ShortByteString conversions are a no-no, and that, quite fortunately, is taken care of by the EncodingError system.

amesgen commented 3 years ago

The way the Relude stuff is written is basically impossible to give any laws to, as it amounts to 'I can turn this into a Text... somehow'.

I wouldn't be so definitive in wording by saying that it's impossible. I see at least one law for the ToText typeclass:

  • Idempotence: toText . toText ≡ toText

And this law holds for all current instances. But even without this law, I think that the ToText typeclass if reasonable.

While I also don't think that typeclasses have to have laws, and I really like toText from relude, this particular law will always be satisfied, as toText . toText ≡ toText @Text . toText ≡ id . toText ≡ toText due to toText @Text ≡ id.

Lysxia commented 3 years ago

I also find those three classes reasonable enough (ToText, ToLText, ConvertUtf8).

I think there can be other ways than laws to easily grasp what is going on with a type class. In this case, there are many morally equivalent representations of "text" (String, Text, LText, also Builder and ShortText), and the functions ToText/ToLText reify that equivalence for situations where you just want to normalize everything down to the same type. What matters is that these types are all "Unicode strings", so toText does essentially nothing on some semantic level, and that's why it's reasonable. (One possible blind spot is IF any such conversion also requires some kind of encoding normalization, I hope that's not the case.)

Similarly, ConvertUtf8 is intended to be the way to convert in UTF-8 between any type of "bytestring" and any type of "text". Again, the idea is that all instances implement one well-defined semantics (that of UTF-8) in some intuitive sense.

I think what would be problematic is if UTF-8 conversion may or may not be performed depending on the inferred types, but that's not what @Kleidukos proposed here.

Another low-tech approach is to expose all those conversions as monomorphic functions, but decodeUtf8FromByteStringToText doesn't necessarily make things easier to read.

kozross commented 3 years ago

The argument that 'we have lawless type classes already, what's a few more' is pretty bad, especially when the existing examples given are some of the most problematic type classes in the ecosystem (not only for their lawlessness). If you see litter, the solution is at least not to add more litter, not drive by with a dump truck. This is particularly pertinent here, given that text is a boot library.

I am in favour of simple, monomorphic conversion functions. Dragging in type classes to solve this problem is the wrong approach, especially considering that UTF-8 converter type class will absolutely shred type inference without littering your code with explicit types or @-nnotations. Yes, it means people have to write slightly longer names - so what? If that means clarity and type inference, I'm in favour. Making things short and automagic by way of lawless type classes (especially ones with multiple parameters and no fundeps) is a false economy - you're trading (arguably justified) inconvenience now for massively worse inconvenience later.

I'm not speaking theoretically either - I'm knee-deep in several codebases at my Real Job For Real Money which decided on this false economy. The number of hours of my life I've lost on utterly unproductive tasks as a result of this is so high that I cannot agree with such a design even in theory.

Lysxia commented 3 years ago

will absolutely shred type inference without littering your code with explicit types or @-nnotations. Yes, it means people have to write slightly longer names - so what?

Don't these expressions convey exactly the same information?

(decodeUtf8 :: ByteString -> Text)
decodeUtf8FromByteStringToText

You're claiming that the latter variant is better; what's the problem with the former? Is a "slightly longer name" more than an ad hoc form of type annotation?

My own argument above was explicitly not "look at those other lawless type classes", but "these classes we are discussing here may make sense even without equational laws", so I'd like to know where that specific argument goes wrong.

To be clear, I am still ambivalent about whether those type classes are a good idea. I am willing to respect your informed opinion on this, but your argument as written is "I've been bitten by lawless classes before therefore lawless classes are bad" which is a logical fallacy. Elaborating that point would make your case more convincing.

kozross commented 3 years ago

@Lysxia I think I was a bit unclear in my argument - sorry! Let me clarify.

The issue I have with the UTF-8 conversion type class proposed here isn't (entirely) to do with its lawlessness. Likewise, the thing I was bitten by wasn't (entirely) lawlessness. The far bigger problem for me has to do with the fact it's a multi-parameter type class without functional dependencies. This is a known bad idea in the community - even the author of MPTCs said as much! - and it's not nearly as simple to resolve as you suggest in your response. I've had cases where even completely known monomorphic uses of a method from such a type class still required me to @-nnotate to say what instance I needed!

This is not to say I support lawless type classes (I don't). It's clarifying that my statements above were primarily with regard to MPTCs without fundeps. My work-related anecdotes regarding lawless type classes are still bad; the ones regarding MPTCs without fundeps are orders of magnitude worse. They're hard to diagnose, the output of the compiler is confusing and misleading, and they are generally terrible on every level. Even if you accept lawless type classes as a valid solution to anything, you should reject MPTCs without fundeps - they're not worth the cost later, no matter how convenient they appear.

Hopefully this clarifies my position.

Lysxia commented 3 years ago

Thanks for clarifying. So it's more about MPTC than lawlessness, but I'm still not sure how to understand this issue you mention

I've had cases where even completely known monomorphic uses of a method from such a type class still required me to @-nnotate to say what instance I needed!

decodeUtf8 :: ByteString -> Text  -- (1)
decodeUtf8 @ByteString @Text      -- (2)
decodeUtf8FromByteStringToText    -- (3)

AFAICT (1) and (2) are completely equivalent modulo getting the order of arguments right in (2), and there is no situation where you'd need (2) where you couldn't also have used (1). Do the problems you mention involve PolyKinds or ambiguous kinds? They are the only relevant circumstances I can think of, and they do not apply here.

And why are type annotations ((1) or (2) or both) bad? You've said that but not explained it. To me, none of the options above look significantly better or worse than the others.

The only difference I see is that in (1) and (2) you can sometimes omit the type annotation, which can lead to problems when a typo leads to wrong code compiling with the wrong instance. However, I've argued above that this should not be a problem since all instances "morally" do the same thing. The kind of mistakes that class may allow is on the same level as mistyping ...FromLazyByteString... instead of ...FromByteString..., which can happen just as well.

Abusing MPTC or lawless classes may lead to disaster. I am arguing that this proposal has good properties that make it not misuse and it will not lead to disaster.

kozross commented 3 years ago

The problem doesn't lie in the 'accidentally use wrong instance' issue. It lies in that if you omit the type signatures or @-nnotations, the error you can get from the compiler, even in a completely monomorphic setting, is not in any way, shape or form clear to someone who doesn't have the mental pattern of 'if I have a fundepless MPTC, this is what resolution failure looks like'. I encountered this exact issue about a week ago; I'm a fairly experienced Haskeller, and am used to dealing with GHC gore around impredicative instantiation and type-level programming, and it took me quite a while to figure out that this was the problem, not least of all because how nonsensical the error you get is relative the situation you find yourself in.

Furthermore, I reject the idea that the instances do "morally" the same thing - depending on which conversion you are doing, you might end up doing a lot of allocations versus possibly very few. This is not a trivial concern, as this can easily lead to performance issues. Having this magically hidden behind a type class is not particularly helpful. In fact, I had to chase down just such an issue in Real Job For Real Money. I won't mention how annoying it was or how long it took - you can probably guess by now.

I believe that in the case of text, these are not trivial concerns. It is probably one of the first libraries a newcomer to Haskell is going to need to learn. Forcing people to learn about the specific heuristics for detecting MPTC-with-no-fundeps resolution errors, as well as what conversions copy and what don't when hidden behind a type class method, does not strike me as a good experience for someone learning to use this library. I'd go even further and say that even for people who aren't learning, these problems are annoying and unnecessary.

parsonsmatt commented 3 years ago

And why are type annotations ((1) or (2) or both) bad? You've said that but not explained it. To me, none of the options above look significantly better or worse than the others.

Well, to provide a type annotation to decodeUtf8, it must be wrapped in parentheses, so (decodeUtf8 :: ByteString -> Text). 34 characters, 6 of which are non-alphanumeric and require holding the shift key - this isn't ergonomic to write, and it's a lot of syntactic noise and punctuation. If you need to add a type signature after initially writing an expression, you need to insert the beginning paren, jump to the end of the word, insert the signature, and close paren. Not fun!

decodeUtf8 @ByteString @Text is better - only 28 characters, and only two special characters. No need for parentheses. Much less line noise. To insert it after the fact, you just insert the type parameters. Passing type parameters has a UX concern in that it's not clear what the source or target is. from @Text @ByteString - what exactly is happening here? You have to memorize it. And documenting type variable ordering isn't something that Haddock does a great job with yet.

decodeUtf8FromByteStringToText is 30 characters, but no spaces and no special characters. Easy to type, no syntactic noise, and the name says exactly what it does - from bytestring to text.

That's the static view. Let's take the dynamic view - what happens when things change? What about typos and other mistakes?

As kozross mentioned, you can omit the type signature. This can give bad error messages, or worse yet, silently change behavior when another part of the system changes.

Autocomplete works better with the single function name than with a type-application or type-signature based approach. I can type decodeUtf8<TAB> and it'll bring up all the options for things starting with that, and I can easily pick the term I want from the menu.

If you forget to import a value that's supposed to be used with a type application, you get a bizarre error message (or, at least, you did on GHC < 8.10, that appears to be fixed).


Now, all these UX concerns are with a two-param class: class Convert a b where .... But I've had much better luck with single type variable classes: class ConvertToText a where convertToText :: a -> Text. These have much better usability, and by picking the output type, you don't run into the 'polymorphic return' type errors. There are more type classes - one per target type - but this still reduces the API surface area by only requiring a name for each target type, vs a name for each target * each source.

However, this still runs into an issue! Morally, convertToText :: ByteString -> Text is a lie - we ought to be specifying the encoding, and then it should really return an Either ConversionError Text, not Text. But convertToText :: LazyText -> Text is totally fine. Which means we want the signature to vary based on what we're doing - we're conflating parsing with conversion. And that's a big problem.

I'm not in favor of the conversion type classes. They're OK for rapid prototyping and playing around, but I'd rather not ever see them in production code.

Lysxia commented 3 years ago

Thanks @kozross and @parsonsmatt for making your concerns more concrete.

we're conflating parsing with conversion. And that's a big problem.

I completely agree with this. Are you still opposed to such classes if they are kept separate (as in the actual proposal here)? If we forbid ourselves from using MPTC there is still the option of offering a class ConvertUtf8ToText separate from ToText.

kozross commented 3 years ago

@Lysxia Yes, I am still opposed. @parsonsmatt spelled out some concerns, which I share, along with the inherent lawlessness of such classes.

emilypi commented 3 years ago

Thanks for spelling that out, @parsonsmatt. To piggyback on Matt's ideas, I've considered writing the "moral" convenience classes in the past to help expedite non-production code like test suites. By "moral", I mean MPTC with functional dependencies, a la

tshow :: Show a => a -> Text
tshow = Data.Text.pack . show 

-- this can be more complicated and deal with conversion errors too. Just an example. 
data Utf8Error = DecodeError Text | ConversionError Text deriving Show

data Encoding  = Unicode | Utf8

class HasTextRepr (enc :: Encoding) bs t | t -> enc  where 
  type TextRepErr enc :: *

  toText ::  bs -> Either (TextRepErr enc) t
  fromText :: t -> bs

instance HasTextRepr 'Utf8 ByteString Text where 
  type TextRepErr 'Utf8 = Utf8Error 
  toText = first (DecodeError . tshow) . decodeUtf8' 
  fromText = encodeUtf8
П> toUtf8Text :: HasTextRepr 'Utf8 a Text => a -> Either Utf8Error Text; toUtf8Text = toText
П> toUtf8Text @ByteString "foobar"
Right "foobar"

But, modulo syntactic or ideological errors in the code above, this has a knock-on effect that makes it a poor choice. Namely, the fundep will propagate throughout the code, and it makes type checking/inference, though good, grind to a halt for any sufficiently complicated program. This is compounded by the fact that Text is used ubiquitously, in many disjointed places. In fact, for some very polymorphic code I was using at one point that did something like this, with enough proxying and an arguably tiny amount of code (~400 lines), I managed to get GHC to OOM in stackage: https://github.com/emilypi/base64/issues/31. This leaves alot to be desired with respect to GHC's compile story.

The morality of why MPTC should always have fundeps is illustrated in Mark Jones' paper Type class with Functional Dependencies. I don't think I would be comfortable providing a non-fundep'd MPTC in the presence of other extensions like UndecidableInstances, IncoherentInstances, or even FlexibleInstances, which can be a vector for some very nasty bugs.

In summary, we have the following situation:

danwdart commented 2 years ago

I'm definitely in favour of @emilypi 's answer, maybe also with a tread / treadMaybe / treadEither helper. But then it becomes "how many treads do we need?", so maybe we need something that looks a bit like (this won't work, but to get the gist):

class IsString s => Readable r where
    sreadThrow :: s -> r
    sreadEither :: s -> Either e r
    sreadMaybe :: s -> Maybe r

class IsString s => Showable r where
    sshow :: r -> s

and then have Text, ByteString, String etc instances for those classes, so you can just sshow / sreadThrow your data type and have it in any IsString you like.

Then whenever someone wants a new choice class, they can just have:

class (IsString s, Readable r) => MyReadable r where
    sreadChoice :: s -> Choice a r

(whichever way round the s and the r goes depending on what they want).

Kleidukos commented 2 years ago

While I agree that something of this flavour is useful, I'm not convinced that it should produce a Text. Afterall, almost all uses of tshow in my experience are as part of a concatenation of other textual values. Moreover, an efficiently-implemented tshow would almost certainly be constructing the its result as a concatenation of individual parts.

@bgamari I ended up following your advice, in the text-display library, the typeclass methods produce a Builder, and the display function just converts the Builder amalgamation to a Text. :)

There is also a detailed explanation regarding ByteStrings & Function types: https://github.com/haskell-text/text-display/blob/main/DESIGN.md#-you-should-not-try-to-display-bytestrings