haskell / cabal

Official upstream development repository for Cabal and cabal-install
https://haskell.org/cabal
Other
1.63k stars 697 forks source link

Use Text instead of ShortByteString in ShortText? #8256

Open Bodigrim opened 2 years ago

Bodigrim commented 2 years ago

https://github.com/haskell/cabal/blob/0a2e68cbde9ad8dafd67c18511e7117956060f9b/Cabal-syntax/src/Distribution/Utils/ShortText.hs#L95-L96

Most of Cabal textual data (PackageName, ModuleName, etc.) are newtypes over ShortText. This provides for compact storage, but almost any operation on them requires conversion to String and back. Indeed, working with Unicode in ShortByteString directly is painful.

Could we replace ShortByteString by Text in ShortText?

newtype ShortText = ST { unST :: Data.Text.Text }

This would simplify boilerplate in Distribution.Utils.ShortText and it would be trivial to expose more utilities to avoid ubiquitous conversions to String. Potentially we can also parse directly into Text (instead of parsing ByteString and converting later).

Note that Cabal already depends on text via parsec, so this move does not change dependency graph.

(Another option is to use text-short, but this one is slightly less conservative. If the proposal is implemented, it would be trivial to swap text and text-short, as they provide almost the same API)

CC @phadej

phadej commented 2 years ago

text-2.0 is probably fine, as it's the same ByteArray# underneath. Older text with UTF-16 is unnecessary conversion, but OTOH using newer Cabal on older GHCs is somewhat special case (custom build-type, hackage-server - there memory usage matters as all data is in-memory, ...).

Yet, for @Bodigrim and others, I'm no longer a maintainer of (nor contributor to) Cabal, and there are plenty of people on the team, so I'd value if you don't ping me, especially as the first person.

Bodigrim commented 2 years ago

Even text-1.2 with UTF-16 is better than conversions to String here and there. But yes, I suppose cabal > 3.8 will be built with text-2.0, so we can have both.

phadej commented 2 years ago

It's also easy to add conversion from Cabals ShortText to Text (in text-2.0) without breaking any API, and then let users like hackage-server and ghc-pkg adapt first. cabal-install doesn't really care about ShortText fields (except to print info, which barely anyone uses).

phadej commented 2 years ago

https://github.com/haskell/hackage-server/issues/600#issuecomment-315625369 is the hackage-server comment I remembered. While changing from UTF-8 to UTF-16 won't be a big change (making e.g. each PackageName take twice as much space), I'd put the burden of showing that it's fine to the original proposer... Or migrating hackage-server to use text-2.0 first. (Though still each Text takes 16 bytes more space for each PackageName etc, which matters for these plentiful and short text strings).

Bodigrim commented 2 years ago

I experimented with this idea. Two conclusions:

  1. We should rather keep ShortByteString as is. It's indeed painful to implement something like a Unicode-aware uncons :: ShortByteString -> Maybe (Char, ShortByteString), but for Cabal purposes it's enough to have unconsAscii :: ShortByteString -> Maybe (Word8, ShortByteString) and similar for other utilities. If we could set bytestring >= 0.11.3.1, we'd have access to a wide range of ShortByteString routines.

  2. All package / module / unit names end up being converted to a type FilePath = String. To avoid useless conversions, we must migrate to filepath-1.4.100.0 first.

Altogether, I think this ticket is on hold until GHC 9.6.

Kleidukos commented 2 years ago

Thanks for the update!

phadej commented 2 years ago

If we could set bytestring >= 0.11.3.1 Altogether, I think this ticket is on hold until GHC 9.6.

No. Five years after that (emphasis mine)

Our GHC support window is five years for the Cabal library and three years for cabal-install: that is, the Cabal library must be buildable out-of-the-box with the dependencies that shipped with GHC for at least five years.

Bodigrim commented 2 years ago

I'm afraid Cabal already violates this policy: https://github.com/haskell/cabal/blob/207d4ee08b929ae71ae2c746fe6da660bc129f05/Cabal/Cabal.cabal#L51

phadej commented 2 years ago

That is fixing a (nasty) bug. Depending on newer bytestring can wait.

Bodigrim commented 2 years ago

bytestring is tangential. It would be nice to reuse bytestring-0.11.3.1, but it's not too much to copy-paste necessary functions. What is really important is to switch to filepath-1.4.100.0, but we need to accumulate more community-wide experience with a new interface first, which is not happening before GHC 9.6 is released.

I'm not saying (and I'm in no position to say) that we should pull the trigger immediately after GHC 9.6, it's simply that any further substantial progress requires it.

phadej commented 2 years ago

Relatedly: The FilePaths in BuildInfo etc data-structures respresenting .cabal files are not PosixPaths nor WindowsPaths but some common denominator (more closely to PosixPath though), and preparation to that can be started already today.

(In fact, that what I tried to with SymbolicPath stuff).

Mikolaj commented 2 years ago

I'm afraid Cabal already violates this policy:

Does it? This is under if impl(ghc >=8.2).

Mikolaj commented 2 years ago

Unless you understand "buildable out-of-the-box with the dependencies" as in, literally, using all those deps? Isn't this about overriding without conflicts and crashes?