haskell / cabal

Official upstream development repository for Cabal and cabal-install
https://haskell.org/cabal
Other
1.62k stars 697 forks source link

Cabal's use of GHC.Fingerprint #7398

Open Mistuke opened 3 years ago

Mistuke commented 3 years ago

Hi,

We are looking to change the fingerprint hash function in GHC as md5 is quite slow[1].

However it seems that Cabal has used this module internally to implement Backpack's fingerprinting and elsewhere with the assumption that the algorithm is MD5.

So I am trying to get an idea of what the breakage would be if we change the algorithm and how we can mitigate the problem.

[1] https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4248

phadej commented 3 years ago

Cabal is also using md5 to implement quirky patching of invalid .cabal files which are on Hackage.

https://hackage.haskell.org/package/Cabal-3.4.0.0/docs/Distribution-PackageDescription-Quirks.html

There which matters is only a stability of the hash, it can be changed.

I.e. if you change Fingerprint algorithm the cabal-install (or stack using Cabal ...) built with that base/GHC will fail to parse Hackage index.


Also there is Structured setup to add a hash of structure to Binary encoded caches, so we don't decode incompatible Cabal stuff - there hash algorithm can be changed as well.


For Backpack, I think it's enough to have strong enough hash function, the exact algorithm is not required to be stable (i.e. it doesn't leak "outside" of Cabal). AFAICS it only used to generate component ids. They do change as you observe.


In conclusion, I think if GHC.Fingerprint is removed from base, Cabal will be fine. We "just" need to have some hash function, preferably returning something like data Digest = Digest !Word64 !Word64 (i.e. not ByteString).

I wonder how difficult it is to implement MD5 in Haskell, as the Cabal use cases are not time critical. (the parsing quirks kind of is, but I think it will be easier for now if Cabal "default" configuration doesn't have c-sources).

TL;DR Fingerprint is used as it was there. Adding dependency on something like cryptohash-sha256 in Cabal is dependency nightmare (also I don't like cryptohash-sha256 interface returning ByteString as a digest).

phadej commented 3 years ago

There is https://hackage.haskell.org/package/pureMD5-2.1.3 which can be vendored into Cabal (after stripping unneeded dependencies / features). I think t should be the quickest way forward. (Also it's not the first library to be vendored into Cabal).

Mistuke commented 3 years ago

@phadej thanks for the detailed response!

In conclusion, I think if GHC.Fingerprint is removed from base, Cabal will be fine. We "just" need to have some hash function, preferably returning something like data Digest = Digest !Word64 !Word64 (i.e. not ByteString).

So we can definitely still provide this, but it may not make sense to break one dependency to reintroduce one. We are also evaluating re-exporting from Base but that will be a bit messy.

I wonder how difficult it is to implement MD5 in Haskell, as the Cabal use cases are time critical. (the parsing quirks kind of is, but I think it will be easier for now if Cabal "default" configuration doesn't have c-sources).

The implementation in Base is actually mostly C with GHC.Fingerprint being just a wrapper around it. So you could technically just copy it from Base. It has no extra dependencies.

That said one of the main reasons for us moving away from MD5 is because throughput is becoming increasingly important. We're trying to switch to xxHash to improve responsiveness and startup time of the RTS.

phadej commented 3 years ago

I'd prefer Cabal to continue being pure Haskell library, without own cbits. One thing to not worry about while maintaining Cabal (e.g. it can be built with GHCJS if someone crazy does e.g. web based .cabal file formatter!)

EDIT: jokes aside, GHCJS does compile Setup.hses, so not using cbits is justified.

phadej commented 3 years ago

To be clear: I think it would be more future proof for Cabal have own hash function implementation, as now base won't have one stable across different versions.

If GHC.Fingerpint is still some hash function which could change again, it's bad. If there were GHC.XXHASH3_128, that would work (and Cabal could have shim for older versions of base). But maybe vendoring pureMD5 (or something else, which has relatively fast, pure Haskell implementation).

Mistuke commented 3 years ago

EDIT: jokes aside, GHCJS does compile Setup.hses, so not using cbits is justified.

I don't know about the implementation, but I assume then they can't be using Base, as that has a lot of C. Curious how they provide GHC.Fingerprint today.. (wondering if this would affect them too)

To be clear: I think it would be more future proof for Cabal have own hash function implementation, as now base won't have one stable across different versions.

This makes the most sense to me. This prevents another future breakage.

If GHC.Fingerpint is still some hash function which could change again, it's bad.

I fully expect it to change again. This is the second time we're changing the one in the RTS (but first for base), as hash functions improve we'd like to change it to benefit.

The fact that Base actually provided this is causing us an ABI issue. It technically should have been part of the GHC package.

phadej commented 3 years ago

GHCJS has shims for some c-symbols: https://github.com/ghcjs/ghcjs/tree/ghc-8.6/lib/boot/shims, in particular https://github.com/ghcjs/ghcjs/blob/ghc-8.6/lib/boot/shims/src/md5.js gives a good hint.

fgaz commented 3 years ago

Since the only algorithm required to be stable is the quirks one, wouldn't it be better to ask the hackage team to rewrite the index once and for all?

phadej commented 3 years ago

@fgaz it's not so easy. index is truly append only, and hackage-security enforces that. AFAICS the consequence is comparable to starting Hackage a fresh.

ulysses4ever commented 2 years ago

@Mistuke the GHC MR you mentioned seems to have been closed without merging. Not sure what's happened there but is this Cabal ticket something we still want to have open in your opinion?

Mistuke commented 2 years ago

Hi,

Yes it's still good to break the cabal dependency on this module.

Md5 is really slow so would be good for GHC to replace it.

Unfortunately my last attempt died a horrible bureaucratic death due to people bikeshedding their personal favorite algorithm.

I won't be trying again, but hopefully someone else will.

On Sat, Jul 16, 2022, 06:59 Artem Pelenitsyn @.***> wrote:

@Mistuke https://github.com/Mistuke the GHC MR you mentioned seems to have been closed without merging. Not sure what's happened there but is this Cabal ticket something we still want to have open in your opinion?

— Reply to this email directly, view it on GitHub https://github.com/haskell/cabal/issues/7398#issuecomment-1186049705, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI7OKPNV6UHYQSX6FGIAKDVUIF6BANCNFSM45BAYUKA . You are receiving this because you were mentioned.Message ID: @.***>

andreabedini commented 2 years ago

I won't be trying again, but hopefully someone else will.

:disappointed: :heart: