haskell / cabal

Official upstream development repository for Cabal and cabal-install
https://haskell.org/cabal

Only store compressed index, don't store 01-index.tar #2707

Open amigalemming opened 9 years ago

amigalemming commented 9 years ago

EDIT: Update 6 years later. Since version 2.0, cabal uses the new hackage-security protocol and stores the new 01-index, but it still stores both the compressed and the uncompressed version. As of October 2023, the numbers are:

❯ ls -lh ~/.cabal/packages/hackage.haskell.org/01-index.tar*
-rw-r--r-- 1 andrea andrea 835M Oct  6 14:53 /home/andrea/.cabal/packages/hackage.haskell.org/01-index.tar
-rw-r--r-- 1 andrea andrea 112M Oct  6 14:14 /home/andrea/.cabal/packages/hackage.haskell.org/01-index.tar.gz
-rw-r--r-- 1 andrea andrea 5.0M Oct  6 14:14 /home/andrea/.cabal/packages/hackage.haskell.org/01-index.tar.idx

Original post follows:

My packages/hackage.haskell.org/00-index.tar consumes 200 MB and I wonder whether it is necessary to store it uncompressed. Cabal already stores 00-index.tar.gz, too, which occupies only 10 MB. Is there a need to store the uncompressed archive on disk?

amigalemming commented 7 years ago

New cabal-install-2.0 stores an additional 01-index.tar containing more than 400MB.

ezyang commented 7 years ago

A few comments:

  1. It seems like a good idea to compress the index, as long as it doesn't appreciably affect how long dependency solving takes. It would be a good thing to try out and test. (Maybe I can turn this into a newcomer ticket!)

  2. You can delete 00-index.tar; it is only necessary if you are using an old version of Cabal; 2.0 solely reads and writes 01-index.tar.

phadej commented 7 years ago

@hvr @dcoutts please correct me:

AFAIK we have 01-index.tar.idx, which is an index file, so we can seek into the big 01-index.tar. We cannot reasonably seek into the compressed variant.

I'm not competent enough to say what the right way is to build a seekable yet compressed store (compress in blocks, compress individual files with a precomputed dictionary, something else?).

23Skidoo commented 7 years ago

Yep, the .idx file stores offsets to where the package descriptions are in the tar file; when reading the package index, we just go through those offsets one by one and create a lazy I/O thunk for each one that loads the corresponding package description when evaluated: https://github.com/haskell/cabal/blob/master/cabal-install/Distribution/Client/IndexUtils.hs#L704
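
For illustration only, here is a minimal sketch of that offset-based reading scheme (the helper names are made up; the real code is in the IndexUtils module linked above). Given a byte offset taken from the .idx file, it seeks into the uncompressed 01-index.tar, parses the 512-byte tar header at that offset, and reads the entry; wrapping each read in unsafeInterleaveIO gives lazy thunks like the ones described.

import qualified Data.ByteString as BS
import System.IO (IOMode (ReadMode), SeekMode (AbsoluteSeek), hSeek, withBinaryFile)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical sketch: read the raw bytes of one index entry whose tar
-- header starts at the given byte offset in the uncompressed 01-index.tar.
-- Bytes 124..135 of a tar header hold the entry size as an octal string.
readEntryAt :: FilePath -> Integer -> IO BS.ByteString
readEntryAt indexTar offset =
  withBinaryFile indexTar ReadMode $ \h -> do
    hSeek h AbsoluteSeek offset
    header <- BS.hGet h 512
    let size = parseOctal (BS.take 12 (BS.drop 124 header))
    BS.hGet h size
  where
    parseOctal = BS.foldl' step 0 . BS.takeWhile (\w -> w >= 0x30 && w <= 0x37)
    step acc w = acc * 8 + fromIntegral (w - 0x30)

-- Turn every offset from the .idx file into a lazy thunk, so an entry is
-- only read (and parsed) when the solver actually demands it.
lazyEntries :: FilePath -> [Integer] -> IO [BS.ByteString]
lazyEntries indexTar = mapM (unsafeInterleaveIO . readEntryAt indexTar)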

23Skidoo commented 7 years ago

I googled a bit and found this: https://github.com/madler/zlib/blob/master/examples/zran.c

hvr commented 7 years ago

Fwiw, I wanted to switch to a more efficient compression format once the index size becomes unbearably large. I had .xz in mind: it is significantly more efficient than .gz at compressing the 01-index.tar file, yet fast enough at decompressing, and it has some built-in level of support for random access (though I'm not sure whether that's a good enough fit for our use case).

The other plan I had (which could be combined with the .xz one) was to maintain a cache of the data relevant for dependency solving in a compact, preparsed format. It would need to support efficient incremental updates, so as not to make cabal update too expensive, and be indexed by package name -- I've got a couple of ideas for that.

dcoutts commented 7 years ago

It is indeed possible with some zlib cunning to maintain a set of seek points into a compressed .gzip file. It's just complicated. Eventually it may become worth it.

More useful, imho, would be maintaining a cache in binary form of all the relevant bits we need, and possibly trying to compress that in a block style so it's easier to do random access.
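
As a rough illustration of the block-style idea (this is not anything cabal-install does; it only assumes Codec.Compression.GZip from the zlib package): compress the data in fixed-size blocks, so a read at a given uncompressed offset only has to decompress the block(s) covering that range, not the whole archive.

import qualified Codec.Compression.GZip as GZip
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)

blockSize :: Int64
blockSize = 64 * 1024  -- 64 KiB of uncompressed data per block

-- Split the payload into fixed-size chunks and gzip each one separately.
-- In a real store the compressed blocks would live in a single file, with
-- a small index mapping block number to its offset in that file.
compressBlocks :: BL.ByteString -> [BL.ByteString]
compressBlocks payload
  | BL.null payload = []
  | otherwise =
      let (chunk, rest) = BL.splitAt blockSize payload
      in GZip.compress chunk : compressBlocks rest

-- Random access: to read `len` bytes at an uncompressed offset, start
-- decompressing at the block containing that offset; laziness means only
-- the blocks actually demanded by `take` get decompressed.
readAt :: [BL.ByteString] -> Int64 -> Int64 -> BL.ByteString
readAt blocks offset len =
  let firstBlock = fromIntegral (offset `div` blockSize)
      within     = offset `mod` blockSize
      stream     = BL.concat (map GZip.decompress (drop firstBlock blocks))
  in BL.take len (BL.drop within stream)

The usual tradeoff applies: smaller blocks mean less wasted decompression per lookup but a worse compression ratio and a bigger block index.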

amigalemming commented 7 years ago

On Thu, 24 Aug 2017, Duncan Coutts wrote:

It is indeed possible with some zlib cunning to maintain a set of seek points into a compressed .gzip file. It's just complicated. Eventually it may become worth it.

Would it help to split the TAR archive into multiple ones?

dcoutts commented 7 years ago

Would it help to split the TAR archive into multiple ones?

The tar file is optimised for download size and performance, and also for security simplicity. Having a single append-only file there is a nice local optimum.

Splitting into large tarball chunks might help with CDN/proxy caching for those proxies that cannot cache HTTP range gets, since it'd reduce the size of the chunk that is growing.

But for local random access while keeping things compressed, it's a tradeoff between size and speed. The smaller the blocks (or separate files, or gzip seek points), the faster it is to seek, since on average there is a seek point closer to the data you're after, but the more space the extra blocks take. And doing it at all increases complexity, and we've only got a certain complexity budget.
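
To put rough numbers on that tradeoff: the zran approach linked above keeps a 32 KiB decompression window per seek point, so over the roughly 800 MiB tar mentioned in the updated issue description, a seek point every 1 MiB costs about 800 windows (~25 MiB of extra state) with at most 1 MiB to decompress per lookup, while a seek point every 16 MiB costs only ~1.6 MiB of state but means up to 16 MiB decompressed per lookup.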

amigalemming commented 6 years ago

Just found an interesting program named pixz. It promises: "pixz compresses and decompresses files using multiple processors. If the input looks like a tar(1) archive, it also creates an index of all the files in the archive. This allows the extraction of only a small segment of the tarball, without needing to decompress the entire archive." Maybe this could help to get both compression and fast random access.

amigalemming commented 5 years ago

I got to know about the new zstd compressor: https://en.wikipedia.org/wiki/Zstd One of its features is user-defined dictionaries. With this feature, cabal-install could compress each package's .cabal file individually and store them all in a tar archive. The dictionary could be generated from the current Hackage database and would be stored in addition to the 00-index.tar.
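
A rough sketch of that scheme, assuming the Haskell zstd bindings on Hackage (the zstd package) expose dictionary training and per-entry compression roughly as below -- treat the exact function names as an assumption; this is only an illustration of the idea, not a proposal for the actual index format.

import qualified Codec.Compression.Zstd as Zstd
import qualified Data.ByteString as BS

-- Assumed API from the zstd package: trainFromSamples, compressUsingDict,
-- decompressUsingDict.  Train a shared dictionary (capped here at 112 KiB)
-- on a sample of existing .cabal files from the index.
trainDict :: [BS.ByteString] -> Either String Zstd.Dict
trainDict cabalFiles = Zstd.trainFromSamples (112 * 1024) cabalFiles

-- Each .cabal file is compressed on its own against the shared dictionary,
-- so any single entry can be decompressed without touching the others.
packEntry :: Zstd.Dict -> BS.ByteString -> BS.ByteString
packEntry dict = Zstd.compressUsingDict dict 19   -- compression level 19

unpackEntry :: Zstd.Dict -> BS.ByteString -> Maybe BS.ByteString
unpackEntry dict bytes =
  case Zstd.decompressUsingDict dict bytes of
    Zstd.Decompress plain -> Just plain
    _                     -> Nothing

The trained dictionary would then be shipped next to the index, as suggested above.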

tumagonx commented 4 years ago

I found this very frustrating for users in areas with bandwidth-starved internet infrastructure. I had to use a cloud converter to turn 01-index.tar.gz into 01-index.tar.7z just to be able to download it successfully at a third of the size (vs. the current 83 MB). Yet I can't stop cabal install from asking for 01-index.tar.idx. How can I manually generate the index file?

Sheesh... every new language is like this: golang, ocaml/opam, rust/cargo... I'm giving up on these "bandwidth taken for granted" package/library managers and going back to C/C++, where I can control everything.

amigalemming commented 4 years ago

I found this very frustrating for users in areas with bandwidth-starved internet infrastructure.

This ticket is about storing the uncompressed index file. The index is downloaded with compression, though of course it could use a more efficient compression scheme. I guess that using a more efficient compressor would be much simpler than maintaining a compressed index. (Btw. my stored 01-index.tar is 655MB today.)

tumagonx commented 4 years ago

Yeah, sorry about being off-topic. I've been struggling for two days just to get past cabal update and can't find a way to feed it this hard-earned 01-index.tar.gz manually. Cabal 2 syncs, checks, and timestamps every single step of downloading those *.json files and the tarball. I'm even thinking of faking hackage.haskell.org with a local DNS server.

I wonder what happens once the tarball exceeds 100 MB compressed and 1 GB uncompressed; even Wikipedia's ZIM format is indexable and LZMA-based.

Janfel commented 3 years ago

Hey, can somebody please implement a fix for this, or at least give us an option to not store the index locally? My 01-index.tar is at 700MiB already, which is way more than I am comfortable with. There are entire operating systems that take up less space than this index file, e.g. https://puppylinux.com/.

tomjaguarpaw commented 3 years ago

It would be good to understand what people's use cases for this feature would be.

Is it to save bandwidth or download time? As per https://github.com/haskell/cabal/issues/2707 compression is already applied. How much more compression could be possible?

Is it to save disk space? That's understandable, but even in a cloud environment 1 GB costs about $1 per year.

Are there other benefits we could get from this feature?

liskin commented 3 years ago

Is it to save disk space? That's understandable, but even in a cloud environment 1 GB costs about $1 per year.

Note that some providers have different pricing models, for example cheap (or entirely free and unlimited) traffic, but more expensive storage. And then there are greybeards (or green?) like me who don't want to relearn the IP address, don't remember who needs to be contacted to reconfigure the slave DNS servers, and are thus reluctant to switch to a cheaper host. :-)

Are there other benefits we could get from this feature?

GitHub Actions' cache (and possibly other CI services' caches) having a 5 GB storage limit comes to mind. When that becomes a problem, one might remove the uncompressed tar manually before the CI job ends, but it would be beneficial if this wasn't a problem in the first place.

We came close to that limit when we cached the entire ~/.cabal and ~/.stack directories for every test matrix combination. Now we install GHC via apt from the PPA and cache ~/.stack/pantry just once for the entire matrix, but again, it would be easier if these tools didn't carelessly waste space in the first place.

phadej commented 3 years ago

GitHub Actions' cache (and possibly other CI services' caches) having a 5 GB storage limit comes to mind. When that becomes a problem, one might remove the uncompressed tar manually before the CI job ends, but it would be beneficial if this wasn't a problem in the first place.

For example, haskell-ci doesn't cache the index on GHA. Fetching it from the Hackage servers (there's a CDN in front!) or from GHA's cache is about the same; it comes over the network anyway.

Also, caching the whole ~/.cabal is silly. You only need to cache ~/.cabal/store.

That said, caching ~/.cabal/store for each repository in isolation is wasteful. It's technically feasible to have a global binary cache for GHA (as the environment is standard), but that would need someone to give $$$ to host storage (and initial development), as we couldn't use GHA's "free" cache then.

So TL;DR, the problem here is that we try to use stuff which is free, which is a local optimum, but it's really not the global one.

liskin commented 3 years ago

Fetching it from the Hackage servers (there's a CDN in front!) or from GHA's cache is about the same; it comes over the network anyway.

Not really. Restoring ~/.stack/pantry from GHA cache takes 8s, whereas the cabal v2-update step of haskell-ci takes 26s. (https://github.com/xmonad/xmonad/runs/2446983366 vs https://github.com/xmonad/xmonad/runs/2446983356)

It's just a 20s difference, but our stack-based builds usually take a little over one minute, so it's not entirely insignificant. :-)

Totally agree with the rest of your comment. I recently learned that using Nix and Cachix might already solve that problem, but I haven't managed to find time to learn about Nix yet.

phadej commented 3 years ago

It's just a 20s difference, but our stack-based builds usually take a little over one minute, so it's not entirely insignificant. :-)

I should comment that ~one-minute builds are an outlier. Take e.g. unordered-containers: running its tests

cabal test --enable-tests -j1  28,57s user 2,20s system 100% cpu 30,686 total

With more cores you can do better

cabal test --enable-tests  37,41s user 3,03s system 523% cpu 7,718 total

But the GHA environment is two cores, so you don't get much of a speedup (and only if you have multiple test suites, or they use multiple cores).

I.e. I'm very suspicious when test suites are too quick: do they really test anything?


Also note that how the Hackage index is handled is a cabal-install issue, not a Cabal one. stack folks can do things differently: stack's model doesn't need the index to exist locally, since it can query an internet service for packages it doesn't know about (= outside of the snapshot). OTOH cabal-install cannot, as the solver needs the whole index up front to operate efficiently.

EDIT: I'm also not sure what is inside ~/.stack/pantry; does it contain an unpacked 01-index.tar, and why?

liskin commented 3 years ago

I should comment that ~one-minute builds are an outlier. Take e.g. unordered-containers: running its tests [...] I.e. I'm very suspicious when test suites are too quick: do they really test anything?

xmonad does indeed not have a comprehensive test suite. We primarily test that it still builds with the supported versions of GHC, dependencies and Stackage LTSs. There are a handful of properties and unit tests, but you're correct in assuming that the little time spent in the test suite itself is indicative of low test coverage. (But then, it's Haskell -- if it compiles, it works. 🙂)

That being said, I'm confident that a fast test suite is achievable even with tests that actually exercise most of the code base, provided fast turnaround is a real priority (and it definitely is a priority on projects where I happen to be in charge).

(We're probably going off topic with this, apologies to all those who follow the issue. Feel free to reach out privately to discuss this further.)

EDIT: I'm also not sure what is inside ~/.stack/pantry; does it contain an unpacked 01-index.tar, and why?

It contains a hackage subdirectory with 00-index{.tar,.tar.gz,.tar.idx} and also pantry.sqlite3, which seems to contain the same information, just in SQLite and possibly with some extra data on top. As to why, I just don't know, sorry.

amigalemming commented 3 years ago

On Thu, 13 May 2021, tomjaguarpaw wrote:

Is it to save disk space? That's understandable, but even in a cloud environment 1 GB costs about $1 per year.

For me it is to save disk space, both for builds on a dedicated root server and for builds on single board computers.

praduca commented 1 year ago

I believe compression could be beneficial in most cases, but cabal files are so tiny that the gains would be limited... I can get only about 50% compression with zstd's most aggressive settings... I will test with a dictionary soon.

On another note, I think this needs some thought. The size of the index as it is right now is not unmanageable, but if we want Haskell to get more traction (especially with the work of the Foundation now), this index could reach ridiculous sizes pretty fast...

amigalemming commented 11 months ago

If compression in Cabal is too complicated, I think the simplest solution for now is to enable transparent file system compression as in btrfs.

andreabedini commented 11 months ago

I edited the title and the original post to keep it up to date. Please let's not turn this issue into a discussion around compression algorithms.

Hackage serves both 01-index.tar and 01-index.tar.gz, and this is the constraint we have to work with.

praduca commented 11 months ago

Well, without changing the compression algorithm I think this is not feasible... Maybe this thread should then be closed and another one opened specifically about the index format and compression (if it doesn't already exist).

phadej commented 11 months ago

@andreabedini

Hackage serves both 01-index.tar and 01-index.tar.gz, and this is the constraint we have to work with.

You are probably misunderstanding something. Hackage serves the .gz file. cabal-install uses the .tar (i.e. it unzips the .gz file), because cabal-install needs random access. If there were a compression format supporting random access, then that file could be kept on disk instead of the uncompressed .tar file.

See hackage-security: https://hackage.haskell.org/package/hackage-security-0.6.2.3/docs/Hackage-Security-Server.html#t:CacheLayout

Compressed index tarball

We cache both the compressed and the uncompressed tarballs, because incremental updates happen through the compressed tarball, but reads happen through the uncompressed one (with the help of the tarball index).

But IIRC the reading scheme can be different from using the uncompressed .tar file (i.e. something using a different compression, which allows the creation of a random-access index).

andreabedini commented 11 months ago

@phadej

Hackage serves the .gz file.

Hackage serves both https://hackage.haskell.org/01-index.tar.gz and https://hackage.haskell.org/01-index.tar. It's documented in https://hackage.haskell.org/api#core.

But IIRC the reading scheme can be different from using the uncompressed .tar file (i.e. something using a different compression, which allows the creation of a random-access index).

I agree with this. cabal-install's incremental update of the Hackage index manages to work off the .tar.gz (this is actually implemented in hackage-security). IIRC the .tar.gz is updated incrementally, but the .tar and .idx files are regenerated every time.