ocaml / opam

opam is a source-based package manager. It supports multiple simultaneous compiler installations, flexible package constraints, and a Git-friendly development workflow.
https://opam.ocaml.org

Allow using a "compressed" (one-file) repository format for performance and sustainability purposes #5648

Open kit-ty-kate opened 1 year ago

kit-ty-kate commented 1 year ago

opam repositories currently use a "one file per package version" layout, but as the number of packages grows this creates a sustainability problem for people whose filesystems have a low number of inodes (e.g. see https://github.com/ocaml/opam/issues/5484) and a performance problem (every file has to be opened on every opam update).

I'm not set on a particular format for that file but it could be the format that opam switch export already uses. @mseri also suggested using SQLite

rjbou commented 1 year ago

Some data: today the concatenated repo gives a file of 39M and 1 235 506 lines.

kit-ty-kate commented 1 year ago

Some more data: an xz-compressed repository would be only 2M and takes 0.2s to decompress on my local machine.

avsm commented 1 year ago

One mechanism would be to use normal tar.xz files and use the OCaml libraries to parse them directly without unpacking. That has the benefit of making them easy to create, and there are performance improvements from not having a lot of small files.

rjbou commented 1 year ago

From dev meeting

We can have several formats in the opam repository itself:

The repo file can mention what the format of the repo is, but the formats can coexist in a single repo. Opam can understand all those formats (backward compatibility); for API users the change is imperceptible, as the repo-fetching functions remain the same. Opam can also try to retrieve the compressed format, then fall back on the aggregated format, then fall back on the plain directory one.
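
The fallback order described above could be sketched roughly as follows. This is only an illustration: the `format` type, `fetch_with_fallback`, and the `fetch` callback are hypothetical names and are not taken from the opam code base.

```ocaml
(* Hypothetical sketch: none of these names come from opam's code base. *)
type format = Compressed | Aggregated | Plain_directory

let format_name = function
  | Compressed -> "compressed"
  | Aggregated -> "aggregated"
  | Plain_directory -> "plain directory"

(* [fetch fmt] is an assumed callback that returns [Some data] when the
   remote serves the repository in format [fmt], and [None] otherwise. *)
let fetch_with_fallback ~(fetch : format -> 'a option) : (format * 'a) option =
  let rec try_formats = function
    | [] -> None
    | fmt :: rest ->
      (match fetch fmt with
       | Some data -> Some (fmt, data)
       | None -> try_formats rest)
  in
  (* Preferred format first, plain directory as the last resort. *)
  try_formats [ Compressed; Aggregated; Plain_directory ]

let () =
  (* A fake remote that only serves the plain directory layout. *)
  let fetch = function
    | Plain_directory -> Some "repo contents"
    | Compressed | Aggregated -> None
  in
  match fetch_with_fallback ~fetch with
  | Some (fmt, _) ->
    Printf.printf "fetched using the %s format\n" (format_name fmt)
  | None -> print_endline "no supported format available"
```

Because callers only see the result of `fetch_with_fallback`, the format that was actually used stays an internal detail, which matches the point about API users not noticing the change.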

These new formats can be served via a webserver, for example by having opam2web generate them. For the main opam repository on GitHub, one solution is to have an alternate branch that serves the aggregated file. It would be automatically updated on each merge.

c-cube commented 10 months ago

I have another suggestion: a zip file. It doesn't compress as well as .zst or .xz but it has the big advantage that it's randomly addressable, so you never need to actually unzip it. If you want foo/foo.1.2/opam you can directly get the corresponding entry without decompressing the rest of the archive.
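
To illustrate what "randomly addressable" buys, here is a minimal sketch using only the OCaml standard library. It is not the zip format and does not use camlzip: it writes entries back to back into one file and keeps a (name, offset, length) index, which is the same general idea as an archive directory that records where each entry lives. The function names are made up for this example.

```ocaml
(* Illustration only: a flat, uncompressed container plus an offset index.
   This is NOT the zip format and does not use camlzip. *)

(* Write all entries back to back and return (name, offset, length) triples. *)
let write_container path entries =
  let oc = open_out_bin path in
  let index =
    List.fold_left
      (fun acc (name, data) ->
         let offset = pos_out oc in
         output_string oc data;
         (name, offset, String.length data) :: acc)
      [] entries
  in
  close_out oc;
  List.rev index

(* Read a single entry by seeking to its offset; other entries are not read. *)
let read_entry path index name =
  match List.find_opt (fun (n, _, _) -> n = name) index with
  | None -> None
  | Some (_, offset, length) ->
    let ic = open_in_bin path in
    seek_in ic offset;
    let data = really_input_string ic length in
    close_in ic;
    Some data

let () =
  let path = Filename.temp_file "repo" ".bin" in
  let index =
    write_container path
      [ ("foo/foo.1.2/opam", "opam-version: \"2.0\"\n");
        ("bar/bar.0.1/opam", "opam-version: \"2.0\"\nsynopsis: \"bar\"\n") ]
  in
  (match read_entry path index "bar/bar.0.1/opam" with
   | Some data -> print_string data
   | None -> print_endline "not found");
  Sys.remove path
```

Looking up one package costs one seek and one read, regardless of how many other entries the file holds.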

kit-ty-kate commented 10 months ago

xz is also randomly addressable. I use this feature in https://github.com/kit-ty-kate/opam-health-check-ng using the pixz external tool.

c-cube commented 10 months ago

Oh I didn't know that!! Zip has the upside of having very mature bindings (camlzip) but xz does compress a lot better. In any case I think it'd accelerate some things a lot.

Another performance issue I've seen is that opam tends to check the state of various switches many, many times in a row.

dinosaure commented 10 months ago

I'd like to point out that we currently have an opam-mirror implementation as a unikernel that uses tar (as well as zlib/decompress) and allows randomly addressable contents. With your proposal, it would unfortunately be difficult for us to support *.xz in the immediate future (our approach would suggest a re-implementation of this format in OCaml).

I know this probably implies a regression in the compression ratio but, as @c-cube points out, zip (or even tar) has the advantage of mature support in the OCaml ecosystem (in contrast to xz).

kit-ty-kate commented 9 months ago

I had a deeper look and I think we can keep the current .tar.gz format and hijack OpamRepositoryConfig.repo_tarring to implement this feature in the simplest way I know (this is still not trivial though).

I implemented a proof-of-concept reader of opam-repository's index.tar.gz using ocaml-tar, and a fold over every file in the whole archive takes between 0.5 and 1.5 seconds depending on which checkseum backend you use (1.5 for the OCaml backend and 0.5 for the C backend). Here is the code for the curious: https://github.com/kit-ty-kate/ocaml-tar-playground/commit/0f3b31516aaa887e643f4fcd8bf382d66c58cf97
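
For readers who want a feel for what "a fold over every file of the archive" means, here is a minimal stdlib-only sketch. It is not the linked proof of concept and does not use the ocaml-tar API. It assumes the archive has already been decompressed into a string, that headers are plain ustar (no pax or GNU long-name extensions), and it skips checksum verification. `fold_tar` and `make_entry` are hypothetical helper names.

```ocaml
(* Minimal sketch, stdlib only. Assumes an already-decompressed archive held
   in memory, plain ustar headers, and no checksum verification.
   This is not the ocaml-tar API. *)

let block = 512

(* Read a NUL-terminated field of at most [len] bytes starting at [pos]. *)
let field s pos len =
  let raw = String.sub s pos len in
  match String.index_opt raw '\000' with
  | Some i -> String.sub raw 0 i
  | None -> raw

(* The size field holds an octal number, possibly padded with spaces. *)
let octal s =
  let s = String.trim s in
  if s = "" then 0 else int_of_string ("0o" ^ s)

(* Fold [f] over the (name, contents) of every regular file in the archive. *)
let fold_tar f init archive =
  let rec go acc pos =
    if pos + block > String.length archive then acc
    else
      let name = field archive pos 100 in
      if name = "" then acc (* an all-zero header marks the end of archive *)
      else
        let size = octal (field archive (pos + 124) 12) in
        let typeflag = archive.[pos + 156] in
        let data_pos = pos + block in
        (* File data is padded up to a multiple of the block size. *)
        let next = data_pos + ((size + block - 1) / block * block) in
        if typeflag = '0' || typeflag = '\000' then
          go (f acc name (String.sub archive data_pos size)) next
        else go acc next (* directories, links, etc. are skipped *)
  in
  go init 0

(* Build one header + data entry; only needed for the demo below. *)
let make_entry name data =
  let h = Bytes.make block '\000' in
  Bytes.blit_string name 0 h 0 (String.length name);
  let size = Printf.sprintf "%011o" (String.length data) in
  Bytes.blit_string size 0 h 124 11;
  Bytes.set h 156 '0';
  let padding = (block - String.length data mod block) mod block in
  Bytes.to_string h ^ data ^ String.make padding '\000'

let () =
  let archive =
    make_entry "packages/foo/foo.1.2/opam" "opam-version: \"2.0\"\n"
    ^ make_entry "packages/bar/bar.0.1/opam" "opam-version: \"2.0\"\n"
    ^ String.make (2 * block) '\000'
  in
  let count = fold_tar (fun n _name _data -> n + 1) 0 archive in
  Printf.printf "%d files in the archive\n" count
```

A real reader would stream the gzip data rather than hold the whole archive in memory and would validate header checksums.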

The major pain point I could see in the opam code when switching to this is that we currently diff the previous state of the repository against the new one, so if we want to keep doing that we'd need to reimplement diffing between two archives manually. However, I'm not sure this is useful, so we could take the opportunity to simplify the repository backend code (as in src/repository) and avoid this diff+patch overlay.
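
If diffing had to be kept, a manual diff between two archives could look roughly like the sketch below. It assumes each archive has first been loaded into a map from file path to contents; `SMap`, `change`, `snapshot_of_list`, and `diff_snapshots` are hypothetical names, not existing opam functions.

```ocaml
(* Hypothetical sketch: compare two repository snapshots, each modelled as a
   map from file path to file contents. None of these names exist in opam. *)
module SMap = Map.Make (String)

type change = Added | Removed | Modified

let snapshot_of_list l =
  List.fold_left (fun m (path, data) -> SMap.add path data m) SMap.empty l

(* Keep only the paths whose presence or contents differ. *)
let diff_snapshots old_repo new_repo =
  SMap.merge
    (fun _path before after ->
       match before, after with
       | None, Some _ -> Some Added
       | Some _, None -> Some Removed
       | Some a, Some b when a <> b -> Some Modified
       | _ -> None)
    old_repo new_repo

let () =
  let old_repo =
    snapshot_of_list [ ("foo/foo.1.2/opam", "v1"); ("bar/bar.0.1/opam", "v1") ]
  in
  let new_repo =
    snapshot_of_list [ ("foo/foo.1.2/opam", "v2"); ("baz/baz.0.3/opam", "v1") ]
  in
  SMap.iter
    (fun path change ->
       let label =
         match change with
         | Added -> "added"
         | Removed -> "removed"
         | Modified -> "modified"
       in
       Printf.printf "%s: %s\n" label path)
    (diff_snapshots old_repo new_repo)
```

This compares full contents in memory; a real implementation might compare hashes instead to avoid holding two complete snapshots at once.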

Following every use of OpamRepositoryPath.tar and OpamRepositoryConfig.repo_tarring should give a full enough picture to know where to change things. Reading is done in OpamRepositoryState.load_opams_from_dir so this should be the function to change to use the ocaml-tar PoC above.

There is a chance this issue is required to fix https://github.com/ocaml/opam/issues/5741, which is currently slotted for 2.2.0~rc1, so I might bite the bullet and take the time to implement it if no one does it beforehand (if you do, please ping me so we can synchronize).