mamba-org / mamba

The Fast Cross-Platform Package Manager
https://mamba.readthedocs.io

Solving fully specified dependencies should be faster #2824

Open jonashaag opened 1 year ago

jonashaag commented 1 year ago

If you have an environment like:

dependencies:
  - pkg1=v1=b1
  - pkg2=v2=b2
  ...
  - pkg3 # potentially

With or without the non-fully-specified pkg3, dependency solving should be relatively fast.

For an example 300-package environment with only fully specified dependencies ("no pkg3"), solving takes around 15 s, even though there is not a single decision to be made by the solver.

I think the reason is that we always "send" all packages to libsolv by reading in the .solv files and libsolv is not optimized for this scenario.

We might be able to dedicate some time to this. Design question: would we rather "send" fewer things to libsolv, or would we want to optimize this in libsolv?

For the case that all dependencies are fully specified ("no pkg3") we could "send" only the relevant subset to libsolv, although that requires knowing that all dependencies are fully specified. We could try solving with the subset first and, if that doesn't work, fall back to the slow solve.
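
To make the fallback idea concrete, a minimal sketch, assuming hypothetical `subset_fn` and `solve_fn` helpers standing in for the real pre-filtering and libsolv calls (neither is an existing mamba API):

```python
# Sketch only: `solve_fn` and `subset_fn` are placeholders, not mamba APIs.
def solve_with_fallback(specs, repodata, solve_fn, subset_fn):
    # A fully specified spec like "pkg=1.2.3=build_0" has two '=' separators.
    fully_specified = all(spec.count("=") >= 2 for spec in specs)
    if fully_specified:
        subset = subset_fn(specs, repodata)  # only records reachable from the specs
        result = solve_fn(specs, subset)     # fast solve over the reduced search space
        if result is not None:
            return result
    return solve_fn(specs, repodata)         # slow path: full repodata
```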

cc @delsner

wolfv commented 1 year ago

I think @AntoinePrv is currently working on alternative ways of loading the repodata.

In rattler, we are loading the repodata sparsely (only the records that we need, recursively) and we actually make use of the fact that the packages are sorted alphabetically (doing a binary search for them).
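
For illustration only (not rattler's actual code): if the records are kept in a list sorted by package name, all records for a given name can be found with two binary searches.

```python
import bisect

def records_for(sorted_names, records, name):
    # sorted_names: the package name of each record, in sorted order;
    # records: the parallel list of record dicts.
    lo = bisect.bisect_left(sorted_names, name)
    hi = bisect.bisect_right(sorted_names, name)
    return records[lo:hi]
```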

In the future we are working on a sparse repodata protocol where there would be one index file that only contains all the available package names from a subdir and multiple smaller files that contain the metadata. This would mean more requests but less data to download in total and can hopefully keep up better with the conda-forge growth (also inspired by cargo, actually, where they introduced such a sparse protocol).
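
A hedged sketch of how a client might consume such a layout: one small index listing all package names in a subdir, plus one metadata shard per name, fetched recursively for the dependencies that are actually needed. The URL scheme and JSON shapes below are invented for illustration; no such protocol exists yet.

```python
import json
import urllib.request

BASE = "https://example.org/conda-forge/linux-64"  # hypothetical endpoint

def fetch_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def fetch_sparse(root_names):
    # Index file: just the list of package names available in this subdir.
    index = set(fetch_json(f"{BASE}/index.json")["packages"])
    seen, todo, records = set(), list(root_names), []
    while todo:
        name = todo.pop()
        if name in seen or name not in index:
            continue
        seen.add(name)
        shard = fetch_json(f"{BASE}/{name}.json")  # only this package's records
        for rec in shard["packages"].values():
            records.append(rec)
            todo.extend(dep.split()[0] for dep in rec.get("depends", []))
    return records
```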

jonashaag commented 1 year ago

Interesting. For the new protocol couldn't you just use something like SQLite with proper indices?
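
For concreteness, a schema along these lines with the obvious indices might look like the sketch below (table and column names are made up, not an existing format):

```python
import sqlite3

con = sqlite3.connect("repodata.sqlite")
con.executescript(
    """
    CREATE TABLE IF NOT EXISTS packages (
        filename TEXT PRIMARY KEY,   -- e.g. numpy-1.26.0-py311_0.conda
        name     TEXT NOT NULL,
        version  TEXT NOT NULL,
        build    TEXT NOT NULL,
        record   TEXT NOT NULL       -- full repodata record as JSON
    );
    CREATE TABLE IF NOT EXISTS depends (
        filename TEXT NOT NULL REFERENCES packages(filename),
        dep_name TEXT NOT NULL,      -- first token of the dependency spec
        dep_spec TEXT NOT NULL
    );
    CREATE INDEX IF NOT EXISTS idx_packages_name ON packages(name);
    CREATE INDEX IF NOT EXISTS idx_depends_file  ON depends(filename);
    CREATE INDEX IF NOT EXISTS idx_depends_name  ON depends(dep_name);
    """
)
```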

wolfv commented 1 year ago

Maybe! Do you know what the proper indices would look like? Could be cool to write a prototype, e.g. in Python. In rattler we can easily feed it deserialized repodata from any source :)

jonashaag commented 1 year ago

Very naive implementation: https://gist.github.com/jonashaag/575e89f6e9493297e350902121518ff0

Takes 0.1 s to "pre-solve" the example 300-package env (reducing the number of builds to send to libsolv by 90%).

❯ du -h 497deca9.*
272M    497deca9.json
 55M    497deca9.solv
160M    497deca9.sqlite  # Includes all of the info that's in the json file

With zstd compression, we can get all of these files down to < 30 MB at ~ no runtime cost.

Is there an easy way to add proper solving to this prototype?

wolfv commented 1 year ago

Cool! I was thinking about remotely mounting the file somehow (instead of having to download the whole file), but I am realizing that this is not really something SQLite supports. Although I found this project: https://github.com/phiresky/sql.js-httpvfs. One could implement a "VFS" that does HTTP range requests.

We have some additional ideas on repodata layout that would make it a single file where one would need to do range requests to download only the parts that one needs (a bit like how zchunk works, but with each chunk being a specific package name).

With zchunk, you first have a header that tells you the size of the index header, then the index header that tells you the number and size of all chunks, including their hashes. The job of zchunk is then to download all chunks it doesn't have and decompress them (each chunk is individually compressed with zstd, with an optional compression dictionary).
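
A minimal sketch of consuming such a layout with HTTP range requests, assuming an already-parsed index of name/offset/size/hash entries (the index format here is invented; real zchunk uses its own binary header, and `zstandard` is the third-party Python binding for zstd):

```python
import hashlib
import json
import urllib.request

import zstandard  # pip install zstandard

def fetch_range(url, offset, size):
    # Download only `size` bytes starting at `offset`.
    req = urllib.request.Request(
        url, headers={"Range": f"bytes={offset}-{offset + size - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def fetch_chunks(url, index, wanted_names, cache):
    dctx = zstandard.ZstdDecompressor()
    out = {}
    for entry in index:  # entries: {"name", "offset", "size", "sha256"}
        if entry["name"] not in wanted_names:
            continue
        if entry["sha256"] in cache:  # chunk already present locally
            out[entry["name"]] = cache[entry["sha256"]]
            continue
        raw = fetch_range(url, entry["offset"], entry["size"])
        assert hashlib.sha256(raw).hexdigest() == entry["sha256"]
        decoded = json.loads(dctx.decompress(raw))
        cache[entry["sha256"]] = decoded
        out[entry["name"]] = decoded
    return out
```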

jonashaag commented 1 year ago

There are a bunch of SQLite replication implementations.

AntoinePrv commented 1 year ago

On reading repodata.json

We are trying to parse it ourselves instead of letting libsolv do it. nlohmann/json is giving us a lot of trouble on Windows, so we are investigating other libraries.

Pin implementation

Currently the pin is very lightweight on the mamba side, simply relying on libsolv constraints. As with all things solver-related, that doesn't mean doing it a different way wouldn't be faster.

On the issue

For an example 300-package environment with only pinned dependencies ("no pkg3"), solving takes around 15 s, even though there is not a single decision to be made by the solver.

@jonashaag I am not quite sure about the steps in which your environment evolves. We do send everything to libsolv, because if you add a pin, that might change the packages you need. Do you want to validate that the current environment satisfies the pin?

or would we want to optimize this in libsolv?

Unfortunately, right now the libsolv repo is not responding to even the most basic PRs...

On pre-solving

It could be very nice to have a pre-solve step, but I am curious why libsolv does not do it...

On new protocol

@wolfv Amazing! I've been thinking that repodata.json is a monster and that repodata patches are just a band-aid that does not solve all problems. I've been wanting to write a CEP for exactly what you propose (+ maybe timestamps). Benefits include parallel downloads, no need for big allocations (or lazy reading), the possibility to pre-solve only a subset of packages... By all means, we shouldn't wait for the CEP to start working on this.

On SQLite

How does SQLite relate to all this?

jonashaag commented 1 year ago

On SQLite: It was just a random idea to use SQLite instead of rolling our own repodata format, be it .json or .solv or something else. I think having a proper database for the repodata would make many things easier to implement locally. Not sure how well suited it is for fetching incremental updates from conda-forge etc.

Do you want to validate that the current environment satisfies the pin ?

Sorry, "pinned" in my original post means a dependency constraint of the form pkg=version=build. So not necessarily something in the pinned file.

So in my scenario we have a fully solved environment with all the packages uniquely specified in the environment, so there is no solving work to be done by libsolv. I don't expect libsolv to detect this and have special support for it, but I wonder why it's so slow. A version of libsolv that is optimized for this scenario could first pick all of the "pinned" dependencies, then look for any dependencies not yet matched/found, realize that all the work is already done, validate the solve result, and finish.
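
A sketch of that "validate instead of search" logic, assuming plain repodata records and a `matches(spec, record)` helper with MatchSpec semantics (that helper is assumed here, not an existing mamba/libsolv API):

```python
def validate_pinned_solve(pinned_records, matches):
    # pinned_records: exactly one record per `pkg=version=build` spec in the env.
    by_name = {rec["name"]: rec for rec in pinned_records}
    for rec in pinned_records:
        for dep_spec in rec.get("depends", []):
            dep_name = dep_spec.split()[0]
            candidate = by_name.get(dep_name)
            # Every dependency must already be pinned and satisfy the spec;
            # otherwise we have to fall back to a real solve.
            if candidate is None or not matches(dep_spec, candidate):
                return False
    return True
```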

AntoinePrv commented 1 year ago

I don't know the api/install.cpp code well, but there should be a way to do that which does not call the solver.

Like here, but I'm not sure where this "explicit_install" gets changed: https://github.com/mamba-org/mamba/blob/a527511c14ade1db03b3a88cf9c603eb35469663/libmamba/src/api/install.cpp#L411

jonashaag commented 1 year ago

No, we need to call the solver because we don't actually know the environment is fully solved and consistent.

We just want solving to be faster by reducing the search space to a minimum.

jonashaag commented 1 year ago

I prototyped the idea of precomputing a full transitive dependency matrix whenever we update the repodata. With that matrix the "pre-solve" step is free because it's just a single lookup.

The idea is that you use an adjacency matrix of all packages and their dependencies. With conda-forge's linux-64 + noarch packages that is a $20000 \times 20000$, 99% sparse matrix $M$ where each row and column represents a package name. (Package names have to be mapped to integer indices.) From $M$ you can compute the reachability matrix $R := M + M^2 + M^3 + \dots$. My scipy prototype does this in ~1 s. Disk usage of the matrix is ~5 MiB uncompressed and ~0.5 MiB compressed.

With this matrix $R$, knowing the set of possible transitive dependencies of a package is instant.
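
A minimal scipy sketch of the construction described above (not the author's actual prototype); package-name indexing and edge extraction are simplified:

```python
import numpy as np
from scipy.sparse import csr_matrix

def reachability(adjacency):
    # R = M + M^2 + M^3 + ... via repeated squaring until no new reachable
    # pairs appear; values are clamped to 1 so R stays a 0/1 indicator.
    r = csr_matrix(adjacency, dtype=np.int64)
    r.data[:] = 1
    while True:
        stepped = r @ r + r
        stepped.data[:] = 1
        if stepped.nnz == r.nnz:
            return stepped
        r = stepped

# Toy usage: scipy depends on numpy, numpy depends on python.
names = ["python", "numpy", "scipy"]
idx = {n: i for i, n in enumerate(names)}
edges = [("scipy", "numpy"), ("numpy", "python")]  # depender -> dependee
m = csr_matrix(
    (np.ones(len(edges)), ([idx[a] for a, _ in edges], [idx[b] for _, b in edges])),
    shape=(len(names), len(names)),
)
r = reachability(m)
print([names[j] for j in r[idx["scipy"]].indices])  # ['python', 'numpy']
```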

AntoinePrv commented 1 year ago

I prototyped the idea of precomputing a full transitive dependency matrix whenever we update the repodata. With that matrix the "pre-solve" step is free because it's just a single lookup.

Cool! We also have a graph class, and I think there is value in being able to construct a generic dependency graph from a set of packages. E.g. it is already done in prefix_data.cpp, and I know of at least one other place where it could be useful.

jonashaag commented 1 year ago

Another version of this: if you add another dependency to an existing, medium-sized env (300 pkgs), it can take up to 60 s to solve, even if the final transaction only adds a single package.

jonashaag commented 1 year ago

Interestingly, pixi takes < 10 s for the same addition.

AntoinePrv commented 1 year ago

Could you share an MRE?

jonashaag commented 1 year ago

Here's a reduced MRE that takes 10 s to solve; I'll see if I can share a larger one:

janxkoci commented 11 months ago

For an example 300-package environment with only fully specified dependencies ("no pkg3"), solving takes around 15 s, even though there is not a single decision to be made by the solver.

Wait - the solver doesn't check that the specs are actually compatible (and not nonsense written by a user)?