haskell / cabal

Official upstream development repository for Cabal and cabal-install
https://haskell.org/cabal

`new-build` dependency resolution performance problems with custom builds #6309

Open alex-rozenshteyn-leapyear opened 5 years ago

alex-rozenshteyn-leapyear commented 5 years ago

Context

I'm converting a project that uses custom builds and stack to use cabal new-build. I'm running into performance problems.

repro.tar.gz

Describe the bug

On a project with 22 packages, dependency resolution is taking 2 minutes. The structure of the project is as follows:
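(Sketch: going by the package names that come up in the discussion below, the project is a cabal.project with 22 local packages, ex-base and ex-test plus dep01 through dep20, where each dep* package has a custom Setup. The exact layout is in the attached repro.tar.gz; the directory names below are an assumption.)

-- cabal.project (sketch; the real layout is in repro.tar.gz)
packages: ./ex-base
          ./ex-test
          ./dep*/*.cabal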

To Reproduce

Steps to reproduce the behavior:

$ mkdir test
$ cd test
$ wget https://github.com/haskell/cabal/files/3774090/repro.tar.gz
$ tar xf repro.tar.gz
$ time cabal v2-build --dry-run


Expected behavior

It should be faster.

System information

Additional context

The actual project I'm trying to migrate has 72 packages.

alex-rozenshteyn-leapyear commented 5 years ago

This results in a long wait time every time I change any .cabal file. My guess as to what's happening is as follows:

I think this leads to the following solution:

Let me know if I misunderstood something.

grayjay commented 4 years ago

I ran the repro, and I think you have identified the cause of the issue. The verbose log showed that cabal spent the whole time solving for the 22 packages without having to backtrack. It had to solve for ex-base and ex-test with different qualifiers for each of dep01 through dep20 (13826 steps!), though it ended up linking all of the instances so that each package only had to be built once.
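For concreteness, those per-dep qualifiers are what you get when each dep* package has build-type: Custom and its setup-depends include ex-base and ex-test: the solver then has to solve ex-base and ex-test once under each dep* package's setup qualifier. A minimal sketch of such a package description, assumed from the issue title and the discussion (the exact stanzas are in the attached tarball):

-- dep01.cabal (sketch)
cabal-version: 2.4
name:          dep01
version:       0.1.0.0
build-type:    Custom

custom-setup
  -- solved under dep01's setup qualifier, separately from the other dep* packages
  setup-depends: base, Cabal, ex-base, ex-test

library
  exposed-modules:  Dep01
  build-depends:    base
  default-language: Haskell2010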

cabal doesn't currently have support for reusing part of the install plan. I think that could be difficult to implement, because cabal also tries to avoid building multiple versions of the same package for different qualifiers where possible, using the dependency solver's "linking" feature. That means that the dependencies for different executables or setup scripts are not treated as independent. I'm not sure if linking helps in projects like this, though. In addition to preventing partial invalidation of the install plan, solving for all of the dependencies at once increases the size of the state (for example, active constraints), which makes solving slower than O(N) even without conflicts.

I tried to estimate the run time for solving for the packages independently. I compared the time to solve for only ex-base and ex-test with the time to solve for dep01, ex-base, and ex-test in order to estimate the time to solve for one dep* package independently. The difference was ~1 second on my machine. Solving for all of the packages independently should take ~20 seconds. That is still a long time, but it's better than the ~150 seconds that the whole repro project takes now.

Maybe we could add a flag to solve for executables and setup scripts independently and then see how often it causes multiple versions to be built unnecessarily.

I also tried profiling cabal, but nothing stood out as the main cause of slowness: https://gist.github.com/grayjay/25393d9adb0a515189eed623c509c80e

phadej commented 4 years ago

@grayjay, I wonder why reusing part of the install plan is difficult when we specifically try to avoid building multiple versions of the same package?

Is the counter-example one where three (or more) packages depend on the same build-tool-depends: someexe:someexe, but with constraints such that a single version of someexe cannot satisfy them all; you need at least two different versions, and the choice is not obvious (not even to the solver)?
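For illustration, that counter-example would look roughly like this (someexe comes from the comment above; the package and executable names and the version bounds are made up):

-- in pkg-a.cabal: needs the newer tool
executable pkg-a-exe
  main-is:            Main.hs
  build-depends:      base
  build-tool-depends: someexe:someexe >= 2.0
  default-language:   Haskell2010

-- in pkg-b.cabal: only works with the older tool
executable pkg-b-exe
  main-is:            Main.hs
  build-depends:      base
  build-tool-depends: someexe:someexe < 2.0
  default-language:   Haskell2010

No single version of someexe satisfies both, so the solver must pick at least two versions, and which packages end up sharing which version is not obvious up front.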

grayjay commented 4 years ago

I'm not sure I understand. Avoiding building multiple versions of a package means that the solver needs to be aware of all build targets so that it can try to reuse their dependencies. The solver could take the cached part of the install plan as input, but then it would be difficult to ensure that the result doesn't depend on which part of the install plan was reused. That would at least require a significant change to the solver's design.

phadej commented 4 years ago

@grayjay I'm thinking more in the direction that the solver shouldn't try so hard to reuse the dependencies of qualified goals. Everything will be cached anyway, and one can argue that more independent subplans could actually improve cache hits across different unrelated projects (e.g. ones that all use the hspec-discover build tool).
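For reference, the hspec-discover case is the common build-tool-depends pattern; many otherwise unrelated projects contain a test-suite stanza along these lines (sketch):

test-suite spec
  type:               exitcode-stdio-1.0
  main-is:            Spec.hs
  hs-source-dirs:     test
  build-depends:      base, hspec
  build-tool-depends: hspec-discover:hspec-discover
  default-language:   Haskell2010

If that executable goal were solved as its own subplan, the resulting hspec-discover build could be shared from the store across all such projects rather than being re-derived inside each project's monolithic plan.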

grayjay commented 4 years ago

I agree, that's what I meant by solving for executables and setup scripts independently. I think that we could solve for an executable by rerunning the solver with only the executable name and the constraint from its reverse dependency as inputs.

Solving for qualified goals independently would require removing the single instance restriction, so I opened #6459.