haskell / cabal

Official upstream development repository for Cabal and cabal-install
https://haskell.org/cabal
Other
1.61k stars 690 forks source link

Make the cabal-install solver multilibs-aware #6039

Open fgaz opened 5 years ago

fgaz commented 5 years ago

The Cabal part is done, but cabal-install's solver is not aware of the multiple public libraries feature yet, leading to late errors and issues like #6038

23Skidoo commented 5 years ago

/cc @grayjay @kosmikus -- How hard do you think this would be to implement?

grayjay commented 5 years ago

I think this would be pretty straightforward, at least for source packages, because it could be implemented similarly to enforcing build tool dependencies. The visibility field could probably be implemented similarly to the current handling of unbuildable components. I have a few questions about how visibility should behave, though.

There may be an issue with enforcing visibility for installed packages, because there is a hack where the solver treats installed internal libraries as separate packages with munged names (https://github.com/haskell/cabal/blob/5f57d4d0ee14592cc488385151c15edef59f9bd6/cabal-install/Distribution/Solver/Modular/IndexConversion.hs#L106-L131). I'm not sure how difficult it will be to distinguish between inter- and intra-package dependencies.

fgaz commented 5 years ago

Should the solver enforce visibility now? I wasn't sure if the comments about disabling the visibility field in #5848 applied here.

Yes. In #5848 we're talking about disabling it at the syntax level, but internally it's still used

Should the solver just check the visibility fields in Library and InstalledPackageInfo? Can the main library be private?

Yes and no, but I'll have to check the second one.

grayjay commented 5 years ago

@fgaz Thanks. I made a PR (#6047), but it only checks the visibility of source packages so far. It reads the libVisibility field for all libraries, though I can easily change it to only check sub-libraries if that is the correct behavior.

grayjay commented 4 years ago

6836 enforces visibility for source packages, so the only remaining task is to handle installed packages.

gbaz commented 3 years ago

@grayjay it looks like you did most of the work to resolve this, can you elaborate on what's still missing?

grayjay commented 3 years ago

@gbaz The only thing that is missing in the dependency solver is enforcing visibility restrictions for installed packages, so that one package cannot depend on a private installed sub-library from another package. The change could be difficult due to the hack mentioned in https://github.com/haskell/cabal/issues/6039#issuecomment-491195495.

I also noticed that #7270 may relate to the solver, though I haven't looked into it yet.

grayjay commented 2 years ago

Issue #7270 shows that there is actually more to do here than enforcing visibility. The hack described above means that the solver handles sublibraries differently depending on whether they are source or installed. From reading the code, it looks like dependencies between packages of the same type (source or installed) are handled correctly, but dependencies from source to installed packages are handled incorrectly, due to the mismatch. Here is my current understanding of the code:

Source package -> Source package The solver represents each package as a single PInfo value. The PInfo lists the available components. Information about the libraries, such as their dependencies, is merged together, and intra-package dependencies can be safely ignored.

Installed package -> Installed package The solver represents each library in the package as a separate PInfo. Intra-package and inter-package dependencies are both represented similarly, as dependencies between PInfos. This behavior seems correct, but it is misleading, because PInfo is short for package info and was initially designed to represent a whole package. There is currently no distinction between main libraries and sublibraries or intra-package dependencies and inter-package dependencies.

Source package -> Installed package When a source package depends on an installed library, the dependency is treated as a dependency on the main library, even if the installed library is actually a sublibrary. I think that this is the cause of the solver issue in #7270.

I think that the best way to fix the issue would be to treat installed packages similarly to source packages, i.e., combine all libraries from a single installed package into a single PInfo. This design is compatible with the component based solving described in #4087. It would also require very few changes in the rest of the solver, because the solver could just treat installed sublibraries similarly to source sublibraries when checking existence and visibility. However, I'm not sure whether the design would work with the current structure of the installed package index. I don't know whether we have the information required to group installed libraries by package instance.

Mikolaj commented 2 years ago

Let's try to help @grayjay with this. I can make tea. :) @gbaz: could you answer the question about whether we have the information (in the package index presumably?) to group installed libraries by package instance? @fgaz: you are the multilib boss, any remarks about this plan?

grayjay commented 2 years ago

Thanks @Mikolaj!

Here are some more details about my question. Cabal passes the installed package index to the dependency solver using the type InstalledPackageIndex, as used in this function that converts all packages to type PInfo:

https://github.com/haskell/cabal/blob/1f84e56396009d21e91e1cfb95d3c540c35170cc/cabal-install-solver/src/Distribution/Solver/Modular/IndexConversion.hs#L56-L59

InstalledPackageIndex:

https://github.com/haskell/cabal/blob/1f84e56396009d21e91e1cfb95d3c540c35170cc/Cabal/src/Distribution/Simple/PackageIndex.hs#L154-L156

PackageIndex:

https://github.com/haskell/cabal/blob/1f84e56396009d21e91e1cfb95d3c540c35170cc/Cabal/src/Distribution/Simple/PackageIndex.hs#L124-L149

InstalledPackageInfo:

https://github.com/haskell/cabal/blob/1f84e56396009d21e91e1cfb95d3c540c35170cc/Cabal-syntax/src/Distribution/Types/InstalledPackageInfo.hs#L35-L39

I'd like to group libraries of the same package name, version, and instance. The ability to group the libraries is most important for choosing the dependencies of a source package that directly or indirectly depends on more than one library from the same installed package. The solver needs to ensure that all libraries chosen from that installed package are consistent (built with the same flags, dependencies, etc.).

grayjay commented 2 years ago

We discussed the issue with grouping installed libraries by package instance at the Cabal meeting. It sounds like the information that we would need is probably not available within cabal, so we would need to add more information to the installed package database. This information would also be visible to GHC and other tools. One idea is to add an "InstanceUnitId" to every installed library. The InstanceUnitId would be equivalent to the UnitId of the package's main library, regardless of whether the package actually has a main library. Then all libraries that were installed together would have the same InstanceUnitId.

@fgaz I wanted to check with you before proceeding, since I want to make sure that this idea is consistent with the multilibs design and that there isn't already a separate mechanism for determining which libraries were installed together.

Mikolaj commented 2 years ago

@bgamari: could you confirm the "InstanceUnitId" feature is in GHC? is 9.4? @fgaz seems to have a good understanding of the exact GHC feature we need (and, let's hope, that is already in HEAD).

[Edit: @gbaz, @fgaz, feel free to correct my question or prove it's not needed.]

fgaz commented 2 years ago

Continuing from https://github.com/haskell/cabal/issues/7270#issuecomment-1092630544:

As opposed to Cabal, cabal-install has to deal with package instances, and that's definitely true for source packages. But is that true for preexisting installed packages too? For example, leaving multiple libraries aside, is having two preinstalled instances of random-1.2.1 in the global package db supported? If it isn't, can we --like in Cabal-- just assume that all libraries with the same (package-name, version) belong to the same instance?

fgaz commented 2 years ago

@grayjay

Relevant comment in Cabal:

https://github.com/haskell/cabal/blob/a5309888fc0a33e511504c105907e8603a572143/Cabal/src/Distribution/Simple/Configure.hs#L1312

Cabal-the-library just picks an arbitrary instance of the package if there are multiple with the same name+version and --exact-configuration/--dependency is not used

Mikolaj commented 1 year ago

Hi! What's the status of this one? Is anybody actively working on it or planning to?

grayjay commented 1 year ago

@Mikolaj The next step is to determine how cabal-install handles multiple installed instances of a package (especially packages with multiple libraries) to answer @fgaz's questions in https://github.com/haskell/cabal/issues/6039#issuecomment-1092642955. If cabal-install does need to handle multiple instances, then we need to add an "InstanceUnitId" field to packages in the installed package database. Then the solver needs to be updated to group installed libraries by package name, version, and possibly instance and enforce dependencies on them similarly to source libraries. I'm not sure when I'll have more time to work on this.

Mikolaj commented 1 year ago

And is " how cabal-install handles multiple installed instances" some ancient knowledge of the existing codebase that has been lost, or is it something for us to decide and design afresh, or something in-between --- there were plans, but they have not been implemented or not fully enough to assure they make sense?

grayjay commented 1 year ago

@Mikolaj I was thinking that we could just determine how cabal-install currently handles multiple installed instances of a package by testing. I would like to test two things:

Mikolaj commented 1 year ago

I wonder if a volunteer could easily help with the investigation. How would one inspect the "solver's view of the installed package index"? Run with -v3 or look at dist-newstyle/cache/plan.json or is there some special debug method (e.g., decoding any of the binary dist-newstyle/cache/*-plan files)?

grayjay commented 1 year ago

The most direct way to see how the solver handles multiple installed instances is to look up the package by name in the Index returned by convPIs and print all returned instances. I think that -v3 would also work, but the test case would need to cause the solver to reject all versions of the installed package (e.g., with an unsatisfiable version constraint) in order to make the solver print them all.

I noticed that cabal has a --shadow-installed-packages flag, so the behavior may actually be configurable.

If the index returned by convPIs contains both instances of the package in the test, then I think it would also be useful to see whether it is possible to depend on either one of them.

andreabedini commented 1 year ago

I have done some preparatory work and I am planning to work on this issue this week. @grayjay @Mikolaj Let me know if there's been any progress not reported here.

andreabedini commented 1 year ago

Regarding package shadowing:

cabal-install/src/Distribution/Client/ProjectConfig.hs 258: --solverSettingShadowPkgs = fromFlag projectConfigShadowPkgs

cabal-install/src/Distribution/Client/ProjectPlanning.hs 1086: -- . setShadowPkgs solverSettingShadowPkgs

Mikolaj commented 1 year ago

I don't know of any unreported progress. There is certainly reported interest in related things on #hackage, in https://discourse.haskell.org/t/new-hackage-server-features/2621/42, in https://github.com/haskell/hackage-server/issues/1119.

grayjay commented 1 year ago

@andreabedini Thanks for volunteering to work on this issue! I don't know of any additional progress, but I can give my current thoughts about test cases and the desired behavior.

I mentioned two test cases in https://github.com/haskell/cabal/issues/6039#issuecomment-1318399645. For the first case, I'm assuming that regardless of the value of the flag --shadow-installed-packages, the solver won't choose an install plan that uses two instances of a non-multilibs package as dependencies of the same component. (I don't think we need to test this, due to the single instance restriction mentioned in #6459.) In my opinion, this means that the multilibs feature should enforce similar consistency and should prevent two instances of a multilibs package from being used as dependencies of the same component.

For the second test case in https://github.com/haskell/cabal/issues/6039#issuecomment-1318399645, I wanted to give a more concrete example:

  1. Install package A-1.0 in the global package database. Package A-1.0 only contains one component, sublibrary B.
  2. Modify the source code of package A-1.0 so that it instead has sublibrary C as its only component. Install it in the global package database, too.
  3. Try to build package D-1.0 using "cabal build". Package D-1.0 has build-depends dependencies on both sublibrary B and sublibrary C. The build command should fail, since there is no instance of A-1.0 that contains both components.

Alternatively, the test case could avoid modifying source code by giving package A-1.0 a build flag that makes one sublibrary unbuildable when it is true and the other unbuildable when it is false and then installing both configurations.

andreabedini commented 1 year ago

Here some notes from today. Some of this might be well-know but it wasn't for me :)

First question:

Install two instances of a non-multilibs package and see how they appear in the solver's view of the installed package index. If cabal already doesn't handle two installed instances of a non-multilibs package, then there is less reason to fully support multiple instances of multilibs packages.

❯ cat cabal.project 
active-repositories: :none
package-dbs: clear, global, /home/andrea/.local/state/cabal/store/ghc-9.2.7/package.db
extra-packages: zlib
constraints: zlib <0
❯ cabal build -v3 --dry-run zlib | grep '^\['
...
[__0] rejecting:
zlib-0.6.3.0/installed-6e56dc43f3ec2be12bd5546861b15931b096e743f56e443fc2faae3bceb1327a,
zlib-0.6.3.0/installed-727383b633c94a5c0b336dffa9dc4d80869f0f864348ed507f9050a29cacf41b,
zlib-0.6.3.0/installed-e3964c6e976321c89432925b0aa9fc494f0907ce8b7202161f602167c20533c2
(constraint from project config /home/andrea/tmp/two-cases/prj/cabal.project
requires <0)
...
❯ ghc-pkg --package-db /home/andrea/.local/state/cabal/store/ghc-9.2.7/package.db field zlib-0.6.3.0 id 
id: zlib-0.6.3.0-6e56dc43f3ec2be12bd5546861b15931b096e743f56e443fc2faae3bceb1327a
id: zlib-0.6.3.0-e3964c6e976321c89432925b0aa9fc494f0907ce8b7202161f602167c20533c2
id: zlib-0.6.3.0-727383b633c94a5c0b336dffa9dc4d80869f0f864348ed507f9050a29cacf41b

Multiple instances do not seem to be a problem.

Interlude:

The solver represents each library in the package as a separate PInfo. Intra-package and inter-package dependencies are both represented similarly, as dependencies between PInfos. This behavior seems correct, but it is misleading, because PInfo is short for package info and was initially designed to represent a whole package. There is currently no distinction between main libraries and sublibraries or intra-package dependencies and inter-package dependencies.

Indeed sublibraries are installed in the packagedb with one entry per component, with a mangled name. E.g.

❯ ghc-pkg --package-db /home/andrea/.local/state/cabal/store/ghc-9.2.7/package.db list | grep plutus-core
    plutus-core-1.1.1.0
    (z-plutus-core-z-index-envs-1.1.1.0)

where index-envs is a private sub-library of plutus-core package and the main library depends on the sublibrary (via its unit-id). The two package-db entries look like this:

❯ ghc-pkg --package-db /home/andrea/.local/state/cabal/store/ghc-9.2.7/package.db field plutus-core-1.1.1.0 name,version,id
name: plutus-core
version: 1.1.1.0
id: plutus-core-1.1.1.0-d85fe72bf2ff0728e98538cf83f2240595161f111577c50ccb9d9e9c4f3eff21
❯ ghc-pkg --package-db /home/andrea/.local/state/cabal/store/ghc-9.2.7/package.db field z-plutus-core-z-index-envs-1.1.1.0 name,version,package-name,lib-name,id
name: z-plutus-core-z-index-envs
version: 1.1.1.0
package-name: plutus-core
lib-name: index-envs
id: plutus-core-1.1.1.0-l-index-envs-6b7f40b957785c8930efb43ea13c63e65f84ebf44c61618f8ce55506cdd8acbb

(Note that the sub-library has package-name and lib-name while the main library does not.)

From the solver point of view, all entries in ghc-pkg are separate instances. The index (result of convPIs) indeed shows two entries:

plutus-core-1.1.1.0 Inst (UnitId "plutus-core-1.1.1.0-d85fe72bf2ff0728e98538cf83f2240595161f111577c50ccb9d9e9c4f3eff21")
z-plutus-core-z-index-envs-1.1.1.0 Inst (UnitId "plutus-core-1.1.1.0-l-index-envs-6b7f40b957785c8930efb43ea13c63e65f84ebf44c61618f8ce55506cdd8acbb")

Second question:

For the second test case in https://github.com/haskell/cabal/issues/6039#issuecomment-1318399645, I wanted to give a more concrete example:

Despite the clear instructions this test turned out a bit trickier for me. I made a package with two sub-libraries and installed them in the user packagedb (I had to use cabal act-as-setup -- install --user but that's ok). There's a bit of a problem because Setup.hs install calls ghc-pkg update which overrites entries with the same name, so I could not easily install more instances.

That said. Since each instance contains only one component, it looks meaningless to say "instance of A-1.0 that contains both components".

Plan for tomorrow:

Looking at the solver index it looks like it's not demangling the component names. IIRC Distribution.Client.IndexUtils.getInstalledPackages does the demangling correctly. I need to double check.

The original problem I am running into is that the solver does not seem to be able to re-use public sublibraries in the packagedb, and ends up recompiling anything that depends on them. This is the behaviour I described in https://github.com/input-output-hk/haskell.nix/issues/1662#issuecomment-1529558668. Tomorrow I'll investigate how the solver sees the situation. IIRC the solver was complaining it could not use the installed package because some component was missing. Other thing to double check :)

grayjay commented 1 year ago

Thanks for investigating. I didn't realize that sublibrary names were still mangled in the package db.

Despite the clear instructions this test turned out a bit trickier for me. I made a package with two sub-libraries and installed them in the user packagedb (I had to use cabal act-as-setup -- install --user but that's ok). There's a bit of a problem because Setup.hs install calls ghc-pkg update which overrites entries with the same name, so I could not easily install more instances.

That said. Since each instance contains only one component, it looks meaningless to say "instance of A-1.0 that contains both components".

I'm not sure I understand this part, so I wanted to clarify that the definition of "instance" that I was using is a single installation of a given package and version. Instances can differ by having different build flags, versions of dependencies, etc. If the package has more than one library, then the instance will be split across multiple entries in the package db. A major part of the remaining work for this issue is allowing the solver to group installed libraries into instances.

The reason that package D depends on both sublibrary B and sublibrary C in the test case is that I wanted to ensure that cabal doesn't incorrectly group the two installed sublibraries from Package A-1.0 together as one instance when they are actually from two different instances.

The original problem I am running into is that the solver does not seem to be able to re-use public sublibraries in the packagedb, and ends up recompiling anything that depends on them. This is the behaviour I described in https://github.com/input-output-hk/haskell.nix/issues/1662#issuecomment-1529558668. Tomorrow I'll investigate how the solver sees the situation. IIRC the solver was complaining it could not use the installed package because some component was missing. Other thing to double check :)

If I understand correctly, this is the problem I described in https://github.com/haskell/cabal/issues/6039#issuecomment-1046565510. The solver doesn't currently know how to satisfy dependencies from source packages to installed sublibraries. I think that the best way to solve this issue is to represent each installed package as a single PInfo, similarly to source packages. Then the solver could use the PInfo's map of components to track the availability of specific sublibraries.

andreabedini commented 1 year ago

I'm not sure I understand this part, so I wanted to clarify that the definition of "instance" that I was using is a single installation of a given package and version. Instances can differ by having different build flags, versions of dependencies, etc. If the package has more than one library, then the instance will be split across multiple entries in the package db. A major part of the remaining work for this issue is allowing the solver to group installed libraries into instances.

Thank you for clarifying. I was indeed a bit confused about the meaning of "instance" in this discussion but I ended up adopting the one used in the solver (which identifies instances with packagedb entries). Now I understand you are using the term in a more general way and the difference is exactly what we need to teach to the solver.

This note is also relevant. https://github.com/haskell/cabal/blob/367490076f395c802cf459502056de5745fefc7e/cabal-install-solver/src/Distribution/Solver/Modular/IndexConversion.hs#L111-L143

andreabedini commented 1 year ago

The reason that package D depends on both sublibrary B and sublibrary C in the test case is that I wanted to ensure that cabal doesn't incorrectly group the two installed sublibraries from Package A-1.0 together as one instance when they are actually from two different instances.

Thinking about this, are we sure this is a problem? two installed instances of Package A-1.0 will have all their dependencies fixed. Even if they end up being split into multiple libraries in the packagedb, their dependencies will either match or they won't. This only regards dependencies, I am not sure what other degrees of freedom we have and what coherence properties we expect.

grayjay commented 1 year ago

Thank you for clarifying. I was indeed a bit confused about the meaning of "instance" in this discussion but I ended up adopting the one used in the solver (which identifies instances with packagedb entries). Now I understand you are using the term in a more general way and the difference is exactly what we need to teach to the solver.

This note is also relevant.

The solver was originally designed without sublibraries, so it expected there to be one library per installed package. Then that hack was added for internal (private) libraries. It worked because it isn't possible for a source package to depend on an installed library that is private. Now I think we should change the solver's installed package index to act as a map of packages, each with one or more libraries.

Thinking about this, are we sure this is a problem? two installed instances of Package A-1.0 will have all their dependencies fixed. Even if they end up being split into multiple libraries in the packagedb, their dependencies will either match or they won't. This only regards dependencies, I am not sure what other degrees of freedom we have and what coherence properties we expect.

I don't think that we should mix libraries from different installations of a package, since they could differ in more ways than dependencies, such as cabal build flags, compiler flags, or even source code. In my opinion, it would be similar to mixing libraries from different versions of the package.

I also think that recording the instance of an installed library (with the InstanceUnitId idea from above) would simplify forcing the dependencies to match.

angerman commented 9 months ago

A large question around this seems to be: "what does compatible" mean? From a purely linking perspective, we shouldn't be mixing GHC ways. You can't have a profiling library work with a vanilla one.

From a dependency perspective, any "consumer" of a package find that package compatible (with any other instance of that package) as long as the exposed API (symbols, and relevant signatures) match; anything else is a blackbox to the consumer.

This ignore the semantic part where the symbols and signatures could stay the same, but the behaviour/meaning is different. That's what we usually have the PVP for?

grayjay commented 9 months ago

@angerman Is this comment for a different issue? This issue doesn't relate to GHC ways, except that they may both relate to InstalledPackageInfo.

angerman commented 9 months ago

@grayjay possible. @andreabedini had been flooding me with issues 🙈

andreabedini commented 9 months ago

This is the right issue, but maybe things got a bit confused. Let me try to remember and summarise the discussion. The solver being aware of multiple libraries means that the it will have to decide what to do in the situation where

  1. pkg-a-1.2.3.4:lib1 is preinstalled
  2. pkg-a-1.2.3.4:lib2 depends on pkg-a:lib1 and has to be compiled

Under what conditions the preinstalled package can be re-used?

The current answer is never, thanks to the name mangling introduced with private sublibs, the solver does not even see the preinstalled pkg-a-1.2.3.4:lib1 so it always recompiles it.

The precise information about how pkg-a-1.2.3.4:lib1 was build has been lost but we could hash this information into the unit-id.

The solver already knows that pre-installed packages have their dependencies fixed. So the preinstalled pkg-a-1.2.3.4:lib1 will never be used if its dependencies are incompatible with pkg-a-1.2.3.4:lib2.

Flags is something we could pack into the unit-id (and maybe we do already, Cabal and cabal-install use two different schemas and I always forget what goes into Cabal's one).

@grayjay suggests other things can go wrong:

I don't think that we should mix libraries from different installations of a package, since they could differ in more ways than dependencies, such as cabal build flags, compiler flags, or even source code. In my opinion, it would be similar to mixing libraries from different versions of the package.

Am I wrong understanding that in the above quote you imply that current behaviour is the only safe one? That is, we should never re-use a preinstalled sibiling library? I belive this would be too conservative.

grayjay commented 9 months ago

@angerman Now I understand. GHC ways is just an example of how installed packages can vary by installation. I think that we need to capture everything that you mentioned in the field that we use to identify an instance.

@andreabedini I actually hadn't considered the case that you described, where a package is only partially installed (only some of its sublibraries are installed). Do you think that that is likely to be a common case? I think that cabal should be able to handle it, but it seems like an edge case to me, like handling a broken installation.

I also realized that I don't know exactly what information goes into the unit ID. The InstanceUnitId design above relies on the unit ID of the main library capturing everything about how the whole package was built, including information about the other components. If the unit ID of the main library doesn't change when other components change, then maybe the unit ID isn't the best way to identify an instance.

The InstanceUnitId design could allow cabal to use two components of a package that were installed at different times, if the components were built in a compatible way (having the same unit ID for the main library).

The safest design would be for cabal to create a random unique ID for the instance whenever it installs a package, but I don't know if that is practical.

andreabedini commented 9 months ago

@andreabedini I actually hadn't considered the case that you described, where a package is only partially installed (only some of its sublibraries are installed). Do you think that that is likely to be a common case? I think that cabal should be able to handle it, but it seems like an edge case to me, like handling a broken installation.

From my POV, this whole issue was always about this but now I see the distinction you make. Following up my own comment:

Am I wrong understanding that in the above quote you imply that current behaviour is the only safe one? That is, we should never re-use a preinstalled sibiling library? I belive this would be too conservative.

Yes, I was wrong indeed. If we make public sublibraries visible to the solver and use a random (but common) unit id like you say, ~the solver will be able to pick a pre-installed sublibrary (while now it cannot) with absolutely zero risks~.

Edit: no, the solver does not look at any id when it comes to pre-installed packages. It only matches on package name and version. Perhaps the first step is to make the solver aware of the configuration of pre-installed packages.