pubgrub-rs / pubgrub

PubGrub version solving algorithm implemented in Rust
https://pubgrub-rs.github.io/pubgrub/pubgrub/
Mozilla Public License 2.0

Guidance for adding graph location aware exclusions and overrides #131

Closed: baron1405 closed this issue 6 months ago

baron1405 commented 1 year ago

I have reached out to the pub folks but thought I'd try you folks as well since you are very, very familiar with PubGrub.

I have ported the PubGrub solver from pub/Dart to Java, with Maven and NPM artifacts, in order to explore version solving in a personal project. I have the solver working and passing all relevant unit tests, including functional tests that go out to Maven Central and the NPM registry. What is missing is support for Maven's exclusions and NPM's overrides. I plan on using these on the artifacts as filters before supplying the artifacts to the solver (i.e. in incompatibilitiesFor). The problem is that these constructs depend on an artifact's path in the dependency graph. For example, an exclusion specified in a dependency A applies to any transitive dependency of A, no matter how deep in the graph it occurs. So in order to know whether an exclusion applies to a transitive dependency, I need to know the location in the dependency graph that is causing that dependency to be considered.

After much investigation into the source code, I still cannot figure out how to obtain that graph and path during solving. Obviously the graph, and thus the path, will change as new assignments are made and when backtracking. The PubGrub implementation in pub supports overrides, but they are global and independent of an artifact's location in the dependency graph.

Any guidance on how to proceed would be greatly appreciated. I plan to make my repository public once I get this functionality implemented, so any help will likely help others as they consider using PubGrub in their projects.

Thanks in advance for any help!

Eh2406 commented 1 year ago

So in the npm dependency model you can have different versions of the same package at different points in the dependency graph. You can even have the same version of the same package built multiple times at different points in the dependency graph. How did you map that onto PubGrub's one-version-per-package semantics?

PubGrub is built on incompatibilities, where an incompatibility must be a universal statement: a statement whose truth value does not depend on the decisions made during resolution. Thus the next resolution to try is the first thing that does not hit any of the known incompatibilities. If incompatibilities/dependencies/constraints depend on the state of resolution, then backtracking becomes extremely complicated: you need to determine which incompatibilities are invalidated by the backtracking. You would then need a system to prevent/detect infinite loops, where a decision is impossible because of an incompatibility that will be removed when the decision is invalidated. Down this direction lies ~madness~ a fascinating algorithm inspired by, but fundamentally different from, PubGrub.

An alternative approach is related to my first question about the npm dependency model. Perhaps the "package name" needs to be a tuple of (underlying name, exclusions), where a package with an exclusion has a synthetic dependency on the exact matching package without the exclusion, plus each of its actual dependencies with the exclusion applied. If the package adds a new set of exclusions, then the things it depends on carry the union of the exclusions. The synthetic dependency on the unexcluded package ensures that the version numbers match even if it comes through different sources.
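Something like this could serve as the solver-level package type in the Rust impl (a completely untested sketch, names made up for illustration; if I remember our trait bounds right, any Clone + Eq + Hash + Debug + Display type already works as a package):

use std::collections::BTreeSet;
use std::fmt;

/// Hypothetical solver-level package identity: the underlying name plus
/// the set of exclusions in effect. Two flavors of the same name with
/// different exclusion sets are distinct packages to the solver.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Pkg {
    name: String,
    excluded: BTreeSet<String>,
}

impl fmt::Display for Pkg {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.excluded.is_empty() {
            write!(f, "{}", self.name)
        } else {
            let list: Vec<&str> = self.excluded.iter().map(String::as_str).collect();
            write!(f, "{} (excluding {})", self.name, list.join(", "))
        }
    }
}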

Just some thoughts worth exactly what you paid for them. Always happy to continue the conversation.

baron1405 commented 1 year ago

Thanks @Eh2406 for your thoughtful reply!

In the case of my support for NPM, I implemented it solely to balance my Maven support so that I would not inadvertently tie the architecture to one versioning and metadata scheme. I am not actually intending to build a full-fledged NPM client, so I did not support multiple versions of the same package. I just treat the NPM registry as a source of packages and apply PubGrub to version solving as if only one version of a package were allowed.

Since I am barely in SAT solver kindergarten, I am in no position to even consider crafting essentially a new solver algorithm. Thanks for providing the warning about taking that path.

Your alternative approach appears very intriguing. I had not considered essentially synthesizing a dependency based on its published name and other characteristics such as exclusions, and then having that depend on the original dependency. To help me understand your approach, could you show me how it would be applied to the following graph, where B and D specify exclusions?

stateDiagram-v2
    state "B (exclude F)" as B
    state "D (exclude H)" as D
    A --> B
    A --> D
    B --> E
    B --> D
    D --> G
    G --> H: H ^5.0.0
    G --> F: F <2.0.0
    E --> H: H ^3.0.0
    E --> F: F ^4.0.0

In this example, package H ^3.0.0 is selected because the exclusion on D prevents H ^5.0.0 from being selected. In other words, H ^3.0.0 is selected through path A-B-E-H. Package F <2.0.0 is selected through path A-D-G-F.

Thanks again for your thoughts!

baron1405 commented 1 year ago

Rereading your comment and the PubGrub documentation for the n'th time, is the issue whether applying exclusions based on the path in the dependency graph violates the fundamental principle of incompatibilities being globally valid? Hence, is your tuple approach trying to provide that global validity? I guess the fundamental question is, can PubGrub be used when there are constructs such as exclusions, which depend on a given partial solution's path through the dependency graph? Clearly, there are solvers that handle these constructs (e.g. Maven, Coursier). Do you have an insight into how they accommodate these constructs? Do they just construct dependency graphs, backtrack when a conflict happens, rinse and repeat? By having a dependency graph around during resolution, they can determine whether a given exclusion applies.

mpizenberg commented 1 year ago

Your alternative approach appears very intriguing. I had not considered essentially synthesizing a dependency

You might find our guide interesting in this regard, especially the limitations section, which explores some of that "virtual dependencies" space: https://pubgrub-rs-guide.netlify.app/limitations/intro

And our "internals" section might also be of interest, to complement the official pubgrub docs: https://pubgrub-rs-guide.netlify.app/internals/intro

mpizenberg commented 1 year ago

In the original pubgrub shape, exclusions cannot be expressed when specifying a dependency. But if you want to allow for that freedom, you could enable specifying dependencies by writing incompatibilities directly. Then you have the full range of possibilities offered by what can be expressed in incompatibilities. We haven't gone down this road yet in the Rust impl, so there might be other gotchas.
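To sketch the idea in pubgrub's usual notation: a dependency "A 1.0.0 depends on B ^1.0.0" is stored as the incompatibility { A 1.0.0, not B ^1.0.0 }. Writing incompatibilities directly would also let you state things like { B any, F any }, meaning no version of B may ever be selected together with any version of F. Note that such a statement is global, so the path-scoped exclusions discussed above would still need an encoding along the lines of the tuple approach.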

Eh2406 commented 1 year ago

Rereading your comment and the PubGrub documentation for the n'th time, is the issue whether applying exclusions based on the path in the dependency graph violates the fundamental principle of incompatibilities being globally valid? Hence, is your tuple approach trying to provide that global validity?

Yes. For example, the incompatibility derived from "G depends on F" is not universally true; it's only true when there is no exclusion for F. Making the statement always true requires some engineering. I was proposing statements of the form "G (with exclusions for H & F) has no dependencies" and "G (with exclusions for H) depends on F", which are always true. The hard part is making sure the different flavors of G are interconnected when and only when they should be.

(BTW, suggestions/PR's to our documentation are always welcome. If the next reader can achieve understanding in n-2 readings they would appreciate it. )

I guess the fundamental question is, can PubGrub be used when there are constructs such as exclusions, which depend on a given partial solution's path through the dependency graph?

For some use cases this can be done by extending the kinds of inputs you give to PubGrub. (Or, by modifying the kinds of true statements incompatibilities are allowed to hold.) There is no general recommended way to make this modification for all graph dependent constraints. These conversations help us figure out what is known to be possible.

Clearly, there are solvers that handle these constructs (e.g. Maven, Coursier). Do you have an insight into how they accommodate these constructs?

I have not had time to look at how those solvers are implemented. Please report back anything you find; I would love to learn more. A common solution is to reduce the problem to SAT (or SMT) and use an existing solver. The limitation of this approach is that the error messages are then in terms of the reduced form and can be very hard to explain to a user. (Although comparing against a SAT solver can be a strong tool for testing that other solvers get correct answers.)

Do they just construct dependency graphs, backtrack when a conflict happens, rinse and repeat? By having a dependency graph around during resolution, they can determine whether a given exclusion applies.

What you describe is a straightforward backtracking algorithm. It works well in practice, as most real-world problems are small. You can make it more robust by only looking at constraints that are still relevant. In the SAT problem domain this algorithm family is known as DPLL solvers.
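To caricature that family in a few lines (a toy, nothing like any real solver; versions are plain integers and the constraint check is left abstract):

// Assign a version to each package in order, recheck the constraints,
// and pop on conflict. On failure it knows nothing except "nothing worked".
fn solve(
    domains: &[Vec<u32>],               // candidate versions per package
    conflicts: &dyn Fn(&[u32]) -> bool, // does this partial assignment conflict?
    chosen: &mut Vec<u32>,
) -> bool {
    if chosen.len() == domains.len() {
        return true; // every package has a version: solution found
    }
    for &v in &domains[chosen.len()] {
        chosen.push(v);
        if !conflicts(chosen.as_slice()) && solve(domains, conflicts, chosen) {
            return true;
        }
        chosen.pop(); // backtrack, recording nothing about why v failed
    }
    false
}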

Error messages (or unsatisfiability proofs) are difficult to add to DPLL solvers, because when they don't find a solution all they have is "Our internal data structures say we no longer have candidates to try. Therefore we have tried everything without finding a solution. Therefore there is no solution." Which is not particularly helpful to the human who was waiting for an answer. DPLL solvers can be made to construct auditable proofs by keeping track of everything they tried and why it didn't work in a separate data structure that can be turned into a proof upon failure. This works, but is very difficult to maintain. It is very easy for "what your algorithm tried" and "what your algorithm recorded itself as doing" to come out of sync, for example by skipping a candidate without recording a reason, or recording a reason whose "proof" is wrong.

To avoid this drift you can make sure that your "proofs" are correct and up-to-date by using them as the data structure for keeping track of your internal state. If you do a thorough enough job of this your main loop becomes "Try the first thing that could still work given the proofs we've generated so far. If that does not work generate a new proof that it didn't work and start over." You end up needing a few more techniques for combining sub proofs, so that you can always construct a proof that the thing you tried should not be tried again. This family of algorithms is known in the SAT problem domain as CDCL.
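Here is the same toy reworked in that proof-driven style, where the learned proofs are the only state carried between attempts. Without the sub-proof combining techniques just mentioned, the learned facts are very unspecific, but the shape of the loop is the point:

use std::collections::BTreeMap;

type Assignment = BTreeMap<usize, u32>; // package index -> chosen version

// A learned proof says: no solution contains all of these choices.
fn violates(partial: &Assignment, proof: &Assignment) -> bool {
    proof.iter().all(|(k, v)| partial.get(k) == Some(v))
}

fn solve(
    domains: &[Vec<u32>],                    // candidate versions per package
    conflicts: &dyn Fn(&Assignment) -> bool, // inherent conflicts in a partial assignment
) -> Option<Assignment> {
    let mut proofs: Vec<Assignment> = Vec::new();
    loop {
        // Try the first thing that could still work given the proofs so far.
        let mut partial = Assignment::new();
        while partial.len() < domains.len() {
            let pkg = partial.len();
            let pick = domains[pkg].iter().copied().find(|&v| {
                let mut trial = partial.clone();
                trial.insert(pkg, v);
                !conflicts(&trial) && !proofs.iter().any(|p| violates(&trial, p))
            });
            match pick {
                Some(v) => {
                    partial.insert(pkg, v);
                }
                None => {
                    if partial.is_empty() {
                        // No candidate for the first package survives the
                        // proofs: together they prove there is no solution.
                        return None;
                    }
                    // Learn a (crude) proof that this prefix is a dead end,
                    // then start over. Real solvers combine sub-proofs into
                    // far more general facts here.
                    proofs.push(std::mem::take(&mut partial));
                    break;
                }
            }
        }
        if partial.len() == domains.len() {
            return Some(partial); // all packages assigned
        }
    }
}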

(Of course all of these descriptions are a little oversimplified. It is possible to have a backtracking algorithm that occasionally relies on proofs to skip work, or even has the ability to combine proofs under some circumstances, while still mainly relying on its internal state to figure out when it's tried everything. Similarly, most proof-based solvers don't actually start from the very beginning each loop; they figure out what is invalidated by the newest proof and only throw out that portion of the work.)

In the SAT context each variable is either true or false. If you extend a CDCL solver's model so that each variable can be in one of N states (where N differs per variable), you have to reimagine how "proofs" are stored and combined. But you can end up with PubGrub.

Anyway... What was the original question? What point was I making?

Yes, it's possible they're using a backtracking algorithm. Or it's possible they're using some hybrid or extension inspired by any of the many solving techniques out there. Please let us know what you learn when you investigate!

baron1405 commented 1 year ago

Thanks @mpizenberg and @Eh2406 for your replies! I have a lot to consider and study thanks to your thoughtful suggestions. I will report back as I continue on this journey.

baron1405 commented 1 year ago

@Eh2406 As you mentioned, keeping the versions of each of these synthetic dependencies (e.g. G excluding H & F, G excluding F) in sync is a real challenge. From reading through the Optional dependencies page, I see that the optional dependency versions can be elegantly kept in sync with their base package by having the synthetic dependency depend upon the real dependency. That is possible because of the requirement that optional dependencies are purely additive. In contrast, with exclusions the dependencies are subtractive, so you cannot set up a dependency relationship to sync the versions. That seems to be the crux of the challenge. Does that sound like a correct characterization to you?

Eh2406 commented 1 year ago

Yes. The solution for optional dependencies assumes they are additive and therefore cannot be directly applied to exclusions. But it's possible it can be adapted. Perhaps we have a package G (with all excluded), which logically has no dependencies. Then a package G (with exclusions for H) can add dependencies on everything that is not H, with a synthetic dependency on exactly G (with all excluded) for coordination purposes. 🤔
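In code, the dependency-generation rule for these flavors might look something like this (untested sketch; version requirements are plain strings to keep it short):

use std::collections::BTreeSet;

/// Exclusion set in effect for a flavored package.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Excl {
    All,                    // the synthetic coordination flavor
    Only(BTreeSet<String>), // a concrete set of excluded names
}

/// Compute the dependencies of the flavor (name, excl).
fn flavored_deps(
    name: &str,
    version: &str,
    real_deps: &[(String, String)], // declared (dependency name, requirement) pairs
    excl: &Excl,
) -> Vec<((String, Excl), String)> {
    let excluded = match excl {
        Excl::All => return Vec::new(), // "G (with all excluded)" has no dependencies
        Excl::Only(set) => set,
    };
    let mut deps: Vec<((String, Excl), String)> = real_deps
        .iter()
        .filter(|(dep, _)| !excluded.contains(dep)) // drop excluded dependencies
        .map(|(dep, req)| {
            // Each surviving dependency carries the exclusion set forward.
            // (A real version would also union in any exclusions the
            // dependency declares for itself.)
            ((dep.clone(), Excl::Only(excluded.clone())), req.clone())
        })
        .collect();
    if !excluded.is_empty() {
        // The synthetic edge: exactly this version of the all-excluded
        // flavor, so every flavor of the package resolves to the same version.
        deps.push(((name.to_string(), Excl::All), format!("={}", version)));
    }
    deps
}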

baron1405 commented 1 year ago

That sounds promising! To illustrate the approach, let's say we have the following graph where two dependencies have exclusions:

graph TB
    style B_excl_F stroke-width: 5px
    style D_excl_H stroke-width: 5px
    A --> G
    A --> B_excl_F
    A --> D_excl_H
    D_excl_H --> G
    B_excl_F --> D_excl_H
    B_excl_F --> E
    G --> F
    G --> H
    E --> H

The nodes with thick borders are the ones with exclusions specified. The notation is <node>_excl_<excluded node>_.... For example, the node D_excl_H_F is the node D which excludes nodes H and F.

Now, applying the approach where there is a synthetic all-excluded dependency, and also carrying forward the union of predecessor exclusions, this graph becomes:

graph TB
    style D_excl_all stroke-dasharray: 5 5
    style G_excl_all stroke-dasharray: 5 5
    style B_excl_all stroke-dasharray: 5 5
    style E_excl_all stroke-dasharray: 5 5
    style F_excl_all stroke-dasharray: 5 5
    style H_excl_all stroke-dasharray: 5 5
    style B_excl_F stroke-width: 5px
    style D_excl_H stroke-width: 5px
    A ---> G
    A --> B_excl_F
    A --> D_excl_H
    D_excl_H_F --> G_excl_H_F
    D_excl_H --> D_excl_all
    D_excl_H_F --> D_excl_all
    D_excl_H ---> G_excl_H
    B_excl_F --> E_excl_F
    B_excl_F --> D_excl_H_F
    B_excl_F --> B_excl_all
    G_excl_H --> G_excl_all
    G_excl_H --> F_excl_H
    G --> F
    G -----> G_excl_all
    G --> H
    H --> H_excl_all
    G_excl_H_F --> G_excl_all 
    E_excl_F --> H_excl_F
    H_excl_F --> H_excl_all
    E_excl_F --> E_excl_all
    F_excl_H --> F_excl_all
    F -----> F_excl_all

The nodes with dashed borders are the synthetic nodes that exclude all dependencies. Similar to the optional dependency implementation, the synthetic incompatibility will be generated at the same time as the exclude incompatibility. Exclusions cause quite an increase in the number of incompatibilities, but that seems typical when translating such conditions into logic predicates for a SAT solver.

Can anyone on this issue see any inherent gotchas or have tips that might guide me in implementing this approach?

Eh2406 commented 6 months ago

I'm going to close this issue to make it easier to keep track of what work is outstanding on this library. Nonetheless I am still watching the issue, and eagerly look forward to any updates on your research or experiments.