Eh2406 closed this 11 months ago
Instead of having a single `choose_package_version` function giving full choice to the dependency provider every time a decision needs to be taken, the decision process is split in two:

- `prioritize`, which ranks `(Package, Range)` pairs in a priority list;
- `choose_version`, which, given a `(Package, Range)` pair, simply picks a version.

The big advantage of the `prioritize` function is that picking a given package is much more efficient when there is a lot of choice. For example, without a priority queue, when a package introduces 10 new dependencies we have to choose one package among 10, then among 9, etc., basically a complexity of N^2 where N is the number of potential packages. If instead we sort them in a priority queue, the complexity is N*log(N), since we only need to sort the dependencies once according to a computed priority; then we can simply pick them one after the other.
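To make the complexity argument concrete, here is a standalone sketch (not pubgrub's actual code; package names and the priority numbers are made up): repeatedly scanning the remaining candidates for the best one is O(N^2) overall, while pushing them once into a `BinaryHeap` and popping is O(N*log(N)).

```rust
use std::collections::BinaryHeap;

// Hypothetical priority: higher number = pick first.
// Repeatedly scanning the remaining candidates is O(N^2) overall.
fn pick_all_by_scan(mut candidates: Vec<(String, u32)>) -> Vec<String> {
    let mut picked = Vec::new();
    while !candidates.is_empty() {
        // O(N) scan to find the highest-priority package...
        let best = candidates
            .iter()
            .enumerate()
            .max_by_key(|(_, (_, prio))| *prio)
            .map(|(i, _)| i)
            .unwrap();
        picked.push(candidates.swap_remove(best).0); // ...done N times.
    }
    picked
}

// Sorting once into a heap and popping is O(N * log N) overall.
fn pick_all_by_heap(candidates: Vec<(String, u32)>) -> Vec<String> {
    let mut heap: BinaryHeap<(u32, String)> = candidates
        .into_iter()
        .map(|(name, prio)| (prio, name))
        .collect();
    let mut picked = Vec::new();
    while let Some((_, name)) = heap.pop() {
        picked.push(name);
    }
    picked
}
```

When priorities are distinct, both versions produce the same pick order; only the amount of work differs.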
One drawback I can think of is that the priority is not re-evaluated when gaining new knowledge after picking new dependencies (is it?). If I understood correctly though, any backtrack clears the prioritized potential packages, so the drawback is not that problematic. I didn't fully understand, though, the play with indices and decision levels and exactly how you choose to update the packages in the priority list.
So I have one main question after reading this: how much of this idea could be implemented without changing the current API of `choose_package_version`? Could we build a priority list fully in the dependency provider? If not exactly like this, how similar in terms of compute complexity?
By the way, the changes in `extract_solution` to return the solution instead of an `Option` are unrelated to this PR, right? The change basically assumes that `PubGrubError::Failure("How did we end up with no package to choose but no solution?")` cannot happen, if I understand correctly.
> I didn't fully understand, though, the play with indices and decision levels and exactly how you choose to update the packages in the priority list.
`indexmap` gives us a HashMap with the added property that all keys have an index and iteration is in index order. I use this for two things:

- When we make a decision, the package is swapped so that it sits at the index of `current_decision_level`. This means that the beginning of the iteration will be all the decisions, made in order. `extract_solution` takes advantage of that to get slightly more efficient.
- When a package's possible range changes, we lower `changed_this_decision_level` so that it is smaller than or equal to the index of the changed package.

The result is that we can efficiently update `prioritized_potential_packages`: we only need to re-prioritize the packages at or after `changed_this_decision_level`, which includes all the things we modified.

> Could we build a priority list fully in the dependency provider?
I don't think so. To get the perf benefits, `choose_package_version` needs to do work proportional to the number of packages whose ranges changed, not proportional to `len(packages)`. But it has no way of knowing what has changed.
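A toy, std-only analogue of this bookkeeping (the real code uses the `indexmap` crate, and the field names here only echo the PR's): packages live at stable indices, and a watermark tracks the smallest index whose range changed, so re-prioritization only touches the stale suffix.

```rust
// Toy sketch: a watermark `changed_this_decision_level` marks the first
// index whose range changed; everything before it still has a valid priority.
struct Prioritized {
    packages: Vec<(String, u32)>,       // (package, last computed priority)
    changed_this_decision_level: usize, // entries at or after this index are stale
}

impl Prioritized {
    fn mark_changed(&mut self, package: &str) {
        if let Some(i) = self.packages.iter().position(|(p, _)| p == package) {
            // Lower the watermark so it is <= the index of the changed package.
            self.changed_this_decision_level = self.changed_this_decision_level.min(i);
        }
    }

    // Re-run a (here: hypothetical) priority function only on the stale
    // suffix, doing work proportional to the number of changed packages.
    fn reprioritize(&mut self, priority_of: impl Fn(&str) -> u32) -> usize {
        let stale = &mut self.packages[self.changed_this_decision_level..];
        for (p, prio) in stale.iter_mut() {
            *prio = priority_of(p.as_str());
        }
        let recomputed = stale.len();
        self.changed_this_decision_level = self.packages.len();
        recomputed
    }
}
```

The point of the sketch is the return value of `reprioritize`: it is the number of changed packages, not the total number of packages.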
Theoretically, if there are never two things with the same `Priority` and the heuristic implemented by `choose_package_version` is the same as the one in `prioritize` + `choose_version`, then every step of the entire algorithm should be exactly the same. Using my WIP-POC crates runner (based on the crates2 branch) I have started to experiment to see if this works in practice. The first try was to record the wonderful new logging data (thanks #107), but it filled up my SSD. 😢 So next I tried adding an `impl Hasher` as an argument to `resolve` and calling `hash.write(format!("{}:{}:{}:{:?}", decision.0, decision.1, state.incompatibility_store.len(), state.partial_solution.current_decision_level).as_bytes())` just after the call to `log::info!("DP chose: {} @ {:?}", decision.0, decision.1);`. I then recorded the result for each of 285204 crates with both versions, `dev` and `que`. The result: all but 4 crates (`solana-stake-o-matic` `v1.2.18` and `v1.2.19`, and `solana-install` `v1.2.18` and `v1.2.19`) had the same hash! I will look into how and why they are different, but it is clear that the time savings are not because this PR is calculating something different!
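The trace-comparison trick described above can be sketched like this (a standalone illustration; the decision tuples and crate names in the example are made up, not the PR's actual data): feed each decision into a `Hasher` and compare the final hashes of two runs. Identical hashes strongly suggest the runs made exactly the same choices in the same order.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hash a sequence of (package, version, decision_level) decisions into a
// single fingerprint, so two resolver runs can be compared cheaply.
fn trace_hash(decisions: &[(&str, &str, usize)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (package, version, decision_level) in decisions {
        hasher.write(format!("{}:{}:{}", package, version, decision_level).as_bytes());
    }
    hasher.finish()
}
```

Note that the order of decisions matters: two runs that make the same decisions in a different order produce different fingerprints, which is exactly what makes this useful for checking that two implementations take the same steps.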
That is great! And indeed a little weird that 4 crates had a different output.
One objection to this PR is that there are algorithms that can be implemented by `choose_package_version` that cannot be implemented in this API. Two examples (with not all the details worked out):
```rust
let v: Vec<_> = packages.collect();
let mut hasher = FxHasher::default();
v.iter()
    .for_each(|(p, r)| (p.borrow(), r.borrow()).hash(&mut hasher));
let hash = hasher.finish() as usize;
let (p, r) = &v[hash % v.len()];
```
or
```rust
let v: Vec<_> = packages.collect();
let has_a = v.iter().find(|(p, _)| p == &package_named_a);
let has_b = v.iter().find(|(p, _)| p == &package_named_b);
let has_c = v.iter().find(|(p, _)| p == &package_named_c);
if has_a.is_some() && has_b.is_some() {
    return has_b;
} else if has_b.is_some() && has_c.is_some() {
    return has_c;
} else if has_a.is_some() && has_c.is_some() {
    return has_a;
} else {
    return v.last();
}
```
These examples do prove that there are things the new API cannot express. But they are also chaotic and incoherent; they are not examples of someone doing something reasonable or justified. So are there reasonable examples, ones where someone could explain why they are writing that code? I don't think so, mostly because I cannot think of any. But I would not be surprised if the argument from this video could be adapted to prove that all implementations which do not have a consistent prioritization are internally contradictory.
rebased.
> But I would not be surprised if the argument from this video could be adapted to prove that all implementations which do not have a consistent prioritization are internally contradictory.
This assumes that the priority computed by the "utility function" (as the video calls it) has access to all the parts of the world state it needs until we ask to re-compute it. Which is not the case here: it's a tradeoff of when we ask for package priorities to be recomputed.
I thought this PR needed rebasing but it seems ok actually. What would be nice is better documentation of the new process, detailing under which conditions a package priority is evaluated and re-evaluated, and finding the best-suited place in doc comments to put these explanations. @Eh2406 once you've added more doc and feel this is ready for review, let me know and I'll make a more thorough reading and review.
Thanks for the new doc comments! I'll start reviewing and making some naming changes during this week. Is that ok with you @Eh2406, or do you still want to apply some changes here first, to avoid conflicts?
I think it's in good shape for review.
It would be really nice to have a helper like the old `choose_package_with_fewest_versions`, but I am unlikely to find time to figure that out before you have time to review.
I will try to get this updated and cleaned up soon. But I would like to get away from long-lived branches. If there are problems with this, or ways it can be improved (and there are), let's merge it as is and open issues to follow up.
While reviewing and rebasing the code today, it became clear that one complicated part of this PR is the many clever things I have done with the fact that `package_assignments` is ordered. I have shoved a lot of information into that structure (which packages have had decisions, and which packages have been modified since the last time they were prioritized). If that is what is preventing us from merging this code, it would be possible to store that data in other forms. I haven't figured out the details, but a HashSet for "needs re-prioritization" would be pretty direct. If that would be more mergeable, please let me know.
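The HashSet alternative floated above could look something like this (a sketch of the idea only; none of these names exist in the PR): instead of encoding "modified since last prioritized" in the order of `package_assignments`, keep an explicit set of dirty packages.

```rust
use std::collections::HashSet;

// Explicit "needs re-prioritization" set, as an alternative to encoding
// the same information in the index order of an ordered map.
struct NeedsReprioritization {
    dirty: HashSet<String>,
}

impl NeedsReprioritization {
    // Record that a package's range changed; inserting twice is a no-op.
    fn mark(&mut self, package: &str) {
        self.dirty.insert(package.to_string());
    }

    // Drain the set, recomputing priorities only for dirty packages,
    // and return how many were recomputed.
    fn reprioritize(&mut self, mut recompute: impl FnMut(&str)) -> usize {
        let n = self.dirty.len();
        for p in self.dirty.drain() {
            recompute(&p);
        }
        n
    }
}
```

This is more direct to read than the index tricks, at the cost of an extra collection to keep in sync.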
I migrated our project over to this branch without much trouble. I think the separation in the API is actually pretty nice, even ignoring the potential performance benefits.
Just a thought that came to mind: the priority list is a nice way of knowing which concurrent requests to prioritize, if we are to enable concurrent dependency provider requests while waiting for the solver's next demand. Also, since the solver would always re-ask the dependency provider for an updated priority when backtracking happens, it also gives the dependency provider a way to know which queued requests should be cancelled. Right?
> The priority list is a nice way of knowing which concurrent requests to prioritize, if we are to enable concurrent dependency provider requests while waiting for the solver's next demand.
The API in this PR so far allows a pretty good approximation of this. If a request has already been cached, it gets a high priority; if the request has not yet returned, it gets a low priority. The resolver will work through all of the known results before it asks for an unknown result. Unfortunately, when the resolver runs out of known results, it will pick an item at random to block on, which may or may not have completed in the meantime. This could be improved by providing a way for a dependency provider to say "by the way, could we reprioritize x" (maybe something on `should_cancel`?) or a way to say "if the highest priority is below x, then reprioritize everything". I will look into whether there is a dependency queuing implementation that handles the async -> sync conversion for us.
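The cached-vs-in-flight approximation can be sketched as follows (names here are hypothetical, not part of the pubgrub API): metadata that has already arrived gets a high priority, in-flight requests a low one, so the resolver drains known results before it has to block on the network.

```rust
use std::collections::HashMap;

// Derived Ord uses declaration order, so StillFetching sorts below any Cached.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    StillFetching, // lowest: picking this package would block on the network
    Cached(u32),   // higher: a real heuristic can go in here
}

// Hypothetical cache mapping package name -> number of known versions.
fn prioritize(cache: &HashMap<String, u32>, package: &str) -> Priority {
    match cache.get(package) {
        // Hypothetical heuristic: fewer candidate versions => higher priority.
        Some(&version_count) => Priority::Cached(u32::MAX - version_count),
        None => Priority::StillFetching,
    }
}
```

With this shape, any package whose metadata has arrived outranks every package still being fetched, which is exactly the "work through known results first" behavior described above.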
> Also, since the solver would always re-ask the dependency provider for an updated priority when backtracking happens, it also gives a way for the dependency provider to know which queued requests should be cancelled. Right?
Theoretically, yes. It would be more ergonomic if we provided a reliable way for the dependency provider to know that we backtracked.
Using PubGrub-rs with async code is not intuitive under 0.2, nor dev, nor this PR. Let's not let the conversation about how to make the async situation better block merging this PR.
I took the liberty of rebasing and fixing conflicts. Feel free to merge if you are satisfied, @Eh2406.
I'm going to try this new merged queue thing! If/when people find problems, we can open issues and get them addressed!
BREAKING CHANGE: Changes the API of `DependencyProvider`.

Still WIP; it needs a lot of documentation, and some thinking about the ergonomics of the API. For example, what should be able to return an error? How will this work with https://github.com/pubgrub-rs/advanced_dependency_providers/issues/6?
This is a big perf win on larger benchmarks. Some previous discussion:

For example, generating a ron file from all crates in a snapshot of Crates.io went from 21,648s with `v0.2.1` to 3,209s with this branch.