pubgrub-rs / pubgrub

PubGrub version solving algorithm implemented in Rust
https://pubgrub-rs.github.io/pubgrub/pubgrub/
Mozilla Public License 2.0

perf!: use a priority queue #104

Closed Eh2406 closed 11 months ago

Eh2406 commented 3 years ago

BREAKING CHANGE: Changes the api of DependencyProvider

Still WIP; it needs a lot of documentation, and some thinking about the ergonomics of the API. For example, what should be able to return an error? How will this work with https://github.com/pubgrub-rs/advanced_dependency_providers/issues/6?

This is a big perf win on larger benchmarks. Some previous discussion:

For example Generating a ron file from all crates in a snapshot of Crates.io went from 21,648s with v0.2.1 to 3,209s with this branch.

mpizenberg commented 3 years ago

Instead of having one choose_package_version function, giving full choice to the dependency provider every time a decision needs to be taken, the decision process is split in two:

  1. prioritize, which ranks (Package, Range) pairs in a priority list
  2. choose_version, which, given a (Package, Range) pair, simply picks a version

The big advantage of the prioritize function is that picking a given package is much more efficient when there is a lot of choice. For example, without priority queues, when a package introduces 10 new dependencies, we have to choose one package among 10, then among 9, etc. That is basically N^2 complexity, if N is the number of potential packages. If instead we sort them in a priority queue, the complexity is N*log(N), since we just need to sort the dependencies once according to a computed priority; then we can simply pick them one after the other.
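A minimal sketch of the two-step process, using simplified stand-in types (the actual pubgrub trait signatures and Priority type differ; the "fewest candidate versions" heuristic here is just an illustration):

```rust
use std::collections::BinaryHeap;

// Hypothetical, simplified stand-ins for pubgrub's Package/Range types.
type Package = &'static str;
type Range = (u32, u32); // inclusive version bounds, for illustration only

// Step 1 of the split: rank a (Package, Range) pair once.
fn prioritize(_package: Package, range: Range) -> u32 {
    // e.g. prefer packages with the fewest candidate versions
    let (lo, hi) = range;
    u32::MAX - (hi - lo) // fewer candidates => higher priority
}

// Step 2 of the split: given the chosen pair, simply pick a version.
fn choose_version(_package: Package, range: Range) -> u32 {
    range.1 // pick the highest allowed version
}

// O(N log N): each candidate is pushed and popped once, instead of
// re-scanning the remaining candidates at every decision.
fn decide(candidates: &[(Package, Range)]) -> Vec<(Package, u32)> {
    let mut queue: BinaryHeap<(u32, Package, Range)> = candidates
        .iter()
        .map(|&(p, r)| (prioritize(p, r), p, r))
        .collect();
    let mut decisions = Vec::new();
    while let Some((_prio, p, r)) = queue.pop() {
        decisions.push((p, choose_version(p, r)));
    }
    decisions
}
```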

One drawback I can think of is that the evaluated priority is not re-evaluated when gaining new knowledge after picking new dependencies (is it?). If I understood correctly though, any backtrack clears the prioritized potential packages, so the drawback is not that problematic. I didn't fully understand, though, the interplay of indices and decision levels, and exactly how you choose to update the packages in the priority list.

So I have one main question after reading this: how much of this idea could be implemented without changing the current API of choose_package_version? Could we build a priority list fully in the dependency provider? If not exactly like this, how similar could it be in terms of compute complexity?


By the way, the change in extract_solution to return the solution instead of an Option is unrelated to the PR, right? The change basically assumes that the PubGrubError::Failure("How did we end up with no package to choose but no solution?") cannot happen, if I understand correctly.

Eh2406 commented 3 years ago

I didn't fully understand, though, the interplay of indices and decision levels, and exactly how you choose to update the packages in the priority list.

indexmap gives us a HashMap with the added property that all keys have an index and the iteration is in index order. I use this for two things.

The result is that we can efficiently update the prioritized_potential_packages:
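The snippet originally attached here isn't preserved in the thread. As an illustration only, a std-only sketch of one way to keep such a structure cheaply updated (using "lazy deletion" in a BinaryHeap rather than indexmap's index tricks; all names hypothetical):

```rust
use std::collections::{BinaryHeap, HashMap};

// Illustration only: a "lazy deletion" priority queue. When a package's
// range changes we push a fresh entry and bump its generation; stale
// entries are discarded when popped. Work done per update is proportional
// to the number of *changed* packages, not to the total number tracked.
struct PrioritizedPackages {
    heap: BinaryHeap<(u32, u64, &'static str)>, // (priority, generation, package)
    generation: HashMap<&'static str, u64>,
}

impl PrioritizedPackages {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), generation: HashMap::new() }
    }

    // Called only for packages whose ranges changed since the last decision.
    fn reprioritize(&mut self, package: &'static str, priority: u32) {
        let g = self.generation.entry(package).or_insert(0);
        *g += 1;
        self.heap.push((priority, *g, package));
    }

    fn pop_highest(&mut self) -> Option<&'static str> {
        while let Some((_, g, package)) = self.heap.pop() {
            // Skip entries made stale by a later reprioritize call.
            if self.generation.get(package) == Some(&g) {
                self.generation.remove(package);
                return Some(package);
            }
        }
        None
    }
}
```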

Could we build a priority list fully in the dependency provider?

I don't think so. To get the perf benefits, choose_package_version needs to do work proportional to the number of changed ranges in packages, not proportional to len(packages). But it has no way of knowing what has changed.

Eh2406 commented 3 years ago

Theoretically, if there are never two things with the same Priority, and the heuristic implemented by choose_package_version is the same as the one in prioritize+choose_version, then every step of the entire algorithm should be exactly the same.

Using my WIP-POC crates runner (based on the crates2 branch) I have started to experiment to see if this works in practice. The first try was to record the wonderful new logging data (thanks #107), but it filled up my SSD. šŸ˜¢ So next I tried adding an impl Hasher as an argument to resolve and calling hash.write(format!("{}:{}:{}:{:?}", decision.0, decision.1, state.incompatibility_store.len(), state.partial_solution.current_decision_level).as_bytes()) just after the call to log::info!("DP chose: {} @ {:?}", decision.0, decision.1);. I then recorded the result for each of 285204 crates with both versions, dev and que. The result: all but 4 crates (solana-stake-o-matic v1.2.18, v1.2.19, and solana-install v1.2.18, v1.2.19) had the same hash! I will look into how and why they are different, but it is clear that the time savings are not because this PR is calculating something different!

mpizenberg commented 3 years ago

That is great! And indeed a little weird that 4 crates had a different output.

Eh2406 commented 3 years ago

One objection to this PR is that there are algorithms that can be implemented by choose_package_version that cannot be implemented in this API. Two examples (with not all the details worked out):

// Pick a pseudo-random package based on a hash of the whole candidate set.
let v: Vec<_> = packages.collect();
let mut hasher = FxHasher::default();
for (p, r) in v.iter() {
    (p.borrow(), r.borrow()).hash(&mut hasher);
}
let hash = hasher.finish() as usize;
let (p, r) = &v[hash % v.len()];

or

let v: Vec<_> = packages.collect();
let has_a = v.iter().find(|(p, _)| p == package_named_a);
let has_b = v.iter().find(|(p, _)| p == package_named_b);
let has_c = v.iter().find(|(p, _)| p == package_named_c);

if has_a.is_some() && has_b.is_some() {
    return has_b;
} else if has_b.is_some() && has_c.is_some() {
    return has_c;
} else if has_a.is_some() && has_c.is_some() {
    return has_a;
} else {
    return v.last();
}

These examples do prove that there are things the new API cannot express. But they are also chaotic and incoherent; they are not examples of someone doing something reasonable or justified. So are there reasonable examples, examples where someone could explain why they are writing that code? I don't think so, mostly because I cannot think of any. But I would not be surprised if the argument from this video can be adapted to prove that all implementations which do not have a consistent prioritization are internally contradictory.

Eh2406 commented 2 years ago

rebased.

mpizenberg commented 1 year ago

But I would not be surprised if the argument from this video can be adapted to prove that all implementations which do not have a consistent prioritization are internally contradictory.

This is assuming that the priority computed by the "utility function" (as the video calls it) has access to all the parts of the world state it needs at the moment it is evaluated. That is not the case here: there is a tradeoff in when we ask for package priorities to be recomputed.

mpizenberg commented 1 year ago

I thought this PR needed rebasing, but it seems ok actually. What would be nice is better documentation of the new process, detailing under which conditions a package priority is evaluated and re-evaluated, and finding the best-suited place in doc comments for these explanations. @Eh2406, once you've added more docs and feel this is ready for review, let me know and I'll do a more thorough reading and review.

mpizenberg commented 1 year ago

Thanks for the new doc comments! I'll start reviewing and making some naming changes during this week. Is that ok with you @Eh2406, or do you still want to apply some changes here first, to avoid conflicts?

Eh2406 commented 1 year ago

I think it's in good shape for review. It would be really nice to have a helper like the old choose_package_with_fewest_versions, but I am unlikely to find time to figure that out before you have time to review.

Eh2406 commented 11 months ago

I will try to get this updated and cleaned up soon. But I would like to get away from long-lived branches. If there are problems with this, or ways it can be improved (and there are), let's merge it as is and open issues to follow up.

Eh2406 commented 11 months ago

While reviewing and rebasing the code today, it became clear that one complicated part of this PR is the many clever things I have done with the fact that package_assignments is ordered. I have shoved a lot of information into that structure (which packages have had decisions, and which packages have been modified since the last time they were prioritized). If that is what is preventing us from merging this code, it would be possible to store that data in other forms. I haven't figured out the details, but a HashSet for "needs re-prioritization" would be pretty direct. If that would be more mergeable, please let me know.
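A sketch of that HashSet alternative (hypothetical, simplified names; the real package_assignments bookkeeping is richer than this):

```rust
use std::collections::HashSet;

// Hypothetical alternative to encoding "modified since last prioritized"
// in the ordering of package_assignments: track the two facts explicitly.
struct SolverState {
    decided: HashSet<&'static str>, // packages that already have a decision
    needs_reprioritization: HashSet<&'static str>, // modified since last prioritized
}

impl SolverState {
    fn new() -> Self {
        Self { decided: HashSet::new(), needs_reprioritization: HashSet::new() }
    }

    // Called whenever new incompatibilities narrow a package's range.
    fn mark_modified(&mut self, package: &'static str) {
        if !self.decided.contains(package) {
            self.needs_reprioritization.insert(package);
        }
    }

    // Drain exactly the packages whose priorities must be recomputed,
    // doing work proportional to the number of changes.
    fn take_dirty(&mut self) -> Vec<&'static str> {
        let mut dirty: Vec<_> = self.needs_reprioritization.drain().collect();
        dirty.sort(); // deterministic order for the example
        dirty
    }
}
```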

charliermarsh commented 11 months ago

I migrated our project over to this branch without much trouble. I think the separation in the API is actually pretty nice, even ignoring the potential performance benefits.

mpizenberg commented 11 months ago

Just a thought that came to mind. The priority list is a nice way of knowing which concurrent requests to prioritize, if we are to enable concurrent dependency provider requests while waiting for the next solver's demand. Also, since the solver would always re-ask the dependency provider for an updated priority when backtracking happens, it also gives a way for the dependency provider to know which queued requests should be cancelled. Right?

Eh2406 commented 11 months ago

The priority list is a nice way of knowing which concurrent requests to prioritize, if we are to enable concurrent dependency provider requests while waiting for the next solver's demand.

The API in this PR so far allows a pretty good approximation of this. If a request has already been cached, it gets a high priority; if the request has not yet returned, it gets a low priority. The resolver will work through all of the known results before it asks for an unknown result. Unfortunately, when the resolver runs out of known results it will pick an item at random to block on, which may or may not have completed in the meantime. This could be improved by providing a way for the dependency provider to say "by the way, could we reprioritize x" (maybe something on should_cancel?) or a way to say "if the highest priority is below x, then reprioritize everything". I will look into whether there is a dependency queuing implementation that handles the async -> sync conversion for us.
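A sketch of that approximation (hypothetical types and constants; a real dependency provider would consult whatever tracks its in-flight async requests):

```rust
use std::collections::HashMap;

// Hypothetical request states for a dependency provider that fetches
// metadata concurrently while the solver runs.
#[derive(Clone, Copy)]
enum Fetch {
    Cached,  // metadata already available locally
    Pending, // request in flight; picking this package would block
}

// Rank already-cached packages above in-flight ones, so the resolver
// works through all known results before blocking on an unknown one.
fn prioritize(requests: &HashMap<&'static str, Fetch>, package: &str) -> u32 {
    match requests.get(package) {
        Some(Fetch::Cached) => 100,       // high: can proceed immediately
        Some(Fetch::Pending) | None => 1, // low: would have to block
    }
}
```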

Also, since the solver would always re-ask the dependency provider for an updated priority when backtracking happens, it also gives a way for the dependency provider to know which queued requests should be cancelled. Right?

Theoretically yes. It would be more ergonomic if we provided a reliable way for the dependency provider to know that we backtracked.

Using PubGrub-rs with async code is not intuitive under 0.2, nor under dev, nor under this PR. Let's not let the conversation about how to make the async situation better block merging this PR.

aleksator commented 11 months ago

I took the liberty of rebasing and fixing conflicts. Feel free to merge if you are satisfied, @Eh2406.

Eh2406 commented 11 months ago

I'm going to try this new merged queue thing! If/when people find problems, we can open issues and get them addressed!