pubgrub-rs / pubgrub

PubGrub version solving algorithm implemented in Rust
https://pubgrub-rs.github.io/pubgrub/pubgrub/
Mozilla Public License 2.0
403 stars 36 forks source link

Support for `Range::union_many` or similar #249

Open zanieb opened 2 months ago

zanieb commented 2 months ago

I find myself constructing a range in a loop with union, which we worry is not performant. It'd be nice to have a union_many that takes an iterable of ranges and unions them in a single efficient operation.

Some example usage

https://github.com/astral-sh/uv/blob/0335efd0a901ee2351e663d0488b1ce87fd4cc80/crates/pep508-rs/src/marker/algebra.rs#L638-L645

https://github.com/astral-sh/uv/blob/e1a8beb64b9f6eb573b3af616ffff680c282a316/crates/uv-resolver/src/pubgrub/report.rs#L1098-L1102

I started to take a look at this myself, but realized I'm out of my depth :)

Eh2406 commented 2 months ago

Yes, union is linear in the length of the ranges being merged so doing it in a loop is quadratic. The fundamental union algorithm is:

  1. Create a combined list of all of the segments from all inputs.
  2. Sort this list based on segments start.
  3. Merge consecutive overlapping segments.
  4. Declare what's left the union range.

Some optimizations can be done with the sort because we know the inputs are themselves sorted. Similarly we can duplicate the list as we constructe it for the classic 2 argument case.

A union_many could use a binary heap to find the next segment to process.

mpizenberg commented 2 months ago

Well summarized!

It’s been a long time since I looked at this. I suppose a rather simple adaptation of the current code would just need to adjust the smaller_interval selection among a list of segments iterators (one per range). Instead of just checking the "left" and "right", it would check every iterator head.

https://github.com/pubgrub-rs/pubgrub/blob/68fd200cf6e1acfb4431fc9c04698d76b00173d3/src/range.rs#L437

That would be better than repeatedly computing union, but still not optimal. An optimal one I guess would also remember how the iterator heads are sorted among themselves. Is that what you meant with the binary heap @Eh2406 ?

Eh2406 commented 2 months ago

Yes. smaller_interval is "among all non-exhausted iterators pick the first segment from the one who's got the smallest start point". If we had a IterWithOrdBySmallestStart then we could calculate smaller_interval by popping a BinaryHeap<IterWithOrdBySmallestStart>. If the IterWithOrdBySmallestStart is not exhausted then re-add what's left to the BinaryHeap. If the binary keep is empty then you're done.