rust-itertools / itertools

Extra iterator adaptors, iterator methods, free functions, and macros.
https://docs.rs/itertools/
Apache License 2.0

Add k_smallest_relaxed and variants #925

Closed adamreichold closed 3 months ago

adamreichold commented 5 months ago

This implements the algorithm described in the referenced issue: it consumes twice as much memory as the existing `k_smallest` algorithm but achieves time linear in the number of input elements.

I expect this to fail the MSRV job, as `select_nth_unstable` was stabilized in Rust 1.49 and is therefore not available in 1.43.1. I am proposing it anyway, either for when an MSRV increase is planned or in case it is deemed useful enough to justify one.
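For context, here is a minimal sketch of the relaxed algorithm (my own reconstruction, not the PR's actual code): keep a buffer of up to 2k elements and, whenever it fills, partition around the k-th smallest with `select_nth_unstable` and truncate back to k. Each selection pass is O(k) and discards k elements, so the total runtime is linear in the input length.

```rust
/// Sketch of the "relaxed" k-smallest idea: buffer up to 2k elements,
/// then use `select_nth_unstable` (stable since Rust 1.49) to keep only
/// the k smallest whenever the buffer fills up.
fn k_smallest_relaxed_sketch<T: Ord>(iter: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    if k == 0 {
        return Vec::new();
    }
    let mut buf = Vec::with_capacity(2 * k);
    for item in iter {
        buf.push(item);
        if buf.len() == 2 * k {
            // Partition so the k smallest elements occupy buf[..k],
            // then discard the rest. O(k) amortized per k elements.
            buf.select_nth_unstable(k - 1);
            buf.truncate(k);
        }
    }
    // Final pass: sort the survivors and keep the k smallest.
    buf.sort_unstable();
    buf.truncate(k);
    buf
}
```

Unlike the heap-based version, the buffer here temporarily holds 2k elements, which is exactly the space-time trade-off discussed below.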

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.65%. Comparing base (6814180) to head (88e5a2d). Report is 108 commits behind head on master.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #925      +/-   ##
==========================================
+ Coverage   94.38%   94.65%   +0.26%
==========================================
  Files          48       49       +1
  Lines        6665     7215     +550
==========================================
+ Hits         6291     6829     +538
- Misses        374      386      +12
```


adamreichold commented 5 months ago

> I note that we would have 12 methods... `k_(smallest|largest)[_relaxed][_by[_key]]` which is a lot!

I agree, and if pressed to make a choice, I would argue that the relaxed version would be a reasonable general choice, i.e. it could completely replace the heap-based versions.

For one, k is usually bounded by small fixed numbers, and I do not really see many situations where I have the memory to compute the top 100 but not enough to invest in the top 200. And since the implementation can work directly with `select_nth_unstable_by`, the complications of a custom heap implementation could be avoided.
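For reference, a quick illustration of the partitioning contract of the std API in question (the wrapper function here is mine, just for demonstration): after the call, the element at the given index is the one that would be there if the slice were sorted, everything before it compares less than or equal, everything after greater than or equal, and the call runs in O(len) on average.

```rust
/// Demonstrates `select_nth_unstable` (stable since Rust 1.49): returns the
/// elements before the n-th position, the n-th smallest value itself, and
/// the elements after it, in partition (not fully sorted) order.
fn demo_select_nth(mut v: Vec<i32>, n: usize) -> (Vec<i32>, i32, Vec<i32>) {
    let (lesser, nth, greater) = v.select_nth_unstable(n);
    (lesser.to_vec(), *nth, greater.to_vec())
}
```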

However, making this choice does not appear obviously better than having twelve methods.

Alternatively, the combinatorics could be reined in by adding a "strategy-like" type parameter, e.g. `Strict` and `Relaxed`, to indicate the desired memory bound.

Finally, I am not sure the `k_largest*` methods really carry their weight now that `k_smallest_by` and `k_smallest_by_key` exist: in the applications I am aware of, adapting the way items are sorted is required in any case, so choosing ascending or descending order does not really increase call-site complexity.
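To illustrate the point about ordering: the k largest elements are exactly the k smallest under the reversed comparator, so a single `*_by` method covers both directions. A plain-std sketch (this helper is hypothetical, not the itertools API, and sorts eagerly rather than using a heap or selection):

```rust
use std::cmp::Ordering;

/// Hypothetical illustration: the k smallest elements according to `cmp`,
/// in ascending order with respect to `cmp`. Passing a reversed comparator
/// yields the k largest elements instead.
fn k_smallest_by<T, F: FnMut(&T, &T) -> Ordering>(mut v: Vec<T>, k: usize, cmp: F) -> Vec<T> {
    v.sort_unstable_by(cmp);
    v.truncate(k);
    v
}
```

With `|a, b| b.cmp(a)` as the comparator, this returns the largest elements, which is the sense in which a dedicated `k_largest*` family adds little.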

Philippe-Cholet commented 5 months ago

@phimuemue @jswrenn I don't know if you would agree on the current PR, but having 12 similar methods seems a bit much to me. The `k_smallest` variants are slated for itertools 0.13.0, which is not released yet, so we could still change some things without breaking anything and avoid the 12 methods. We could have:

```rust
iter.top_k(k).smallest() // redirected from the current k_smallest (that we could deprecate)
iter.top_k(k).smallest_by(f)
iter.top_k(k).smallest_by_key(key)
iter.top_k(k).largest()
iter.top_k(k).largest_by(f)
iter.top_k(k).largest_by_key(key)
```
And once MSRV >= 1.49, this PR would introduce:
```rust
iter.top_k(k).relaxed().smallest()
iter.top_k(k).relaxed().smallest_by(f)
iter.top_k(k).relaxed().smallest_by_key(key)
iter.top_k(k).relaxed().largest()
iter.top_k(k).relaxed().largest_by(f)
iter.top_k(k).relaxed().largest_by_key(key)
```
Implementation draft (click to see):

```rust
trait Itertools: Iterator {
    fn top_k(self, k: usize) -> Top<Self> { ... }

    #[deprecated]
    fn k_smallest(self, k: usize) -> ... {
        self.top_k(k).smallest()
    }
}

pub struct GeneralTop<I, S> {
    iter: I,
    k: usize,
    marker: PhantomData<S>,
}

pub type Top<I> = GeneralTop<I, Strict>;
pub type TopRelaxed<I> = GeneralTop<I, Relaxed>;

impl<I> Top<I> {
    pub fn relaxed(self) -> TopRelaxed<I> { ... }
}

impl<I, S> GeneralTop<I, S> {
    pub fn smallest(self) -> ...
    pub fn smallest_by(self, f: F) -> ...
    pub fn smallest_by_key(self, key: F) -> ...
    pub fn largest(self) -> ...
    pub fn largest_by(self, f: F) -> ...
    pub fn largest_by_key(self, key: F) -> ...
}
```

The relaxed version would be added once the MSRV is high enough.

"Minimal allocation" and "linear complexity" are both valuable tradeoffs here, but it would be nice not to multiply methods.

EDIT: There are also the alternatives @adamreichold described above. We could also temporarily withdraw the k_smallest variants from the next release and add them back alongside relaxed versions. That might be better if we want the relaxed versions to be the default and the heap-based versions the alternative.

```rust
iter.top_k(k)             // relaxed
iter.top_k(k).min_alloc() // the currently pushed versions
// both would have the 6 methods (smallest|largest)[_by[_key]]
```
adamreichold commented 5 months ago

Looking at the results of the benchmark added in the second commit, I would summarize them as: no difference for the ascending/best case, a small but consistent lead in the random/average case, and a large improvement for the descending/worst case.

*(benchmark plots omitted)*

phimuemue commented 5 months ago

Hi there, nice algorithm. I am unsure about what would be best to do. The algorithm also raises the question of the scaling factor: could a capacity of 3k or 4k be even more performant?

As for the 12 methods: I think that, if we decide to keep all of them, we should just offer 12 methods directly on `Itertools`. I still feel the same about our `grouping_map` construct: it adds an indirection and, from what I have seen so far, does not help composability or reusability (see our attempts to extend it to custom hashers). Sure, it occupies a lot of space on docs.rs, but I do not see a real problem with that.

Sidenote: What's the unstable/stable behavior of the current implementation?

adamreichold commented 5 months ago

> Sidenote: What's the unstable/stable behavior of the current implementation?

It is unstable, at least because it calls into the unstable selection/sorting routines from std.

adamreichold commented 5 months ago

> And the algorithm begs the question of the scaling factor: Could a capacity of 3k or 4k even be more performant?

This was discussed in the original issue, and I think I agree with the assessment there: further increasing the capacity is not too interesting, as it will only change the constant factor attached to the O(n) part of the runtime, i.e. I don't think it is worth complicating the API by giving this choice to the calling code.
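The constant-factor claim can be made concrete: with a buffer of capacity c·k, each O(c·k) selection pass discards (c − 1)·k elements, so the amortized work per input element scales with c/(c − 1). A tiny illustration of that arithmetic (my own sketch, not code from the PR):

```rust
/// Amortized cost factor per discarded element for a buffer of capacity
/// c * k: each O(c * k) selection pass discards (c - 1) * k elements,
/// giving a per-element factor of c / (c - 1).
fn amortized_factor(c: f64) -> f64 {
    c / (c - 1.0)
}
```

The factor falls from 2.0 at c = 2 to 1.5 at c = 3 and about 1.33 at c = 4: diminishing returns that arguably do not justify exposing the choice in the API.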

Philippe-Cholet commented 5 months ago

Ok about having 12 methods. The scaling factor 2 is fine.

TODO:

  • Close this, OR add it as an alternative, OR replace our heap-based versions with it. Personally I would keep it as an alternative, as both tradeoffs are valuable enough.
  • A bit unsure about the name: should we change "relaxed"? Or should we make the relaxed versions the default and give the heap-based ones the longer name instead? That might be the better choice perf-wise, but doing so would be a breaking change IMHO.
  • Review docs.
  • Wait a bit for MSRV >= 1.49 to merge this.
adamreichold commented 5 months ago

> A bit unsure about the name: change "relaxed"?

Note that I am not at all invested in that name; I made it up on the spot when I had to name the methods somehow without conflicting with the existing ones. I think the only requirement is that the name should somehow reflect the space-time trade-off.

Philippe-Cholet commented 3 months ago

@adamreichold The long wait has ended, thanks again for this!