adamreichold closed this 3 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 94.65%. Comparing base (6814180) to head (88e5a2d). Report is 108 commits behind head on master.
I note that we would have 12 methods... k_(smallest|largest)[_relaxed][_by[_key]] which is a lot!
I agree and if pressed to make a choice, I would argue that the relaxed would be a reasonable general choice, i.e. could completely replace the heap-based versions.
For one, k usually is bounded by small fixed numbers, and I do not really see a lot of situations where I have the memory to compute the top 100 but not enough to invest in the top 200. And since the implementation can work directly with select_nth_unstable_by, the complications of a custom heap implementation could be avoided (see the sketch below). However, making this choice does not appear obviously better than having twelve methods.
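For illustration, a minimal sketch of such a relaxed implementation built on select_nth_unstable_by; the names and structure here are assumptions for illustration, simplified relative to the actual PR:

```rust
use std::cmp::Ordering;

// Sketch: keep at most 2k candidates; whenever the buffer fills up, partition
// it with `select_nth_unstable_by` so the k smallest land in the front half,
// then truncate. Each compaction costs O(k) and frees k slots, giving O(n)
// total time for n input elements.
fn k_smallest_relaxed_sketch<T, I, F>(iter: I, k: usize, mut cmp: F) -> Vec<T>
where
    I: Iterator<Item = T>,
    F: FnMut(&T, &T) -> Ordering,
{
    if k == 0 {
        return Vec::new();
    }
    let mut buf = Vec::with_capacity(2 * k);
    for item in iter {
        buf.push(item);
        if buf.len() == 2 * k {
            // The k smallest elements end up in buf[..k]; drop the rest.
            buf.select_nth_unstable_by(k - 1, &mut cmp);
            buf.truncate(k);
        }
    }
    // At most 2k - 1 elements remain; sort and keep the k smallest.
    buf.sort_unstable_by(cmp);
    buf.truncate(k);
    buf
}
```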
Alternatively, the combinatorics could be reined in by adding a "strategy-like" type parameter, e.g. Strict and Relaxed, to indicate the desired memory bound.
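A purely hypothetical sketch of what such a strategy parameter could look like; none of these names exist in itertools:

```rust
use std::cmp::Ordering;

// Hypothetical only: a trait capturing the memory/time trade-off so that a
// single generic entry point could replace the strict/relaxed method pairs.
// (Implementations omitted in this sketch.)
trait MemoryBound {
    fn k_smallest_by<T, F>(items: Vec<T>, k: usize, cmp: F) -> Vec<T>
    where
        F: FnMut(&T, &T) -> Ordering;
}

struct Strict;  // heap-based: at most k extra elements, O(n log k) time
struct Relaxed; // select-based: at most 2k extra elements, O(n) time

// Call sites would then pick the trade-off explicitly, e.g. with a
// hypothetical `iter.k_smallest_with::<Relaxed>(k)`.
```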
Finally, I am not sure the k_largest* methods really carry their weight now that the k_smallest_by and k_smallest_by_key methods exist: in the applications of these methods that I am aware of, adapting the way items are sorted is required in any case, and hence choosing ascending or descending order does not really increase call site complexity.
@phimuemue @jswrenn I don't know if you would agree on the current PR, but having 12 similar methods seems a bit much to me. The k_smallest variants are in itertools 0.13.0, which is not released yet, so we could still change some things without breaking and avoid the 12 methods. We could have
iter.top_k(k).smallest() // redirected from the current k_smallest (that we could deprecate)
iter.top_k(k).smallest_by(f)
iter.top_k(k).smallest_by_key(key)
iter.top_k(k).largest()
iter.top_k(k).largest_by(f)
iter.top_k(k).largest_by_key(key)
And once MSRV >= 1.49, this PR would introduce:
iter.top_k(k).relaxed().smallest()
iter.top_k(k).relaxed().smallest_by(f)
iter.top_k(k).relaxed().smallest_by_key(key)
iter.top_k(k).relaxed().largest()
iter.top_k(k).relaxed().largest_by(f)
iter.top_k(k).relaxed().largest_by_key(key)
"Minimal allocation" vs "linear complexity" are both valuable tradeoffs here, but maybe it would be nice to not multiply methods.
EDIT: There are also the alternatives @adamreichold described above.
We could also temporarily withdraw the k_smallest variants from the next release and add them back together with relaxed versions. That might be better if we want the relaxed versions to be the default ones and the heap versions the alternative.
iter.top_k(k) // relaxed
iter.top_k(k).min_alloc() // the currently pushed versions
// both would have the 6 methods (smallest|largest)[_by[_key]]
Looking at the results of the benchmark pushed as part of the second commit, I would summarize them as: no difference for the ascending/best case, a small but consistent lead in the random/average case, and a large improvement for the descending/worst case.
Hi there, nice algorithm. I am unsure about what to do best. And the algorithm raises the question of the scaling factor: could a capacity of 3k or 4k be even more performant?
As for the 12 methods: I think that, if we decide to keep all of them, we should just offer 12 methods right on Itertools. I still feel the same about our grouping_map thing: it adds an indirection and, from what I've seen so far, does not help with composability or reusability (see our attempts to extend it to custom hashers). Sure, it occupies lots of space on docs.rs, but I do not see a real problem with that.
Sidenote: What's the unstable/stable behavior of the current implementation?
> Sidenote: What's the unstable/stable behavior of the current implementation?
It is unstable, at least because it calls into the unstable selection/sorting routines from std.
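A small standard-library example of what unstable means here, i.e. elements that compare equal are not guaranteed to keep their input order:

```rust
fn main() {
    // After selection, the two elements with key 0 occupy the front, but
    // whether 'x' comes before 'y' is unspecified: the relative order of
    // equal elements is not preserved.
    let mut v = [(1, 'a'), (0, 'x'), (1, 'b'), (0, 'y')];
    v.select_nth_unstable_by_key(1, |&(key, _)| key);
    assert!(v[..2].iter().all(|&(key, _)| key == 0));
    println!("{v:?}");
}
```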
> And the algorithm raises the question of the scaling factor: could a capacity of 3k or 4k be even more performant?
This was discussed in the original issue, and I think I agree with the assessment there: further increasing the capacity is not too interesting, as it will only change the constant factor attached to the O(n) part of the runtime, i.e. I don't think it is worth complicating the API by giving this choice to the calling code.
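(A back-of-the-envelope argument: with a buffer of c·k elements, each compaction via selection costs O(c·k) and frees (c-1)·k slots, so the amortized work per input element is proportional to c/(c-1), i.e. 2 for c = 2, 1.5 for c = 3, 1.33 for c = 4. The returns diminish quickly while the memory overhead grows linearly in c.)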
Ok about having 12 methods. The scaling factor 2 is fine.
TODO:
A bit unsure about the name: change "relaxed"?
Note that I am very much not invested in that name and made it up on the spot when I had to name the methods somehow to avoid conflicting with the existing ones. I think the only requirement is that the name should somehow reflect the space-time trade-off.
@adamreichold The long wait has ended, thanks again for this!
This implements the algorithm described in the original issue, which consumes twice as much memory as the existing k_smallest algorithm but achieves linear time in the number of elements in the input.

I expect this to fail the MSRV job, as select_nth_unstable was stabilized in 1.49 and is therefore not available in 1.43.1. I decided to propose it anyway, for the time when an MSRV increase is planned, or in case it is deemed sufficiently useful to be a reason for one.
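A usage sketch, assuming the method lands under the k_smallest_relaxed name discussed above and, like k_smallest, yields its results in ascending order:

```rust
use itertools::Itertools;

fn main() {
    let data = vec![5, 1, 9, 3, 7, 2, 8];
    // Same result as k_smallest, but O(n) time at the cost of up to 2k memory.
    let three: Vec<_> = data.into_iter().k_smallest_relaxed(3).collect();
    assert_eq!(three, vec![1, 2, 3]);
}
```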