rust-itertools / itertools

Extra iterator adaptors, iterator methods, free functions, and macros.
https://docs.rs/itertools/
Apache License 2.0

Adding _by, by_key, largest variants of k_smallest #654

Closed: ejmount closed this PR 7 months ago

ejmount commented 2 years ago

This PR resolves #586 by adding the following public functions

k_smallest_by
k_smallest_by_key
k_largest
k_largest_by
k_largest_by_key

I used a custom binary heap to implement them (and to rewrite k_smallest) so that the logic could be agnostic about where the memory comes from. This is because my original motivation for this PR was implementing k_smallest for const sizes so as not to require an allocator. Using a custom heap implementation makes that extension straightforward, but I've not included it just now to keep this first PR more manageable. Unfortunately, this requires an MSRV of 1.36, although the clippy config seems to indicate that's already the case.

(The heap type is not publicly exposed to crate users; it is only an implementation detail.)

The quickcheck test for k_smallest is completely untouched and still passes, so I am confident the heap logic is correct. I've also added doctests to some, but not all, of the new functions, and these also pass. Let me know if you feel more documentation (internally or externally facing) would be useful.

phimuemue commented 1 year ago

Hi there! Thanks for tackling this - I'm happy you went for it.

I hope I don't miss the forest for the trees, but I am pondering whether unsafe is really needed here. It seems that it is used here because MaxHeap manually tracks memory via storage: &mut [MaybeUninit<T>] and len: usize. But wouldn't a Vec be perfectly fine for this? (Rust's own BinaryHeap uses Vec, too.)

Moreover I think that we could benefit from inlining some functions:

ejmount commented 1 year ago

I hope I don't miss the forest for the trees, but I am pondering whether unsafe is really needed here. It seems that it is used here because MaxHeap manually tracks memory via storage: &mut [MaybeUninit<T>] and len: usize. But wouldn't a Vec be perfectly fine for this? (Rust's own BinaryHeap uses Vec, too.)

It would do for the current implementation, but in the future I wanted to be able to do something like:

fn k_smallest_static<'a, T, I, F, const K: usize>(iter: I, order: F) -> impl Iterator<Item = T>
where
    T: 'a,
    I: Iterator<Item = T>,
    F: Fn(&T, &T) -> std::cmp::Ordering,
{
    let mut storage: [MaybeUninit<T>; K] = /* ... */;
    let num_elements = capped_heapsort(iter, &mut storage, order);
    storage
        .into_iter()
        .take(num_elements)
        .map(|e| unsafe { e.assume_init() })
}

and have it be usable without any dependency on alloc.

Moreover I think that we could benefit from inlining some functions:

I generally try to write code that's robust against the caller making mistakes, although I'd agree with you that this has created overhead here, so I've changed the flow of creating and consuming the heap, with the main differences being:

phimuemue commented 1 year ago

Hi, thanks for the feedback.

It would do for the current implementation, but in the future I wanted to be able to do something like:

fn k_smallest_static<'a, T, I, F, const K: usize>(iter: I, order: F) -> impl Iterator<Item = T>

Noble motives, I like that (pretty much, actually). However, I'd still argue for simply going with Vec for now, because if we want to support static variants of our algorithms, we probably also want other algorithms to work with Vec or (essentially) ArrayVec (e.g. min_set and its cousins). In that case, it's probably much easier to go through the codebase, look for Vec, and replace those uses with a trait VecOrArrayVec, instead of tying the static/dynamic decision to one particular algorithm (in this case: k_smallest). That is, I'd like to move the memory management issues out of the algorithms and into the containers used by the algorithms, and make the algorithms accept different containers. @jswrenn Do you have thoughts about this?

Then, I'd imagine the algorithm being unsafe-free and looking something like:

On top of that, I'd suggest more inlining:

ejmount commented 1 year ago

Noble motives, I like that (pretty much, actually). However, I'd still argue for simply going with Vec for now, because if we want to support static variants of our algorithms, we probably also want other algorithms to work with Vec or (essentially) ArrayVec (e.g. min_set and its cousins). In that case, it's probably much easier to go through the codebase, look for Vec, and replace those uses with a trait VecOrArrayVec, instead of tying the static/dynamic decision to one particular algorithm (in this case: k_smallest). That is, I'd like to move the memory management issues out of the algorithms and into the containers used by the algorithms, and make the algorithms accept different containers.

I think this is a much better strategy than what I had in mind, so I've restructured the heap logic to use a plain Vec for now with an eye to replacing it with a trait in the future. That also neatly eliminated all of the unsafe calls.

if the iterator is already exhausted, sort the container using a sorting function (possibly avoiding heapify if it helps speed)

While I don't have benchmarks, I can't imagine many circumstances where heapify would be slower, since it works in O(k) time rather than the O(k*ln k) required for a sort.

  • capped_heapsort: It internally checks storage.is_empty, which could be done in the caller (avoiding the call entirely) by testing k == 0. Also, for someone new, the meaning of the range returned from capped_heapsort is probably still non-obvious (in addition to its start, which is always 0, being ignored at the single call site).

Most of the reason for placing that check in capped_heapsort rather than earlier was that I expected that function to also have a statically sized caller, which is no longer needed. In the new version, capped_heapsort has disappeared entirely into its caller, and being able to rely on truncate means there would be nothing to return even if it still existed.

  • from_iter: As it stands now, it seems to rely on, e.g., heap.len being 0 - something that would be obvious if it were inlined.

I'm afraid I don't follow - from_iter knows that the heap begins empty because it is created there in the first place. This is hopefully clearer in the new version that uses collect as you were suggesting earlier.

  • push_pop: Calls e.g. self.get(0), which lifts values into Option. However, push_pop is only called for non-empty storage, which would be more obvious if it were inlined.
  • To add to the previous point: Having compare go through Options might be a good idea, but I was wondering whether we could get rid of the detour through Option entirely (without too much hassle).

compare goes through Option primarily for the benefit of sift_down, where the difference between an element not existing and it already being in the right order is immaterial. It also has the benefit that push_pop works on empty storage, but as you say, that never happens in practice, so I'm not sure it's worth drawing attention to.
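
For readers following along, here is a rough sketch of the overall approach under discussion: a safe, Vec-backed bounded max-heap that keeps the k smallest items seen so far and replaces the current maximum whenever a smaller item arrives. The names are made up for illustration; this is not the PR's actual code.

use std::cmp::Ordering;

fn k_smallest_by_sketch<T, I, F>(mut iter: I, k: usize, mut cmp: F) -> Vec<T>
where
    I: Iterator<Item = T>,
    F: FnMut(&T, &T) -> Ordering,
{
    if k == 0 {
        return Vec::new();
    }
    // Fill the buffer with (up to) the first k elements and heapify it in O(k).
    let mut heap: Vec<T> = iter.by_ref().take(k).collect();
    for i in (0..heap.len() / 2).rev() {
        sift_down(&mut heap, i, &mut cmp);
    }
    // For every remaining element, replace the root (the current maximum)
    // if the new element is smaller, then restore the heap property.
    for item in iter {
        if cmp(&item, &heap[0]) == Ordering::Less {
            heap[0] = item;
            sift_down(&mut heap, 0, &mut cmp);
        }
    }
    // Return the survivors in ascending order.
    heap.sort_unstable_by(|a, b| cmp(a, b));
    heap
}

// Standard sift-down on a slice treated as a binary max-heap under `cmp`.
fn sift_down<T, F: FnMut(&T, &T) -> Ordering>(heap: &mut [T], mut pos: usize, cmp: &mut F) {
    loop {
        let (left, right) = (2 * pos + 1, 2 * pos + 2);
        let mut largest = pos;
        if left < heap.len() && cmp(&heap[left], &heap[largest]) == Ordering::Greater {
            largest = left;
        }
        if right < heap.len() && cmp(&heap[right], &heap[largest]) == Ordering::Greater {
            largest = right;
        }
        if largest == pos {
            return;
        }
        heap.swap(pos, largest);
        pos = largest;
    }
}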

phimuemue commented 1 year ago

Probably this should also include tests (maybe via quickcheck, comparing k_smallest... against collecting into a vector, sorting, and then picking the first/last k elements).

ejmount commented 1 year ago

I've been making updates to this as time allows, and I've just pushed the following changes:

I've not added any new tests, but will see about doing so when I get more time.

Re: keyed sorts, I'm afraid I don't quite understand the part of your comment about the Vec<T> being fine. Just to make sure we're all on the same page, my understanding is that there are two ways to do this with a single allocation:

  1. time-efficient (each key evaluated only once):
    fn k_smallest_by_key<T, I, F, K>(iter: I, k: usize, key: F) -> impl Iterator<Item=T>
    where
        I: Iterator<Item = T>,
        F: Fn(&T) -> K,
        K: Ord,
    {
        let iter = iter.map(|v| Pair(key(&v), v));
        k_smallest_general(iter, k, Ord::cmp).map(|Pair(_, t)| t)
    }
  2. space-efficient (keys recomputed during comparisons):
    fn k_smallest_by_key<T, I, F, K>(iter: I, k: usize, key: F) -> VecIntoIter<T>
    where
        I: Iterator<Item = T>,
        F: Fn(&T) -> K,
        K: Ord,
    {
        k_smallest_general(iter, k, |a, b| key(a).cmp(&key(b)))
    }

There is, AFAIK, no way to do this time-efficiently using only a Vec<T>.

If we only have one of them, my inclination would be that it should be the first one, because I expect the use cases where callers care about the extra space required are rare enough that it wouldn't be a burden to use the second version explicitly. I also don't see the return type change being too much of a problem, because I expect that almost all uses of the function will want to iterate over the results rather than collect them. (And k is unlikely to be large enough for collecting to be a major problem even if you did want to.)

Additionally, in the standard library, sort_by_cached_key came much later than sort_by_key, which (given the amount of overhead involved in adding new features) suggests to me that sort_by_key not caching was not a design choice that was considered and rejected, but rather one that was missed entirely.
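
As a small standalone illustration of that std precedent (not part of the PR), sort_by_key recomputes the key on every comparison, while sort_by_cached_key trades a temporary buffer for computing each key only once, which mirrors the trade-off between the two options above:

fn main() {
    let mut words = vec!["itertools", "heap", "k_smallest"];
    // Key recomputed on every comparison; no extra allocation.
    words.sort_by_key(|w| w.len());
    // Each key computed once and cached in a temporary buffer.
    words.sort_by_cached_key(|w| w.len());
    assert_eq!(words, ["heap", "itertools", "k_smallest"]);
}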

However, the call is up to you, and I'd appreciate if you could specify whether you want the first, second or both options included.

phimuemue commented 1 year ago

Re: keyed sorts, I'm afraid I don't quite understand the part of your comment about the Vec being fine. Just to make sure we're all on the same page, my understanding is that there are two ways to do this with a single allocation:

Well described. Please go for the space efficient solution.

ejmount commented 1 year ago

Done. With regard to tests, would you be able to be more specific about what you think should be covered by quickcheck? I assume you don't mean all six variations separately, since they're mostly implemented in terms of each other.

phimuemue commented 1 year ago

Maybe you can look at what's already there and extend it. If there are no quickcheck tests yet, you could e.g. collect into a vector, sort it, and take the first/last k items - and assert that the result is the same as if you'd used k_smallest (or one of its cousins).

I'd appreciate if you'd write tests for each variation. You're right, they are implemented in terms of each other, but the tests shouldn't assume this.
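
A minimal sketch of such a property test (illustrative only; it assumes the quickcheck macro as used elsewhere in the crate's test suite, and the test name is made up):

use itertools::Itertools;
use quickcheck::quickcheck;

quickcheck! {
    // The k smallest elements, in ascending order, must match what we get by
    // fully sorting the input and taking the first k items.
    fn k_smallest_matches_full_sort(input: Vec<i32>, k: usize) -> bool {
        let k = k % 16; // keep k small so the test stays cheap
        let expected: Vec<i32> = input.iter().copied().sorted().take(k).collect();
        let actual: Vec<i32> = input.iter().copied().k_smallest(k).collect();
        expected == actual
    }
}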

ejmount commented 1 year ago

I've just added a quickcheck that tests all 6 variations against fully sorting the input data; let me know if you think anything else should be added.

ejmount commented 1 year ago

The build failed earlier because of an accidental reference to std::vec::Vec, which is now corrected.

TheDudeFromCI commented 1 year ago

What else is this PR waiting on?

phimuemue commented 1 year ago

Hi there, first of all: Thanks @ejmount for staying in touch.

I saw that you implement k_smallest in terms of your custom heap (i.e. it delegates to k_smallest_by). However, this potentially degrades performance, as Rust's BinaryHeap uses "holes" internally to speed up its sift operations.

Writing one's own binary heap turns out to be tricky, so I'd lean toward going with a safe implementation for now, sacrificing the holes for the more general variants, but possibly stating this in the documentation. We can then try later to write a safe custom heap.
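
For illustration, the plain Ord-based k_smallest could keep delegating to std's BinaryHeap (and so keep its hole-based sifting) with something like the following sketch; the function name is made up and this is not the PR's code:

use std::collections::BinaryHeap;

fn k_smallest_via_std<T: Ord, I: Iterator<Item = T>>(mut iter: I, k: usize) -> Vec<T> {
    if k == 0 {
        return Vec::new();
    }
    // Max-heap holding the k smallest elements seen so far.
    let mut heap: BinaryHeap<T> = iter.by_ref().take(k).collect();
    for item in iter {
        // peek_mut exposes the current maximum; if we overwrite it, the heap
        // sifts the new value down when the guard is dropped.
        let mut top = heap.peek_mut().expect("k > 0, so the heap is non-empty here");
        if item < *top {
            *top = item;
        }
    }
    heap.into_sorted_vec()
}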

That is, before merging I'd suggest:

Before continuing on this, I'd ask @jswrenn and/or @scottmcm for their opinion on this.

ejmount commented 1 year ago

This came back to my attention recently, and I've made the changes you suggested for k_smallest (although not k_largest, it having no existing callers), but I couldn't see any particularly good way to share functionality between that and the other variants, given that k_smallest_general is very self-contained.

AFAICT, the code as it stands is identical, in both functionality and performance, from the POV of any existing users. Assuming my update just now addresses your concerns in the previous post, are there any other blockers for this being merged? An earlier comment is still flagged as outstanding, but I don't think it's still applicable.

douweschulte commented 8 months ago

Great work! I was just looking for this functionality, as I need k_largest_by for an algorithm I am working on. I see the PR has become a bit stale; is there something I can do to help it over the finish line?

Philippe-Cholet commented 8 months ago

@douweschulte I don't know the details about this one. It's definitely something I'd like to see merged at some point though. @phimuemue Can you shed some light on those extensions of k_smallest?

Philippe-Cholet commented 8 months ago

I would like to revive this (most-loved) PR, and update it myself if needed. Based on https://github.com/rust-itertools/itertools/pull/654#issuecomment-1349675510:

@jswrenn and/or @scottmcm What is your opinion on this?

douweschulte commented 8 months ago

I have some additional info now as well. Since the above comment, I have copy-pasted this PR's code into a couple of my own projects. I have only used the k_largest function, but it has worked wonders!

Philippe-Cholet commented 8 months ago

@phimuemue I assigned myself here, so I agree to take this off your hands. I also agree with making the code more compact. (About Vec, see the other comment.)

@ejmount Hello, I was not sure you would still be around, so it's nice to see you here. I'll be around as well, and this is definitely top priority for me. Since the time you worked on this, the repo has finally been rustfmt'ed (and CI rejects compiler and clippy warnings), so I think the current conflicts are just about formatting. Personally, I would first squash everything and rebase on master to start clean. Or we can end with that, as you wish. The main benefit would be being able to run the CI.

ejmount commented 7 months ago

Hi @Philippe-Cholet

Since the time you worked on this, the repo has finally been rustfmt'ed (and CI rejects compiler and clippy warnings), so I think the current conflicts are just about formatting. Personally, I would first squash everything and rebase on master to start clean. Or we can end with that, as you wish.

I've rebased all of this branch on top of the latest master, so that should clear up any formatting conflicts. (And it's very handy that I don't have to be careful about accidentally checking in unrelated formatting, since I run fmt by default.)

I've also done my best to address all the review comments you and @phimuemue have left; let me know if anything still needs to be changed. I'm not sure GitHub understands what just happened to the history, but here is the combined diff since phimuemue's last comment, for convenience.

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests.

Project coverage is 94.61%. Comparing base (6814180) to head (2cd44fb). Report is 18 commits behind head on master.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #654      +/-   ##
==========================================
+ Coverage   94.38%   94.61%   +0.22%
==========================================
  Files          48       48
  Lines        6665     6701      +36
==========================================
+ Hits         6291     6340      +49
+ Misses        374      361      -13
```


ejmount commented 7 months ago

GitHub seems to want @phimuemue's original comment checked off before this can land; I'm not sure there's anything I can do about that on my end, given that all their comments have been resolved in the meantime.

Philippe-Cholet commented 7 months ago

@phimuemue Both prior to this PR and now, k_smallest returns VecIntoIter (changing that would be a breaking change), so for me the variants should do the same. And if the user collects to a vector immediately, then there is no reallocation, so that's not an issue. It's true that different return types for such similar things would be a bit inconsistent.

EDIT: sorted (& Co.) also returns VecIntoIter; min_set (& Co.), which return vectors, are the exception, not the rule.