Closed: ejmount closed this 7 months ago
Hi there! Thanks for tackling this - I'm happy you went for it.

I hope I don't miss the forest for the trees, but I am pondering whether `unsafe` is really needed here. It seems that it is used here because `MaxHeap` manually tracks memory via `storage: &mut [MaybeUninit<T>]` and `len: usize`. But wouldn't a `Vec` be perfectly fine for this? (Rust's own `BinaryHeap` uses `Vec`, too.)

Moreover, I think that we could benefit from inlining some functions:

- `push_pop`: It checks `self.storage.is_empty` internally, but I think we could use an early out for this case at the call site.
- `fn extend<T: IntoIterator<Item = A>>(&mut self, iter: T)`: I don't think users benefit if we implement this trait for an internal data structure. It deals with things that are not used in practice (such as respecting `self.len`, although the only call to it ensures that `self.len == 0`), and thus makes things a bit more complex than necessary.
- `total_sort`/`capped_heapsort`: They transport information about internal data structures, and as it stands now, one has to follow function calls to find out what these values actually mean. If everything were local, this would be a bit easier to grasp.

> I hope I don't miss the forest for the trees, but I am pondering whether `unsafe` is really needed here. It seems that it is used here because `MaxHeap` manually tracks memory via `storage: &mut [MaybeUninit<T>]` and `len: usize`. But wouldn't a `Vec` be perfectly fine for this? (Rust's own `BinaryHeap` uses `Vec`, too.)
It would do for the current implementation, but in the future I wanted to be able to do something like,
```rust
fn k_smallest_static<'a, T, I, F, const K: usize>(iter: I, order: F) -> impl Iterator<Item = T>
where
    T: 'a,
    I: Iterator<Item = T>,
    F: Fn(&T, &T) -> std::cmp::Ordering,
{
    let mut storage: [MaybeUninit<T>; K] = /* ... */;
    let num_elements = capped_heapsort(iter, &mut storage, order);
    storage
        .into_iter()
        .take(num_elements)
        .map(|e| unsafe { e.assume_init() })
}
```
and have it be usable without any dependency on `alloc`.
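(As a side note, purely illustrative and not part of this PR: an alloc-free variant can also be written without `unsafe` by trading `MaybeUninit` storage for `[Option<T>; K]` and a sorted-insertion scan, at the cost of one discriminant per slot and O(n * K) work instead of O(n * log K). All names here are hypothetical.)

```rust
use std::array;

// Hypothetical sketch only: keeps the K smallest items in a fixed, sorted
// array of Options, so it needs neither `alloc` nor `unsafe`.
fn k_smallest_static<T: Ord, const K: usize>(iter: impl IntoIterator<Item = T>) -> [Option<T>; K] {
    let mut best: [Option<T>; K] = array::from_fn(|_| None);
    for item in iter {
        let mut candidate = Some(item);
        for slot in best.iter_mut() {
            // An empty slot, or a slot holding something larger than the
            // candidate, takes the candidate; the displaced value (if any)
            // keeps shifting to the right.
            let take = match (&candidate, &*slot) {
                (Some(c), Some(existing)) => c < existing,
                (Some(_), None) => true,
                (None, _) => false,
            };
            if take {
                std::mem::swap(slot, &mut candidate);
            }
        }
        // Anything still held in `candidate` was larger than all K kept values.
    }
    best
}
```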
> Moreover I think that we could benefit from inlining some functions:
I generally try to write code that's robust against the caller making mistakes, although I'd agree with you that that has created overhead here, so I've changed the flow of creating and consuming the heap, with the main differences being:

- `capped_heapsort` short-circuits on zero-length storage, so all the checks for that case disappear from the heap itself
- `extend` logic is merged into the construction process, with no trait. This also means it is (locally) obvious that the heap will start empty, so the adjustments for `self.len` disappear too.
- `total_sort` is inlined into `capped_heapsort`, which now returns a (local) `Range` for better context of what the return value signifies

Hi, thanks for the feedback.

> It would do for the current implementation, but in the future I wanted to be able to do something like,
>
> `fn k_smallest_static<'a, T, I, F, const K: usize>(iter: I, order: F) -> impl Iterator<Item = T>`
Noble motives, I like that (pretty much, actually). However, I'd still argue to simply go with `Vec` for now, because if we want to support `static` variants of our algorithms, we probably also want other algorithms to work with `Vec` or (essentially) `ArrayVec` (e.g. `min_set` and its cousins). And in that case, it's probably much easier to go through the codebase, look for `Vec`, and replace those with a trait `VecOrArrayVec` instead of tying the static/dynamic decision to one particular algorithm (in this case: `k_smallest`). That is, I'd like to move the memory-management issues out of the algorithms and into the containers used by the algorithms, and make the algorithms accept different containers. @jswrenn Do you have thoughts about this?
Then, I'd imagine the algorithm `unsafe`-free and something like:

- `let container = iter.by_ref().take(k).collect()`
- if the iterator is already exhausted, sort `container` using a sorting function (possibly avoiding heapify if it helps speed)
- `push_pop` the remaining elements, and sort afterwards (possibly exploiting the heap structure if it helps speed). (In this case, we know that all elements in the container are initialized.)

On top, I'd suggest more inlining:
- `capped_heapsort`: It internally checks for `storage.is_empty`, which could be done in the caller (avoiding the call entirely) by testing `k == 0`. Also, for someone new, the meaning of the `range` returned from `capped_heapsort` is probably still non-obvious (in addition to its beginning, which is always 0, being ignored at the single call site).
- `from_iter`: As it stands now, it seems to rely e.g. on `heap.len` being `0` - something that would be obvious if it were inlined.
- `push_pop`: Calls e.g. `self.get(0)`, which lifts values into `Option`. However, `push_pop` is only called for non-empty storage, which would be more obvious if it were inlined.
- To add to the previous point: Having a `compare` going through `Option`s might be a good idea, but I was wondering if we could get rid of the detour through `Option` entirely (without too much hassle).

> Noble motives, I like that (pretty much, actually). However, I'd still argue to simply go with `Vec` for now, because if we want to support `static` variants of our algorithms, we probably also want other algorithms to work with `Vec` or (essentially) `ArrayVec` (e.g. `min_set` and its cousins). And in that case, it's probably much easier to go through the codebase, look for `Vec`, and replace those with a trait `VecOrArrayVec` instead of tying the static/dynamic decision to one particular algorithm (in this case: `k_smallest`). That is, I'd like to move the memory-management issues out of the algorithms and into the containers used by the algorithms, and make the algorithms accept different containers.
I think this is a much better strategy than what I had in mind, so I've restructured the heap logic to use a plain `Vec` for now, with an eye to replacing it with a trait in the future. That also neatly eliminated all of the `unsafe` calls.
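For concreteness, the reshaped safe flow might look roughly like the following sketch (function and helper names here are illustrative, not the PR's actual code): collect up to k elements into a `Vec`, build a max-heap with Floyd's bottom-up heapify, `push_pop` the remaining elements, then sort ascending.

```rust
use std::cmp::Ordering;

// Illustrative sketch of a Vec-based, unsafe-free "k smallest" flow.
fn k_smallest_by<T, I, F>(iter: I, k: usize, mut cmp: F) -> Vec<T>
where
    I: IntoIterator<Item = T>,
    F: FnMut(&T, &T) -> Ordering,
{
    if k == 0 {
        return Vec::new();
    }
    let mut iter = iter.into_iter();
    let mut heap: Vec<T> = iter.by_ref().take(k).collect();
    // Build a max-heap: sift down every non-leaf node (Floyd's method, O(k)).
    for i in (0..heap.len() / 2).rev() {
        sift_down(&mut heap, i, &mut cmp);
    }
    // For each remaining element, replace the root if the new element is smaller.
    for x in iter {
        if cmp(&x, &heap[0]) == Ordering::Less {
            heap[0] = x;
            sift_down(&mut heap, 0, &mut cmp);
        }
    }
    heap.sort_by(|a, b| cmp(a, b));
    heap
}

// Iterative sift_down on a slice: move the element at `pos` down until the
// max-heap property holds again. No Options involved.
fn sift_down<T, F>(heap: &mut [T], mut pos: usize, cmp: &mut F)
where
    F: FnMut(&T, &T) -> Ordering,
{
    loop {
        let left = 2 * pos + 1;
        if left >= heap.len() {
            return;
        }
        let right = left + 1;
        // Pick the larger child.
        let child = if right < heap.len() && cmp(&heap[left], &heap[right]) == Ordering::Less {
            right
        } else {
            left
        };
        if cmp(&heap[pos], &heap[child]) == Ordering::Less {
            heap.swap(pos, child);
            pos = child;
        } else {
            return;
        }
    }
}
```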
> if iterator already exhausted, sort `container` using a sorting function (possibly avoiding heapify if it helps speed)
While I don't have benchmarks, I can't imagine many circumstances where heapify would be slower, since it works in `O(k)` time rather than the `O(k * ln k)` required for a sort.
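(That intuition can be checked empirically. The sketch below, which is illustrative instrumentation and not part of the PR, counts comparisons through a thread-local counter: `BinaryHeap::from`, which std documents as an O(n) heapify, performs far fewer comparisons on ~1000 shuffled elements than a full `sort`.)

```rust
use std::cell::Cell;
use std::cmp::Ordering;

thread_local! {
    // Comparison counter; illustrative instrumentation only.
    static COMPARISONS: Cell<u64> = Cell::new(0);
}

// A wrapper whose every comparison bumps the counter.
#[derive(PartialEq, Eq)]
struct Counted(u32);

impl Ord for Counted {
    fn cmp(&self, other: &Self) -> Ordering {
        COMPARISONS.with(|c| c.set(c.get() + 1));
        self.0.cmp(&other.0)
    }
}
impl PartialOrd for Counted {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

// Run `f` and report how many comparisons it performed.
fn comparisons_in(f: impl FnOnce()) -> u64 {
    COMPARISONS.with(|c| c.set(0));
    f();
    COMPARISONS.with(|c| c.get())
}

// Deterministic pseudo-random data (simple LCG), no external crates.
fn scrambled(n: u32) -> Vec<Counted> {
    let mut state: u32 = 12345;
    (0..n)
        .map(|_| {
            state = state.wrapping_mul(1664525).wrapping_add(1013904223);
            Counted(state)
        })
        .collect()
}
```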
> `capped_heapsort`: It internally checks for `storage.is_empty`, which could be done in the caller (avoiding the call entirely) by testing `k == 0`. Also, for someone new, the meaning of the `range` returned from `capped_heapsort` is probably still non-obvious (in addition to its beginning, which is always 0, being ignored at the single call site).
Most of the reason for that check being placed in `capped_heapsort` rather than earlier was that I was expecting that function to have a statically sized caller as well, which is no longer needed. In the new version, `capped_heapsort` has disappeared entirely into the caller, and being able to rely on `truncate` means there's nothing to return even if it did still exist.
> `from_iter`: As it stands now, it seems to rely e.g. on `heap.len` being `0` - something that would be obvious if it were inlined.
I'm afraid I don't follow - `from_iter` knows that the heap begins empty because it is created there in the first place. This is hopefully clearer in the new version that uses `collect`, as you were suggesting earlier.
> `push_pop`: Calls e.g. `self.get(0)`, which lifts values into `Option`. However, `push_pop` is only called for non-empty storage, which would be more obvious if it were inlined.
>
> To add to the previous point: Having a `compare` going through `Option`s might be a good idea, but I was wondering if we could get rid of the detour through `Option` entirely (without too much hassle).
`compare` goes through `Option` primarily for the benefit of `sift_down`, where the difference between elements not existing and already being in the right order is immaterial. It does have the benefit that `push_pop` works on empty storage, but as you say, that never happens in practice, so I'm not sure it's worth bringing attention to.
Probably this should also include tests (maybe via quickcheck), comparing `k_smallest...` against collecting into a vector, sorting, and then picking the first/last `k` elements.
I've been making updates to this as time allows, and I've just pushed the following changes:

- `sift_down` is now iterative
- `Option`s have been removed and all the bounds-checking logic is now explicit
- `len` actually disappeared entirely because it's part of the slice being passed.

I've not added any new tests, but will see about doing so when I get more time.
Re: keyed sorts, I'm afraid I don't quite understand the part of your comment about the `Vec<T>` being fine. Just to make sure we're all on the same page, my understanding is that there are two ways to do this with a single allocation:
```rust
fn k_smallest_by_key<T, I, F, K>(iter: I, k: usize, key: F) -> impl Iterator<Item = T>
where
    I: Iterator<Item = T>,
    F: Fn(&T) -> K,
    K: Ord,
{
    let iter = iter.map(|v| Pair(key(&v), v));
    k_smallest_general(iter, k, Ord::cmp).map(|Pair(_, t)| t)
}
```
```rust
fn k_smallest_by_key<T, I, F, K>(iter: I, k: usize, key: F) -> VecIntoIter<T>
where
    I: Iterator<Item = T>,
    F: Fn(&T) -> K,
    K: Ord,
{
    // Recomputes the key on every comparison (the comparator receives
    // references, and `cmp` takes its argument by reference).
    k_smallest_general(iter, k, |a, b| key(a).cmp(&key(b)))
}
```
There is, AFAIK, no way to do this time-efficiently using only a `Vec<T>`. If we only have one of them, my inclination would be that it should be the first one, because I expect the use cases where callers care about the extra space required would be rare enough that it wouldn't be a burden to use the second version explicitly. I also don't see the return-type change being too much of a problem, because I expect that almost all uses of the function will want to iterate over the results rather than collect them. (And `k` is unlikely to be large enough that it's a major problem even if you did want to collect them.)
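(For reference, `Pair` isn't defined anywhere in this thread; a minimal sketch of what the first option presumably assumes - a wrapper ordered by the cached key alone, so the value type needs no `Ord` of its own - might look like the following. The demo function name is hypothetical.)

```rust
use std::cmp::Ordering;

// Hypothetical key/value wrapper: compares by the first (key) field only.
struct Pair<K, T>(K, T);

impl<K: Ord, T> PartialEq for Pair<K, T> {
    fn eq(&self, other: &Self) -> bool {
        self.0 == other.0
    }
}
impl<K: Ord, T> Eq for Pair<K, T> {}
impl<K: Ord, T> PartialOrd for Pair<K, T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl<K: Ord, T> Ord for Pair<K, T> {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.cmp(&other.0)
    }
}

// Caching keys up front: each key is computed exactly once, at the cost of
// storing one K per element.
fn sort_by_cached_key_demo<T, K: Ord>(items: Vec<T>, key: impl Fn(&T) -> K) -> Vec<T> {
    let mut pairs: Vec<Pair<K, T>> = items.into_iter().map(|v| Pair(key(&v), v)).collect();
    pairs.sort();
    pairs.into_iter().map(|Pair(_, t)| t).collect()
}
```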
Additionally, in the standard library, `sort_by_cached_key` came much later than `sort_by_key`, which (given the amount of overhead involved in new features) suggests to me that `sort_by_key` not caching originally wasn't a design choice that was considered and rejected, but rather missed entirely.
However, the call is up to you, and I'd appreciate it if you could specify whether you want the first, the second, or both options included.
> Re: keyed sorts, I'm afraid I don't quite understand the part of your comment about the `Vec` being fine. Just to make sure we're all on the same page, my understanding is that there are two ways to do this with a single allocation:
Well described. Please go for the space-efficient solution.
Done. With regard to tests, would you be able to be more specific about what you think should be covered by quickcheck? I assume you don't mean all six variations separately, since they're mostly implemented in terms of each other.
Maybe you can look at what's already there and extend it. If there are no quickcheck tests yet, you could e.g. collect into a vector, sort the vector, and get the first/last `k` items - and assert that the result is the same as if you'd done `k_smallest` (or one of its cousins).
I'd appreciate if you'd write tests for each variation. You're right, they are implemented in terms of each other, but the tests shouldn't assume this.
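A std-only sketch of that property might look as follows (no quickcheck dependency; the `k_smallest` here is a stand-in built on std's `BinaryHeap`, where the real test would call the itertools function under review instead):

```rust
use std::collections::BinaryHeap;

// Stand-in for the function under test: keep a max-heap of the k smallest
// elements seen so far, evicting the largest when it grows past k.
fn k_smallest(iter: impl IntoIterator<Item = i32>, k: usize) -> Vec<i32> {
    let mut heap = BinaryHeap::new();
    for x in iter {
        heap.push(x);
        if heap.len() > k {
            heap.pop();
        }
    }
    heap.into_sorted_vec()
}

// The suggested property: the result must equal "sort the whole input,
// keep the first k elements".
fn matches_reference(input: Vec<i32>, k: usize) -> bool {
    let mut expected = input.clone();
    expected.sort();
    expected.truncate(k);
    k_smallest(input, k) == expected
}
```

With the quickcheck crate, `matches_reference` could then be registered as a property over arbitrary `(Vec<i32>, usize)` inputs, once per variation.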
I've just added a quickcheck that tests all 6 variations against fully sorting the input data; let me know if you think anything else should be added.
The build failed earlier because of an accidental reference to `std::vec::Vec`, which is now corrected.
What else is this PR waiting on?
Hi there, first of all: Thanks @ejmount for staying in touch.

I saw that you implement `k_smallest` in terms of your custom heap (i.e. it delegates to `k_smallest_by`). However, this potentially degrades performance, as Rust's `BinaryHeap` uses "holes" to improve performance.

Writing one's own binary heap turns out to be tricky, so I'd lean toward going with a safe implementation for now, sacrificing the holes for the more general variants, but possibly stating this in the documentation. We can then try later to write a safe custom heap.

That is, before merging I'd suggest:

- `k_smallest` should use Rust's own `BinaryHeap` to ensure performance does not degrade.
- The `k_smallest_...` variants should for now use a safe custom binary heap (I don't feel competent to review an `unsafe` binary heap), but should state this in the docs.
- `k_smallest` and the new variants should share code, possibly abstracting the heap type via traits.

Before continuing on this, I'd ask @jswrenn and/or @scottmcm for their opinion on this.
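The first suggestion could look roughly like this hedged sketch (not the PR's actual code): fill a `BinaryHeap` with the first k elements, then replace the root in place through `peek_mut`, which reuses std's own sift-down - holes and all - when the guard is dropped.

```rust
use std::collections::BinaryHeap;

// Sketch of `k_smallest` on top of std's BinaryHeap (a max-heap), so std's
// optimized sift-down is reused instead of a hand-rolled heap.
fn k_smallest_std<T: Ord>(iter: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    if k == 0 {
        return Vec::new();
    }
    let mut iter = iter.into_iter();
    // Seed the heap with the first k elements; the root is the worst candidate.
    let mut heap: BinaryHeap<T> = iter.by_ref().take(k).collect();
    for x in iter {
        if let Some(mut top) = heap.peek_mut() {
            // Writing through PeekMut re-sifts the heap when `top` is dropped.
            if x < *top {
                *top = x;
            }
        }
    }
    heap.into_sorted_vec()
}
```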
This came back to my attention recently, and I've made the changes you suggested for `k_smallest` (although not `k_largest`, it having no existing callers), but I couldn't see any particularly good way to share functionality between that and the other variants, given that `k_smallest_general` is very self-contained.

AFAICT, the code as it stands is identical, in both functionality and performance, from the POV of any existing users. Assuming my update just now addresses your concerns in the previous post, are there any other blockers for this being merged? An earlier comment is still flagged as outstanding, but I don't think it's still applicable.
Great work! I was just looking for this functionality, as I need `k_largest_by` for an algorithm I am working on. I see the PR has become a bit stale; is there something I can do to help this PR over the finish line?

@douweschulte I don't know the details about this one. It's definitely something I'd like to see merged at some point though.

@phimuemue Can you shed some light on those extensions of `k_smallest`?
I would like to revive this (most-loved) PR, and update this myself if needed. Based on https://github.com/rust-itertools/itertools/pull/654#issuecomment-1349675510:

@jswrenn and/or @scottmcm What is your opinion on this?

I have some additional info now as well. Since the above comment, I have copy-pasted this PR's code into a couple of my own projects. I have only used the `k_largest` function, but it has worked wonders!
@phimuemue I assigned myself here, so I agree to take this off your hands.

Agree to make the code more compact.

(About `Vec`, see other comment.)
@ejmount Hello, I was not sure you would still be around, so it's nice to see you here. I'll be around as well, and this is definitely top-prio for me. Since the time you worked on this, the repo has finally been rustfmt'ed (and CI rejects compiler and clippy warnings), so I think the current conflicts are just about formatting. Personally, I would first squash everything and rebase on master to start clean. Or we can end with that, as you wish. The main interest would be to be able to run the CI.
Hi @Philippe-Cholet
> Since the time you worked on this, the repo has finally been rustfmt'ed (and CI rejects compiler and clippy warnings), so I think the current conflicts are just about formatting. Personally, I would first squash everything and rebase on master to start clean. Or we can end with that, as you wish.
I've rebased all of this branch on top of the latest master, so that should clear up any formatting conflicts. (And that's very handy that I don't have to be careful not to check in unrelated formatting, since I run fmt by default)
I've also done my best to address all the review comments you and @phimuemue have left, let me know if anything still needs to be changed. I'm not sure if GH understands what just happened to the history, but here is the combined diff since phimuemue's last comment for convenience.
All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 94.61%. Comparing base (`6814180`) to head (`2cd44fb`). Report is 18 commits behind head on master.
GitHub seems to want @phimuemue's original comment checked off before this can land - I'm not sure if there's anything I can do about that on my end, given all their comments have been resolved in the meantime.
@phimuemue Prior to this and now, `k_smallest` returns `VecIntoIter` (changing that would be a breaking change), so for me the variants should do the same. And if the user collects to a vector immediately, then there is no reallocation, so that's not an issue.

It's true that different return types for such similar things are a bit inconsistent.

EDIT: `sorted` (& Co.) also return `VecIntoIter`. `min_set` (& Co.) returning vectors are the exception, not the rule.
This PR resolves #586 by adding the following public functions:

I used a custom binary heap to implement them (and rewrite `k_smallest`) so that the logic could be agnostic as to where the memory came from. This is because I was originally motivated to make this PR by implementing `k_smallest` for const sizes, so as to not require an allocator. Using a custom heap implementation makes that extension straightforward, but I've not included it just now to make this first PR more manageable. Unfortunately, this requires an MSRV of 1.36, although the clippy config seems to indicate that's already the case. (The heap type is not publicly exposed to crate users; it is only an implementation detail.)

The quickcheck test for `k_smallest` is completely untouched and still passes, so I am confident the heap logic is correct. I've also added doctests to some but not all of the new functions, which also pass. Let me know if you feel more documentation (internally or externally facing) would be useful.