Open Kerollmops opened 6 years ago
This issue tracks:
impl<T> [T] {
pub fn partition_dedup(&mut self) -> (&mut [T], &mut [T])
where
T: PartialEq,
{
self.partition_dedup_by(|a, b| a == b)
}
pub fn partition_dedup_by<F>(&mut self, mut same_bucket: F) -> (&mut [T], &mut [T])
where
F: FnMut(&mut T, &mut T) -> bool,
{…}
pub fn partition_dedup_by_key<K, F>(&mut self, mut key: F) -> (&mut [T], &mut [T])
where
F: FnMut(&mut T) -> K,
K: PartialEq,
{
self.partition_dedup_by(|a, b| key(a) == key(b))
}
}
The doc-comment of partition_dedup_by
(the others are similar) is:
Moves all but the first of consecutive elements to the end of the slice satisfying a given equality relation.
Returns two slices. The first contains no consecutive repeated elements. The second contains all the duplicates in no specified order.
The
same_bucket
function is passed references to two elements from the slice and must determine if the elements compare equal. The elements are passed in opposite order from their order in the slice, so ifsame_bucket(a, b)
returnstrue
,a
is moved at the end of the slice.If the slice is sorted, the first returned slice contains no duplicates.
Examples
#![feature(slice_partition_dedup)] let mut slice = ["foo", "Foo", "BAZ", "Bar", "bar", "baz", "BAZ"]; let (dedup, duplicates) = slice.partition_dedup_by(|a, b| a.eq_ignore_ascii_case(b)); assert_eq!(dedup, ["foo", "BAZ", "Bar", "baz"]); assert_eq!(duplicates, ["bar", "Foo", "BAZ"]);
From the implementation PR:
The benefits of adding these methods is that it is now possible to:
- deduplicate a slice without having to allocate and possibly clone elements on the heap, really useful for embedded stuff that can't allocate for example.
- not loose duplicate elements, because, when using
Vec::dedup
, duplicates elements are dropped. These methods add more flexibillity to the user.
These methods feel very niche to me (given we already have the Vec::dedup
family of methods), similar to some things we currently have in https://crates.io/crates/itertools rather then std::iter
. Still, multiple people were involved in the implementation PR and were evidently happy enough to land it. So I’d be inclined to propose FCP to stabilize.
@rust-lang/libs, any thoughts?
Note that Vec::dedup
is actually implemented in terms of these methods, so it's very important to anyone trying to implement vec-like replacements that these become available on stable.
@leonardo-m brought up the question of whether the second element of the returned tuple is even used. If not, it would simplify the API to return only one slice instead of a tuple. So it would be really useful to know how people use this feature. If it's merely used as a way to dedup
without allocation, then yes, the first slice is sufficient. All usages I was able to find via a quick GH search indeed only used the first returned slice.
But as I agree this is pretty niche, I think it's fine if the API is a bit more complicated. Niche APIs have a stronger focus on being powerful than being easy to use, I think.
My initial thought also was that this API is too niche and shouldn't be included, but apparently it is used a lot (even excluding Rust forks, there are quite a few projects using it on the first few pages of the search results). So I'd be fine with stabilizing this as soon as the return type question is resolved.
As far as I can tell if the method returns one slice the other behavior can still be achieved by only keeping the length of that returned slice and then using split_at_mut
on the original slice.
I thought that it would be great to also add the #[must_use]
attribute to these functions.
The link above gives ~1500 results. Excluding the word "slice_internals" and "fmt_internals" and forcing the language to be Rust leaves only 12 results. So 99% of this "lot" are very likely just rustc clones.
A more accurate search would be:
https://github.com/search?l=Rust&q=partition_dedup+NOT+slice_internals+NOT+fmt_internals+NOT+test_binary_search_implementation_details&type=Code (12 results) https://github.com/search?l=Rust&q=partition_dedup_by+NOT+slice_internals+NOT+fmt_internals+NOT+test_binary_search_implementation_details+NOT+vec_intoiter_debug&type=Code (10 results) https://github.com/search?l=Rust&q=partition_dedup_by_key+NOT+slice_internals+NOT+fmt_internals+NOT+vec_intoiter_debug&type=Code (1 result)
Only 3 examples actually try to use the second element:
https://github.com/treuherz/AdventOfCode/blob/152da4d32c4d6a258ec669f2fc175db99a4e1df8/18/rs/src/bin/2.rs#L66-L84 uses the second part of partition_dedup()
to locate the first duplicated element.
let (_, dupes) = working.partition_dedup();
if !dupes.is_empty() {
assert_eq!(dupes.len(), 1);
return dupes.first().unwrap().to_string();
}
but I don't think it is a good example, as it's easier to just write
if let Some(first_dupe) = working.windows(2).find(|(p, q)| p == q) {
return first_dupe.to_string();
}
https://github.com/sachaarbonel/poker.rs/blob/1f4ef45872307b1a3c40795ff5b125b6d82cd68f/src/bin.rs uses the second part to determine if the poker hand has pairs/triples/4 of a kind. But the message already shows how unreliable it is for the purpose.
let (dedup, _dup) = numbers.partition_dedup();
println!("dedup {:?} dup {:?}",dedup, _dup );
match _dup.len() {
0 => println!("no pair"),
1 => println!("one pair"),
2 => println!("three of a kind or two pairs, do an other match"),
3 => println!("four of a kind"),
_ => println!("not handled"),
}
The actual implementation in https://github.com/sachaarbonel/poker.rs/blob/1f4ef45872307b1a3c40795ff5b125b6d82cd68f/src/hand/mod.rs uses two calls of partition_dedup
to distinguish two-pair (AA22) vs triples (AAA) and four-of-a-kind (AAAA) vs full-house (AAA22).
IMO partition_dedup
is far from useful as a poker hand classifier since it can't handle flushes or straights.
https://github.com/smmalis37/aoc2019/blob/c1015c7d2bba8075176650ac84f4f3c53b63e1d3/src/days/day10.rs#L86-L101 Probably the only legit use case though I can't quite figure out what it's trying to do.
loop {
subangles.sort_unstable_by_key(|ad| ad.1);
subangles.sort_by_key(|ad| ad.0);
let (left, right) =
subangles.partition_dedup_by(|&mut ad1, &mut ad2| float_equals(ad1.0, ad2.0));
if seek < left.len() {
let (angle, distance) = left[seek];
let coord = angle_distance_to_coord(part1_coord, angle, distance);
return (coord, coord.x * 100 + coord.y);
} else {
seek -= left.len();
subangles = right;
}
}
Conclusion: The second element is extremely niche, and encourages bad algorithm. As mentioned in https://github.com/rust-lang/rust/issues/54279#issuecomment-572527658, the original slice is not destroyed, so the two-element version can be easily reimplemented by the one-element version using split_at_mut
(heck, the current implementation also uses split_at_mut
):
fn partition_dedup(&mut self) -> (&mut Self, &mut Self) {
let p = self.dedup_in_place().len();
self.split_at_mut(p)
}
There is no reason to keep the second element from the return type.
@kennytm Wow, thanks for the research! I clearly didn't put enough time into my comment and was indeed fooled by my search.
In case we change the return type to only return the first part of the slice, maybe we should also think about renaming the methods. Shepmaster suggested the current name partially because of the 2-tuple return type. I don't mind the current name (the algorithm still does some kind of partitioning), but another name might be more appropriate: the return type is different than Iterator::partition
and as we seem to have established, almost no one is interested in the second bucket. What about dedup_in_place
?
Note: The side effect of this method is that the input slice is swapped all around on success so that the initial span of the spice is the deduped portion. Thus, must_use would be inappropriate.
Oh hey I've been mentioned. Yeah that use case is very much artificial, it was part of the Advent of Code, a series of puzzles. The puzzle for day 10 can be found at https://adventofcode.com/2019/day/10, although you'd need to solve part 1 before it reveals part 2 which this was part of. However as you said i could easily just use split_at_mut
and be fine.
I would be inclined to remove these methods. This is an uncommon use case. The algorithm can be implemented in ~20 lines of safe code if you don't mind the bounds checks. I think the amount of benefit we'd deliver by stabilizing these is much less than what it costs to put these fairly complicated names and signatures in front of everyone in the documentation of slices.
I'd vote against removing these methods. Even if their use is uncommon they are genuinely useful. They also fit with other dedup*
methods that are already in std.
As I said, those are implemented in terms of these, so these cannot be removed entirely.
Users of Vec already see this complexity, in Stable, there's no reason that it can't be available to slices too.
The ones on Vec are simpler signatures, simpler names, and simpler behavior to understand, and serve a somewhat more common use case. So I don't find that those necessarily provide justification for the new slice methods.
Regarding "cannot be removed entirely": they can be removed entirely from the API of the standard library.
I would be inclined to remove these methods.
Are you talking about only the "complicated" version:
fn partition_dedup(&mut self) -> (&mut [T], &mut [T])
or also the "simplified" version
fn partition_dedup(&mut self) -> &mut [T]
I would like to see them put on normal slices so that other crates with vec-like APIs can simply use the slice versions.
But i'm fine with the simple version, since that's what Vec offers.
If the intended audience is crates that provide a vec-like API, I really don't think that justifies adding these to slice. It's an uncommon use case and they can write the 20 lines themselves. I don't think providing these methods makes a dent in the actual hard parts of designing and implementing a vec-like type downstream.
Or just anyone using slices who wants to be able to do this.
Unless your argument is that adding this in the first place to Vec was itself a mistake, then the same methods should be on slices too (the simplified versions that is).
FWIW I found a partition_dedup
method useful just now when completing an advent-of-code problem, for stably rearranging of a slice like [a, a, a, b, b, c, c, c]
into [a, b, c, a, b, c, a, c]
. (I'm sure there's a word for that rearrangement, but can't think of it right now...)
I know it's not a significant use case, but it was convenient to have it available to me (albeit only in nightly), and I did make use of both returned slices. It allowed me to express the rearrangement in only couple of lines of relatively simple code that was O(N) time / constant memory overhead.
Ignore me, I failed to read the documentation properly:
The second contains all the duplicates in no specified order.
Indeed, the relative order of the duplicates is not preserved, and I just got lucky on the AoC problem :(
Hey @SimonSapin,
Do you think that if I update the function name to be dedup_in_place
and the signature to return a single mutable slice it could be possible to merge this PR soon or later?
I think dedup_in_place
should return a tuple. The reason is that you'd expect it to work like it's Vec
counterpart, whereas it doesn't, and it would lead to potentially hard to debug bugs:
//this is ok to do with vec:
let vec = vec![1, 1, 2, 3];
vec.dedup()
// can use vec, no problem
// same case with slices
let mut slice = [1, 1, 2, 3];
slice.dedup_in_place();
// uses slice, oops
// or worst:
let mut slice = [1, 1, 2, 3];
let new = slice.dedup_in_place();
// one may use slice, ignore new and refer to “removed” elements
returning a tuple will prevent the later, making it #[must_use]
shall prevent the former
let mut slice = [1, 1, 2, 3]; let new = slice.dedup_in_place(); // one may use slice, ignore new and refer to “removed” elements
You can't ignore new
without also ignoring the unused_variables
warning.
first case holds
which suggests #[must_use]
is enough.
and returning a tuple prevents none of the issues raised in your sample code anyway.
Fair enough. I am just concerned with the potential misuse of this function, but #[must_use]
should be enough.
One of my projects could definitely profit from having these functions stabilized. To answer the above questions from my point of view as potential user:
Should the methods only return one slice instead of a tuple with two slices? (comment)
For my personal use case I could use the second returned value but infering all information I need can also be done by only the first without overhead. The main use of these functions is for constrained or performance critical code. If we can infer all information from just the deduplicated slice without noticable overhead we do not need the second slice returned imo.
Annotate it with
#[must_use]
?
As said earlier in this discussion I do not see a value in putting #[must_use]
here since its side effects of scrambling the underlying slice in a certain way could already be useful to some users. Could we not stabilize it with #[must_use]
and later remove it if we find use cases that would profit from having it removed? Should be a backwards compatible change, right? So if we stabilize this with #[must_use]
we simply delay decisions for or against it. Does not need to be answered in order to stabilize this.
I see a clear value in having those definitions available since they are 20 lines of code that I do not have to write myself and that I can simply trust in my code base to be correctly implemented. If I were to write them myself I would maybe even end up with an unsafe
implementation (after all it needs to be efficient) and beyond those 20 lines of mentioned code there is sitting technical doubt and thus a lot of test code.
If we choice to only return the dedup part, I believe we should rename dedup_in_place
or something like that
So, I was very confused that this method is the only one with "partition" in the name for slice when it only does deduplication, only to find that the actual partition method is called partition_in_place
and is on Iterator
(#62543).
The naming of these methods is a bit confusing, and I would be in favour of removing the "partition" terminology to help avoid that confusion.
Wanted to chime in and say that, for my purposes, having it return two vectors has been very useful (both the deduplicated result, and the items that were removed). If the opinion goes in the opposite direction (to only return the deduplicated result, not the removed duplicates), how would I go about getting a vector of removed duplicates? Sorry if that's a silly question.
@Kerollmops , it seems I found bug.
I usually use partition_dedup_by_key
to merely assert that (presumably sorted) vector doesn't have duplicates.
I. e. my code usually looks like this: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=ee5704261d8bcb084b30e8a7de2b28ff . So far, so good.
But if I change type of key
to String
, problems appear. Now partition_dedup_by_key(|x|x.key)
doesn't compile, and partition_dedup_by_key(|x|&x.key)
doesn't compile, either. Here is two playgrounds:
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=fee945c939074d4d1373f3a52e1b2c8c https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=9a3ff4934bd45b910b3e925fbfdac875
But surprisingly partition_dedup_by
works! Look: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=8556cc139a6f75c575dd0d5300211120
So, it seems that wrapper partition_dedup_by_key
has wrong prototype. Same applies to sort_by_key
@safinaskar it's by design not a bug, are you asking these method should return a reference like F: FnMut(&mut T) -> &K,
? I don't think it's possible. If we remove this &mut
is should be possible.
PS: F: FnMut(&mut T) -> K,
Thus the fact it's give you a mutable reference to T is weird for me.
A bit of a pedantic observation, but the sort_by_key
method has T: FnMut(&T) -> K
, should that be matched or should &mut T
be handed out because it can be?
That would still not solve the "bug" of course, and I don't think that bug can really be solved. Sure, you could make the signature F: FnMut(&mut T) -> &K
(since they are separate elements even a mutable reference should be fine for this), but this would essentially require that K is a field of T, which makes the method very limited. And it wouldn't match the signature for the other _by_key
methods.
Let me say another way. In my code I often want to dedup vectors of structs, which have String
fields. Most of the time I just need to verify that there is no duplicates (i. e. what I really need in most of cases is is_dedupped
).
Unfortunately the only way I can do this is to clone()
this strings. This is slow. See here:
So I want some API that works in this case without cloning. Or at least which give ability to verify is_dedupped
@safinaskar
#![feature(slice_partition_dedup)]
fn main() {
struct Foo {
f: String,
}
let mut a: Vec<Foo> = vec![];
a.partition_dedup_by(|a, b| a.f == b.f);
}
@Stargateur , I don't like this solution. I'm big fan of "don't repeat yourself" principle. But, yes, this will go as pragmatic solution
The issue with the by_key
functions is tracked in #34162.
What is the status of this feature?
Add three methods to the
slice
type, thepartition_dedup
,partition_dedup_by
andpartition_dedup_by_key
. The two other methods are based onslice::partition_dedup_by
.Steps / History
Unresolved Questions
#[must_use]
?