rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.32k stars 12.72k forks source link

Tracking issue for `slice::partition_dedup/by/by_key` #54279

Open Kerollmops opened 6 years ago

Kerollmops commented 6 years ago

Add three methods to the slice type, the partition_dedup, partition_dedup_by and partition_dedup_by_key. The two other methods are based on slice::partition_dedup_by.

Steps / History

Unresolved Questions

SimonSapin commented 4 years ago

This issue tracks:

impl<T> [T] {
    pub fn partition_dedup(&mut self) -> (&mut [T], &mut [T])
    where
        T: PartialEq,
    {
        self.partition_dedup_by(|a, b| a == b)
    }

    pub fn partition_dedup_by<F>(&mut self, mut same_bucket: F) -> (&mut [T], &mut [T])
    where
        F: FnMut(&mut T, &mut T) -> bool,
    {…}

    pub fn partition_dedup_by_key<K, F>(&mut self, mut key: F) -> (&mut [T], &mut [T])
    where
        F: FnMut(&mut T) -> K,
        K: PartialEq,
    {
        self.partition_dedup_by(|a, b| key(a) == key(b))
    }
}

The doc-comment of partition_dedup_by (the others are similar) is:

Moves all but the first of consecutive elements to the end of the slice satisfying a given equality relation.

Returns two slices. The first contains no consecutive repeated elements. The second contains all the duplicates in no specified order.

The same_bucket function is passed references to two elements from the slice and must determine if the elements compare equal. The elements are passed in opposite order from their order in the slice, so if same_bucket(a, b) returns true, a is moved at the end of the slice.

If the slice is sorted, the first returned slice contains no duplicates.

Examples

#![feature(slice_partition_dedup)]

let mut slice = ["foo", "Foo", "BAZ", "Bar", "bar", "baz", "BAZ"];

let (dedup, duplicates) = slice.partition_dedup_by(|a, b| a.eq_ignore_ascii_case(b));

assert_eq!(dedup, ["foo", "BAZ", "Bar", "baz"]);
assert_eq!(duplicates, ["bar", "Foo", "BAZ"]);

From the implementation PR:

The benefits of adding these methods is that it is now possible to:

  • deduplicate a slice without having to allocate and possibly clone elements on the heap, really useful for embedded stuff that can't allocate for example.
  • not loose duplicate elements, because, when using Vec::dedup, duplicates elements are dropped. These methods add more flexibillity to the user.
SimonSapin commented 4 years ago

These methods feel very niche to me (given we already have the Vec::dedup family of methods), similar to some things we currently have in https://crates.io/crates/itertools rather then std::iter. Still, multiple people were involved in the implementation PR and were evidently happy enough to land it. So I’d be inclined to propose FCP to stabilize.

@rust-lang/libs, any thoughts?

Lokathor commented 4 years ago

Note that Vec::dedup is actually implemented in terms of these methods, so it's very important to anyone trying to implement vec-like replacements that these become available on stable.

LukasKalbertodt commented 4 years ago

@leonardo-m brought up the question of whether the second element of the returned tuple is even used. If not, it would simplify the API to return only one slice instead of a tuple. So it would be really useful to know how people use this feature. If it's merely used as a way to dedup without allocation, then yes, the first slice is sufficient. All usages I was able to find via a quick GH search indeed only used the first returned slice.

But as I agree this is pretty niche, I think it's fine if the API is a bit more complicated. Niche APIs have a stronger focus on being powerful than being easy to use, I think.

My initial thought also was that this API is too niche and shouldn't be included, but apparently it is used a lot (even excluding Rust forks, there are quite a few projects using it on the first few pages of the search results). So I'd be fine with stabilizing this as soon as the return type question is resolved.

SimonSapin commented 4 years ago

As far as I can tell if the method returns one slice the other behavior can still be achieved by only keeping the length of that returned slice and then using split_at_mut on the original slice.

Kerollmops commented 4 years ago

I thought that it would be great to also add the #[must_use] attribute to these functions.

kennytm commented 4 years ago

The link above gives ~1500 results. Excluding the word "slice_internals" and "fmt_internals" and forcing the language to be Rust leaves only 12 results. So 99% of this "lot" are very likely just rustc clones.

A more accurate search would be:

https://github.com/search?l=Rust&q=partition_dedup+NOT+slice_internals+NOT+fmt_internals+NOT+test_binary_search_implementation_details&type=Code (12 results) https://github.com/search?l=Rust&q=partition_dedup_by+NOT+slice_internals+NOT+fmt_internals+NOT+test_binary_search_implementation_details+NOT+vec_intoiter_debug&type=Code (10 results) https://github.com/search?l=Rust&q=partition_dedup_by_key+NOT+slice_internals+NOT+fmt_internals+NOT+vec_intoiter_debug&type=Code (1 result)

Only 3 examples actually try to use the second element:

  1. https://github.com/treuherz/AdventOfCode/blob/152da4d32c4d6a258ec669f2fc175db99a4e1df8/18/rs/src/bin/2.rs#L66-L84 uses the second part of partition_dedup() to locate the first duplicated element.

        let (_, dupes) = working.partition_dedup();
        if !dupes.is_empty() {
            assert_eq!(dupes.len(), 1);
            return dupes.first().unwrap().to_string();
        }

    but I don't think it is a good example, as it's easier to just write

        if let Some(first_dupe) = working.windows(2).find(|(p, q)| p == q) {
            return first_dupe.to_string();
        }
  2. https://github.com/sachaarbonel/poker.rs/blob/1f4ef45872307b1a3c40795ff5b125b6d82cd68f/src/bin.rs uses the second part to determine if the poker hand has pairs/triples/4 of a kind. But the message already shows how unreliable it is for the purpose.

    let (dedup, _dup) = numbers.partition_dedup();
    println!("dedup {:?} dup {:?}",dedup, _dup );
    match _dup.len() {
         0 => println!("no pair"),
         1 => println!("one pair"),
         2 => println!("three of a kind or two pairs, do an other match"),
         3 => println!("four of a kind"),
         _ => println!("not handled"),
    }

    The actual implementation in https://github.com/sachaarbonel/poker.rs/blob/1f4ef45872307b1a3c40795ff5b125b6d82cd68f/src/hand/mod.rs uses two calls of partition_dedup to distinguish two-pair (AA22) vs triples (AAA) and four-of-a-kind (AAAA) vs full-house (AAA22).

    IMO partition_dedup is far from useful as a poker hand classifier since it can't handle flushes or straights.

  3. https://github.com/smmalis37/aoc2019/blob/c1015c7d2bba8075176650ac84f4f3c53b63e1d3/src/days/day10.rs#L86-L101 Probably the only legit use case though I can't quite figure out what it's trying to do.

    loop {
        subangles.sort_unstable_by_key(|ad| ad.1);
        subangles.sort_by_key(|ad| ad.0);
        let (left, right) =
            subangles.partition_dedup_by(|&mut ad1, &mut ad2| float_equals(ad1.0, ad2.0));
        if seek < left.len() {
            let (angle, distance) = left[seek];
            let coord = angle_distance_to_coord(part1_coord, angle, distance);
            return (coord, coord.x * 100 + coord.y);
        } else {
            seek -= left.len();
            subangles = right;
        }
    }

Conclusion: The second element is extremely niche, and encourages bad algorithm. As mentioned in https://github.com/rust-lang/rust/issues/54279#issuecomment-572527658, the original slice is not destroyed, so the two-element version can be easily reimplemented by the one-element version using split_at_mut (heck, the current implementation also uses split_at_mut):

fn partition_dedup(&mut self) -> (&mut Self, &mut Self) {
    let p = self.dedup_in_place().len();
    self.split_at_mut(p)
}

There is no reason to keep the second element from the return type.

LukasKalbertodt commented 4 years ago

@kennytm Wow, thanks for the research! I clearly didn't put enough time into my comment and was indeed fooled by my search.

In case we change the return type to only return the first part of the slice, maybe we should also think about renaming the methods. Shepmaster suggested the current name partially because of the 2-tuple return type. I don't mind the current name (the algorithm still does some kind of partitioning), but another name might be more appropriate: the return type is different than Iterator::partition and as we seem to have established, almost no one is interested in the second bucket. What about dedup_in_place?

Lokathor commented 4 years ago

Note: The side effect of this method is that the input slice is swapped all around on success so that the initial span of the spice is the deduped portion. Thus, must_use would be inappropriate.

smmalis37 commented 4 years ago

Oh hey I've been mentioned. Yeah that use case is very much artificial, it was part of the Advent of Code, a series of puzzles. The puzzle for day 10 can be found at https://adventofcode.com/2019/day/10, although you'd need to solve part 1 before it reveals part 2 which this was part of. However as you said i could easily just use split_at_mut and be fine.

dtolnay commented 4 years ago

I would be inclined to remove these methods. This is an uncommon use case. The algorithm can be implemented in ~20 lines of safe code if you don't mind the bounds checks. I think the amount of benefit we'd deliver by stabilizing these is much less than what it costs to put these fairly complicated names and signatures in front of everyone in the documentation of slices.

smmalis37 commented 4 years ago

I'd vote against removing these methods. Even if their use is uncommon they are genuinely useful. They also fit with other dedup* methods that are already in std.

Lokathor commented 4 years ago

As I said, those are implemented in terms of these, so these cannot be removed entirely.

Users of Vec already see this complexity, in Stable, there's no reason that it can't be available to slices too.

dtolnay commented 4 years ago

The ones on Vec are simpler signatures, simpler names, and simpler behavior to understand, and serve a somewhat more common use case. So I don't find that those necessarily provide justification for the new slice methods.

Regarding "cannot be removed entirely": they can be removed entirely from the API of the standard library.

shepmaster commented 4 years ago

I would be inclined to remove these methods.

Are you talking about only the "complicated" version:

fn partition_dedup(&mut self) -> (&mut [T], &mut [T])

or also the "simplified" version

fn partition_dedup(&mut self) -> &mut [T]
Lokathor commented 4 years ago

I would like to see them put on normal slices so that other crates with vec-like APIs can simply use the slice versions.

But i'm fine with the simple version, since that's what Vec offers.

dtolnay commented 4 years ago

If the intended audience is crates that provide a vec-like API, I really don't think that justifies adding these to slice. It's an uncommon use case and they can write the 20 lines themselves. I don't think providing these methods makes a dent in the actual hard parts of designing and implementing a vec-like type downstream.

Lokathor commented 4 years ago

Or just anyone using slices who wants to be able to do this.

Unless your argument is that adding this in the first place to Vec was itself a mistake, then the same methods should be on slices too (the simplified versions that is).

Joey9801 commented 4 years ago

FWIW I found a partition_dedup method useful just now when completing an advent-of-code problem, for stably rearranging of a slice like [a, a, a, b, b, c, c, c] into [a, b, c, a, b, c, a, c]. (I'm sure there's a word for that rearrangement, but can't think of it right now...)

I know it's not a significant use case, but it was convenient to have it available to me (albeit only in nightly), and I did make use of both returned slices. It allowed me to express the rearrangement in only couple of lines of relatively simple code that was O(N) time / constant memory overhead.

Ignore me, I failed to read the documentation properly:

The second contains all the duplicates in no specified order.

Indeed, the relative order of the duplicates is not preserved, and I just got lucky on the AoC problem :(

Kerollmops commented 4 years ago

Hey @SimonSapin,

Do you think that if I update the function name to be dedup_in_place and the signature to return a single mutable slice it could be possible to merge this PR soon or later?

MarinPostma commented 4 years ago

I think dedup_in_place should return a tuple. The reason is that you'd expect it to work like it's Vec counterpart, whereas it doesn't, and it would lead to potentially hard to debug bugs:

//this is ok to do with vec:
let vec = vec![1, 1, 2, 3];
vec.dedup()
// can use vec, no problem
// same case with slices
let mut slice = [1, 1, 2, 3];
slice.dedup_in_place();
// uses slice, oops
// or worst:
let mut slice = [1, 1, 2, 3];
let new = slice.dedup_in_place();
//  one may use slice, ignore new and refer to “removed” elements

returning a tuple will prevent the later, making it #[must_use] shall prevent the former

kennytm commented 4 years ago
let mut slice = [1, 1, 2, 3];
let new = slice.dedup_in_place();
//  one may use slice, ignore new and refer to “removed” elements

You can't ignore new without also ignoring the unused_variables warning.

MarinPostma commented 4 years ago

first case holds

kennytm commented 4 years ago

which suggests #[must_use] is enough.

and returning a tuple prevents none of the issues raised in your sample code anyway.

MarinPostma commented 4 years ago

Fair enough. I am just concerned with the potential misuse of this function, but #[must_use] should be enough.

Robbepop commented 4 years ago

One of my projects could definitely profit from having these functions stabilized. To answer the above questions from my point of view as potential user:

Should the methods only return one slice instead of a tuple with two slices? (comment)

For my personal use case I could use the second returned value but infering all information I need can also be done by only the first without overhead. The main use of these functions is for constrained or performance critical code. If we can infer all information from just the deduplicated slice without noticable overhead we do not need the second slice returned imo.

Annotate it with #[must_use]?

As said earlier in this discussion I do not see a value in putting #[must_use] here since its side effects of scrambling the underlying slice in a certain way could already be useful to some users. Could we not stabilize it with #[must_use] and later remove it if we find use cases that would profit from having it removed? Should be a backwards compatible change, right? So if we stabilize this with #[must_use] we simply delay decisions for or against it. Does not need to be answered in order to stabilize this.

I see a clear value in having those definitions available since they are 20 lines of code that I do not have to write myself and that I can simply trust in my code base to be correctly implemented. If I were to write them myself I would maybe even end up with an unsafe implementation (after all it needs to be efficient) and beyond those 20 lines of mentioned code there is sitting technical doubt and thus a lot of test code.

Stargateur commented 3 years ago

If we choice to only return the dedup part, I believe we should rename dedup_in_place or something like that

clarfonthey commented 2 years ago

So, I was very confused that this method is the only one with "partition" in the name for slice when it only does deduplication, only to find that the actual partition method is called partition_in_place and is on Iterator (#62543).

The naming of these methods is a bit confusing, and I would be in favour of removing the "partition" terminology to help avoid that confusion.

emmyoh commented 1 year ago

Wanted to chime in and say that, for my purposes, having it return two vectors has been very useful (both the deduplicated result, and the items that were removed). If the opinion goes in the opposite direction (to only return the deduplicated result, not the removed duplicates), how would I go about getting a vector of removed duplicates? Sorry if that's a silly question.

safinaskar commented 1 year ago

@Kerollmops , it seems I found bug.

I usually use partition_dedup_by_key to merely assert that (presumably sorted) vector doesn't have duplicates.

I. e. my code usually looks like this: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=ee5704261d8bcb084b30e8a7de2b28ff . So far, so good.

But if I change type of key to String, problems appear. Now partition_dedup_by_key(|x|x.key) doesn't compile, and partition_dedup_by_key(|x|&x.key) doesn't compile, either. Here is two playgrounds:

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=fee945c939074d4d1373f3a52e1b2c8c https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=9a3ff4934bd45b910b3e925fbfdac875

But surprisingly partition_dedup_by works! Look: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=8556cc139a6f75c575dd0d5300211120

So, it seems that wrapper partition_dedup_by_key has wrong prototype. Same applies to sort_by_key

Stargateur commented 1 year ago

@safinaskar it's by design not a bug, are you asking these method should return a reference like F: FnMut(&mut T) -> &K, ? I don't think it's possible. If we remove this &mut is should be possible.

PS: F: FnMut(&mut T) -> K, Thus the fact it's give you a mutable reference to T is weird for me.

TrolledWoods commented 1 year ago

A bit of a pedantic observation, but the sort_by_key method has T: FnMut(&T) -> K, should that be matched or should &mut T be handed out because it can be?

That would still not solve the "bug" of course, and I don't think that bug can really be solved. Sure, you could make the signature F: FnMut(&mut T) -> &K (since they are separate elements even a mutable reference should be fine for this), but this would essentially require that K is a field of T, which makes the method very limited. And it wouldn't match the signature for the other _by_key methods.

safinaskar commented 1 year ago

Let me say another way. In my code I often want to dedup vectors of structs, which have String fields. Most of the time I just need to verify that there is no duplicates (i. e. what I really need in most of cases is is_dedupped).

Unfortunately the only way I can do this is to clone() this strings. This is slow. See here:

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=7d753d1967e7413d1fa38d3727acab85

So I want some API that works in this case without cloning. Or at least which give ability to verify is_dedupped

Stargateur commented 1 year ago

@safinaskar

#![feature(slice_partition_dedup)]
fn main() {
    struct Foo {
        f: String,
    }
    let mut a: Vec<Foo> = vec![];
    a.partition_dedup_by(|a, b| a.f == b.f);
}
safinaskar commented 1 year ago

@Stargateur , I don't like this solution. I'm big fan of "don't repeat yourself" principle. But, yes, this will go as pragmatic solution

Jules-Bertholet commented 1 year ago

The issue with the by_key functions is tracked in #34162.

c410-f3r commented 10 months ago

What is the status of this feature?