rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
96.63k stars 12.48k forks source link

Tracking Issue for the GroupBy and GroupByMut iterators #80552

Closed Kerollmops closed 7 months ago

Kerollmops commented 3 years ago

Feature gate: #![feature(slice_group_by)]

This is a tracking issue for the GroupBy and GroupByMut iterators.

This feature exposes the group_by and group_by_mut methods on the slice and mutable slice types, these methods return the GroupBy and GroupByMut iterators structs respectively. Those two iterators return subslices of the original slice where a user-defined function returns true for two following elements of the slice.

Public API

These methods can return subslices that contains equal elements:

let slice = &[1, 1, 1, 3, 3, 2, 2, 2];

let mut iter = slice.group_by(|a, b| a == b);

assert_eq!(iter.next(), Some(&[1, 1, 1][..]));
assert_eq!(iter.next(), Some(&[3, 3][..]));
assert_eq!(iter.next(), Some(&[2, 2, 2][..]));
assert_eq!(iter.next(), None);

they can also be used to extract the sorted subslices:

let slice = &[1, 1, 2, 3, 2, 3, 2, 3, 4];

let mut iter = slice.group_by(|a, b| a <= b);

assert_eq!(iter.next(), Some(&[1, 1, 2, 3][..]));
assert_eq!(iter.next(), Some(&[2, 3][..]));
assert_eq!(iter.next(), Some(&[2, 3, 4][..]));
assert_eq!(iter.next(), None);

Steps / History

Unresolved Questions

purpleP commented 3 years ago

It's probably too late, but would you consider renaming this functions to a more proper name like split_on_change or group_on_change?

group_by kind of should give a transitive closure of a relationship.

mrpink76 commented 3 years ago

Just an alternative name suggestion: split_by_neighbour

marcospb19 commented 3 years ago

I think that split_on_change and split_by_neighbour are confusing because other split_* iterators skip elements, except for split_inclusive*.

Here a "Group" is a subslice of contiguous elements, each of the subslices is delimited by the element(s) where the predicate fails. Is there other important definition of a group that should be prioritized over this?

Minor thing, but I would prefer is there was a indication that the elements are an adjacent/contiguous subslice, is that a common definition of a group? Some names I thought about:

SkiFire13 commented 3 years ago

Is there other important definition of a group that should be prioritized over this?

I usually think of a group as a set of elements, not necessarily contiguous. This is the meaning it has for example in Java and SQL.

In the original RFC someone proposed calling them chunks_by or split_by, which IMO seems more fitting. Between the two I would go for the chunks_by. split_by seems to be overlapping with the Needle API, although you could also argue that this may be part of it.

lacasaprivata2 commented 3 years ago

Yes - group_by in most languages (and my own macros in c/c++ by example) does not require a contiguous range but rather the entire width of the iterator

lacasaprivata2 commented 3 years ago

In response to the comment in Java, the common approach in that language is to apply a

collect(___)

to the end of a stream (iterator)

and inside the parens choose the type of grouping, such as Collections.toList() or Collections.toMap(key_fn, value_fn)

vbfox commented 2 years ago

I'm looking at this issue after a confusion with itertool's group_by on Vec that behave the same (group only contiguous elements) and was wondering if the naming of the std proposal was better.

So i'm adding my voice here to a change of name, group_by doesn't evoke the fact that only contiguous elements are grouped. group_contiguous from @marcospb19 suggestions would feel clearer to me.

mjhanninen commented 2 years ago

Couple more candidates

rsalmei commented 2 years ago

For me the name is perfect, it has the same meaning as Python's groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby Looking forward to this!

purpleP commented 2 years ago

For me the name is perfect, it has the same meaning as Python's groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby Looking forward to this!

Not to argue with you as you’re just expressing your subjective opinion, but it got me thinking about it more deeply.

I’m not a linguist, but I think the function with group inside it’s name should produce groups by a particular criterion which doesn’t have anything to do with the ordering (kind of what SQL group by clause). And what pythons groupby (which I think was inspired by Haskell) is doing is more appropriately described as splitting the sequence and not grouping it’s elements.

I would like to use objective criteria here and not historical reasons. Are there any linguists or mathematicians here?

By the way I almost never seen pythons groupby used without sorting the sequence first. I mean it’s main use IMHO is to produce equivalence classes from the sequence.

obscurans commented 2 years ago

Mathematician here - don't think you're going to get much from math:

  1. The overriding meaning of "group" is the noun for abstract algebra
  2. Haskellers had no problem with going for groupBy

The clearest parallel, in my 2c, is to the str::split series, which returns iterator over slices. My proposal is to invert the sense of the predicate and make it split_between. group_between also seems reasonable.

jblachly commented 2 years ago

D calls this operation chunkBy, although Rust already uses chunk to mean fixed group of N elements in std::slice IIRC.

Re @purpleP comment about language and meaning: Agree that "group" does not semantically convey precisely what's happening. chunk_by IMO better conveys that the operation will "fail" if the Vec is not already sorted. However, given naming constraints (chunk already having a separate meaning in rust) and other languages' naming conventions, group_by seems the most sense at this time.

PS: Looking very much forward to this feature as I'm porting functional-style D to Rust; what's blocking besides naming?

[0] https://dlang.org/library/std/algorithm/iteration/chunk_by.html

SkiFire13 commented 2 years ago

In my opinion split_by (or some other variation starting with split) is the one that makes most sense.

I would like to voice against group_by because even though some languages use it for this exact same function, others don't. Many people have the expectation that a group is not limited to consecutive elements, and it would be broken if this method was named group_by. The current code examples (the first one in particular) aggravate this naming issue even more by making it look like the group will contain all the elements that match the predicate, matching the expectation of users whose previous language gave another meaning to group_by.

although Rust already uses chunk to mean fixed group of N elements in std::slice IIRC.

That's not entirely true, chunks can return a slice of less than N elements in the last iteration. And if you take away the fixed size constraint then the meaning is the same, just an iterator yielding subslices.

leonardo-m commented 2 years ago

I'm finding this function quite useful in my code.

It's similar to a split, but it takes a function (predicate) with two arguments (diadic) instead of a function with one argument. So if you don't like the "group_by" Python-inspired name (that takes a single argument predicate) then do you like split_on_adjacent? :-)

Regarding the implementation, currently it performs a linear scan:

if (self.predicate)(l, r) { len += 1 } else { break }

When (after the sort) the runs are long enough, a more aggressive strategy could be faster. Something like galloping of D language (exponential grow, followed by binary search), or an arithmetic grow followed by binary search. This can't work if the input slice isn't sorted.

Kerollmops commented 2 years ago

Hey @leonardo-m,

If you need something more polished than this nightly method in the standard library, I have also worked on a crate with something similar to the galloping algorithm you describe here. The library is called slice-group-by and is available on crates.io.

leonardo-m commented 2 years ago

I've just tried that crate and it works well (in my case it gives no speedup probably because my runs are generally very short).

leonardo-m commented 2 years ago

Another note, inside next() of GroupBy there's this line:

let (head, tail) = self.slice.split_at(len);

Inside slice::split_at there's this assert:

assert!(mid <= self.len());

I've seen that such assert isn't removed in the asm of my group_by usages.

I think it's because of this missed LLVM optimization: https://users.rust-lang.org/t/hare-and-tortoise-indexes-analysis/63985

So are benchmarks telling us that it is performance-wise worth giving a hint (inside next() of GroupBy) to help LLVM remove that assert on mid?

SkiFire13 commented 2 years ago

Note that with a carefully placed assert you can convince LLVM to remove any unnecessary bound check. https://godbolt.org/z/c4TxMM9Tr

leonardo-m commented 2 years ago

SkiFire13, despite I have some experience with optimizing Rust code and finding ways to remove bound tests, your code feels a bit magical. Very nice. I think that change is worth pushing in the Rust stdlib.

sdroege commented 2 years ago

One thing that I needed in my use-case in addition to what GroupBy (and the slice-group-by crate) provide is access to the remainder of the iterator at any point in time, i.e. an accessor for the GroupBy::slice field.

arniu commented 2 years ago

The chunks are slices and do not overlap.

So the name chunks_by / chunks_by_mut is a better one.

devsnek commented 1 year ago

i wanted to sort items from an iterator into groups so i searched "rust group iterator", found these methods, and was extremely confused for about 90 seconds until i had processed that these do something entirely different from what i am trying to do.

Thinking about this for a bit, find_runs or runs_by would be better names, in my opinion at least.

beck commented 1 year ago

Taking a cue from more-itertools, call it split_when and negate the predicate.

QuarticCat commented 1 year ago

Can't believe we've been stuck on naming a function for 2.5 years.

marcospb19 commented 1 year ago

Do you have a solution?

jblachly commented 1 year ago

Do you have a solution?

If indeed all that remains is to select a name, the solution is for someone with authority to do so to make an executive decision. If there is no such person with authority, that points to a larger organizational problem.

Kerollmops commented 1 year ago

Stabilization Report

Implementation History

A summary of the History of this function is included at the top of this tracking issue.

API Summary

A summary of the API is included at the top of this tracking issue.

Experience Reports

The Wasmtime project is using the slice-group-by library and, more specifically, what this PR introduces: the group_by version. I maintain this crate and is downloaded more than 297k times per month.


I think, at this point, we can reasonably stabilize this function. We have evidence that people want this function. Shall we stabilize this function on the slice type?

joshtriplett commented 1 year ago

@Kerollmops Thank you for the stabilization report!

Reviewing the long discussion on naming, here and on the original RFC, I think the two names that make the most sense are group_by (precedent in Python, among others), and chunk_by (similarity to existing use of "chunk" in Rust, and precedent in Ruby and D).

Stating in advance: I am fine with either name and have no objection to stabilizing under either name.

With that caveat stated, I personally think chunk_by is more intuitive, based on similarity with numerous other "chunk" functions in Rust.

I'm going to propose that we stabilize under the name chunk_by.

@rfcbot merge

rfcbot commented 1 year ago

Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

BurntSushi commented 1 year ago

I also like chunk_by better as well, for consistency with our use of "chunks" in other APIs.

jblachly commented 1 year ago

I also like chunk_by better as well, for consistency with our use of "chunks" in other APIs.

also semantically hinting that it would fail not return expected results if the collection is not sorted ordered as assumed, whereas "group" doesn't carry this implication

marcospb19 commented 1 year ago

it would fail if the collection is not sorted

The proposed functions do not make any assumptions about ordering.

chunk_by communicates that matched elements are sequential, is this what you meant?

jblachly commented 1 year ago

it would fail if the collection is not sorted

The proposed functions do not make any assumptions about ordering.

chunk_by communicates that matched elements are sequential, is this what you meant?

I apologize for imprecision. I've updated my comment to reflect my intention. (namely, that "group by" could carry an implicit meaning that it might not operate sequentially, as you suggest)

Zenithsiz commented 11 months ago

Would it be possible for group_by_mut to allow a Fn(&mut T, &mut T) instead? I don't have a specific use case, but looking at Vec::retain vs Vec::retain_mut, there are use-cases for predicates receiving unique references.

For example, a type might need to cache a result to use in it's PartialEq impl, which might require &mut to do efficiently without inner mutability. This would allow not having to calculate this cached value up-front before group_by_mut, as well as not have to do the expensive calculation twice during group_by_mut (since it wouldn't be able to cache it with a &T).

In terms of implementation, since group_by_mut never calls it's predicate with the same element in both arguments, this should be fine.

I can see how there might be some "weirdness" about passing a mutable reference to a predicate, but if we think that it's a unique reference instead, then it make sense, since the algorithm is designed in such a way that each item passed to the predicate is unique (i.e. both arguments are never the same and won't overlap with yielded slices).

rfcbot commented 10 months ago

:bell: This is now entering its final comment period, as per the review above. :bell:

marcospb19 commented 10 months ago

@Zenithsiz oi Filipe :wave:.

I don't think Fn(&mut T, &mut T) is a great idea because most elements are actually called twice.

0 1 2 3 4
(0 1)
(1 2)
(2 3)
(3 4)

First as the right element, then, as the left element, with the exception of the first and last elements.

If you mutate an element in the previous predicate call, this might lead to a confusing pitfall where the yielded groups do not respect the predicate (as expected).

For the caching usage case you mentioned, I'd say that you can do this instead:

slice.iter_mut().for_each(perform_cache_operation);
// then, call `group_by`
marcospb19 commented 10 months ago

I also like chunk_by better as well, for consistency with our use of "chunks" in other APIs.

Looks like most people do agree.

Do we need to wait for FCP to finish before creating a PR to rename group_by to chunk_by?

rfcbot commented 10 months ago

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

Ten0 commented 8 months ago

As suggested here by the original author, I'm porting my comment to here:

fn linear_group_by_key<F, K>(&self, func: F) -> LinearGroupByKey<T, F>
    where F: FnMut(&T) -> K,
          K: PartialEq

Does not allow:

let strings = vec!["A".to_owned()];
strings.linear_group_by_key(|s| s.as_str());

In order for that to work, we need:

fn linear_group_by_key<'a, F, K>(&'a self, func: F) -> LinearGroupByKey<T, F>
    where F: FnMut(&'a T) -> K,
          K: PartialEq

Since nothing moves in memory it should be possible.

While there is (unfortunately) no by_key variant ATM proposed in the std, I'm guessing that chunk_by might also benefit from having lifetime information about the source slice (e.g. if it's storing the reference somewhere) so the same thing somewhat applies.

That would also match better what is done on binary_search_by (it would feel weird that it works with one and not the other).

SkiFire13 commented 8 months ago

I'm guessing that chunk_by/chunk_by_mut might also benefit from having lifetime information about the source slice

I agree it might be useful for chunk_by, but unfortunately I don't think this is sound for chunk_by_mut

Ten0 commented 8 months ago

I agree it might be useful for chunk_by, but unfortunately I don't think this is sound for chunk_by_mut

Aha you're right, whoops 😅

Yeah so only chunk_by of course. (updated message above)