Closed: Kerollmops closed this 7 months ago.
It's probably too late, but would you consider renaming these functions to a more fitting name like `split_on_change` or `group_on_change`? A `group_by` arguably ought to give the transitive closure of a relationship.
Just an alternative name suggestion: `split_by_neighbour`.
I think that `split_on_change` and `split_by_neighbour` are confusing, because the other `split_*` iterators skip elements, except for `split_inclusive*`.
Here a "group" is a subslice of contiguous elements; each subslice is delimited by the element(s) where the predicate fails. Is there another important definition of a group that should be prioritized over this one?
Minor thing, but I would prefer some indication that the elements form an adjacent/contiguous subslice. Is that a common definition of a group? Some names I thought about:
`group_subslices`
`group_contiguous`
`subslice_by`
`group_subsequences`
> Is there another important definition of a group that should be prioritized over this?

I usually think of a group as a set of elements, not necessarily contiguous. This is the meaning it has, for example, in Java and SQL.
In the original RFC someone proposed calling them `chunks_by` or `split_by`, which IMO seems more fitting. Between the two I would go for `chunks_by`; `split_by` seems to overlap with the Needle API, although you could also argue that this may be part of it.
Yes, `group_by` in most languages (and in my own macros in C/C++, for example) does not require a contiguous range but rather operates over the entire width of the iterator.
In response to the comment about Java: the common approach in that language is to apply a `collect(___)` at the end of a stream (iterator) and, inside the parens, choose the type of grouping, such as `Collectors.toList()` or `Collectors.toMap(key_fn, value_fn)`.
I'm looking at this issue after some confusion with itertools' `group_by` on a Vec, which behaves the same way (it groups only contiguous elements), and I was wondering whether the naming of the std proposal was better.
So I'm adding my voice here for a change of name: `group_by` doesn't evoke the fact that only contiguous elements are grouped. `group_contiguous` from @marcospb19's suggestions would feel clearer to me.
A couple more candidates:
`runs_by`
`streaks_by`
For me the name is perfect, it has the same meaning as Python's groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby Looking forward to this!
Not to argue with you as you’re just expressing your subjective opinion, but it got me thinking about it more deeply.
I'm not a linguist, but I think a function with `group` in its name should produce groups by a particular criterion that doesn't have anything to do with ordering (kind of like SQL's GROUP BY clause). And what Python's `groupby` (which I think was inspired by Haskell) does is more appropriately described as splitting the sequence, not grouping its elements.
I would like to use objective criteria here and not historical reasons. Are there any linguists or mathematicians here?
By the way, I have almost never seen Python's `groupby` used without sorting the sequence first. I mean, its main use IMHO is to produce equivalence classes from the sequence.
Mathematician here: I don't think you're going to get much from math on `groupBy`.
The clearest parallel, in my 2c, is to the `str::split` series, which returns an iterator over slices. My proposal is to invert the sense of the predicate and make it `split_between`. `group_between` also seems reasonable.
D calls this operation `chunkBy`, although Rust already uses "chunk" to mean a fixed group of N elements in `std::slice`, IIRC.
Re @purpleP's comment about language and meaning: I agree that "group" does not semantically convey precisely what's happening. `chunk_by` IMO better conveys that the operation will "fail" if the Vec is not already sorted. However, given naming constraints (`chunk` already having a separate meaning in Rust) and other languages' naming conventions, `group_by` makes the most sense at this time.
PS: Looking forward very much to this feature, as I'm porting functional-style D to Rust; what's blocking besides naming?
[0] https://dlang.org/library/std/algorithm/iteration/chunk_by.html
In my opinion `split_by` (or some other variation starting with `split`) is the one that makes the most sense.
`split`, because we already have `split`, which has very similar semantics to the proposed method. The only difference between them is that `split` takes an `FnMut(&T) -> bool` while the proposed method takes an `FnMut(&T, &T) -> bool`. `_by`, because we already have other methods that end in `_by` and take a closure over two consecutive elements, the clearest example being `Vec::dedup_by`. Following this naming, a `_by_key` variant would also make sense.
I would like to voice against `group_by`, because even though some languages use it for this exact same function, others don't. Many people have the expectation that a group is not limited to consecutive elements, and that expectation would be broken if this method were named `group_by`.
The current code examples (the first one in particular) aggravate this naming issue even more by making it look like the group will contain all the elements that match the predicate, matching the expectation of users whose previous language gave another meaning to `group_by`.
> although Rust already uses chunk to mean fixed group of N elements in std::slice IIRC.

That's not entirely true: `chunks` can return a slice of fewer than N elements in the last iteration. And if you take away the fixed-size constraint, then the meaning is the same, just an iterator yielding subslices.
I'm finding this function quite useful in my code.
It's similar to a split, but it takes a function (predicate) with two arguments (dyadic) instead of a function with one argument. So if you don't like the Python-inspired "group_by" name (which takes a single-argument predicate), do you like `split_on_adjacent`? :-)
Regarding the implementation, it currently performs a linear scan:

`if (self.predicate)(l, r) { len += 1 } else { break }`

When (after the sort) the runs are long enough, a more aggressive strategy could be faster: something like the galloping of the D language (exponential growth followed by binary search), or arithmetic growth followed by binary search. This can't work if the input slice isn't sorted.
Hey @leonardo-m, if you need something more polished than this nightly method in the standard library, I have also worked on a crate with something similar to the galloping algorithm you describe here. The library is called `slice-group-by` and is available on crates.io.
I've just tried that crate and it works well (in my case it gives no speedup, probably because my runs are generally very short).
Another note: inside `next()` of `GroupBy` there's this line:

`let (head, tail) = self.slice.split_at(len);`

Inside `slice::split_at` there's this assert:

`assert!(mid <= self.len());`

I've seen that this assert isn't removed in the asm of my `group_by` usages. I think it's because of this missed LLVM optimization: https://users.rust-lang.org/t/hare-and-tortoise-indexes-analysis/63985
So are benchmarks telling us that it is performance-wise worth giving a hint (inside `next()` of `GroupBy`) to help LLVM remove that assert on `mid`?
Note that with a carefully placed `assert!` you can convince LLVM to remove any unnecessary bound check: https://godbolt.org/z/c4TxMM9Tr
SkiFire13, although I have some experience with optimizing Rust code and finding ways to remove bound tests, your code feels a bit magical. Very nice. I think that change is worth pushing into the Rust stdlib.
One thing that I needed in my use case, in addition to what `GroupBy` (and the `slice-group-by` crate) provide, is access to the remainder of the iterator at any point in time, i.e. an accessor for the `GroupBy::slice` field.
The chunks are slices and do not overlap, so the name `chunks_by` / `chunks_by_mut` is a better one.
I wanted to sort items from an iterator into groups, so I searched "rust group iterator", found these methods, and was extremely confused for about 90 seconds until I had processed that these do something entirely different from what I was trying to do.
Thinking about this for a bit, `find_runs` or `runs_by` would be better names, in my opinion at least.
Taking a cue from more-itertools, call it `split_when` and negate the predicate.
Can't believe we've been stuck on naming a function for 2.5 years.
Do you have a solution?
> Do you have a solution?

If indeed all that remains is to select a name, the solution is for someone with the authority to do so to make an executive decision. If there is no such person with authority, that points to a larger organizational problem.
A summary of the history of this function is included at the top of this tracking issue.
A summary of the API is included at the top of this tracking issue.
The Wasmtime project is using the `slice-group-by` library and, more specifically, what this PR introduces: the `group_by` version. I maintain this crate, and it is downloaded more than 297k times per month.
I think, at this point, we can reasonably stabilize this function. We have evidence that people want this function. Shall we stabilize it on the slice type?
@Kerollmops Thank you for the stabilization report!
Reviewing the long discussion on naming, here and on the original RFC, I think the two names that make the most sense are `group_by` (precedent in Python, among others) and `chunk_by` (similarity to the existing use of "chunk" in Rust, and precedent in Ruby and D).
Stating in advance: I am fine with either name and have no objection to stabilizing under either name.
With that caveat stated, I personally think `chunk_by` is more intuitive, based on its similarity with numerous other "chunk" functions in Rust. I'm going to propose that we stabilize under the name `chunk_by`.
@rfcbot merge
Team member @joshtriplett has proposed to merge this. The next step is review by the rest of the tagged team members:
No concerns currently listed.
Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
See this document for info about what commands tagged team members can give me.
I also like `chunk_by` better, for consistency with our use of "chunks" in other APIs.
> I also like `chunk_by` better, for consistency with our use of "chunks" in other APIs.

It also semantically hints that it would ~~fail~~ not return expected results if the collection is not ~~sorted~~ ordered as assumed, whereas "group" doesn't carry this implication.
> it would fail if the collection is not sorted

The proposed functions do not make any assumptions about ordering. `chunk_by` communicates that matched elements are sequential; is this what you meant?
I apologize for imprecision. I've updated my comment to reflect my intention. (namely, that "group by" could carry an implicit meaning that it might not operate sequentially, as you suggest)
Would it be possible for `group_by_mut` to allow an `Fn(&mut T, &mut T)` instead? I don't have a specific use case, but looking at `Vec::retain` vs `Vec::retain_mut`, there are use cases for predicates receiving unique references.
For example, a type might need to cache a result to use in its `PartialEq` impl, which might require `&mut` to do efficiently without interior mutability. This would allow not having to calculate this cached value up front before `group_by_mut`, as well as not having to do the expensive calculation twice during `group_by_mut` (since it wouldn't be able to cache it with a `&T`).
In terms of implementation, since `group_by_mut` never calls its predicate with the same element in both arguments, this should be fine.
I can see how there might be some "weirdness" about passing a mutable reference to a predicate, but if we think of it as a unique reference instead, then it makes sense, since the algorithm is designed in such a way that each item passed to the predicate is unique (i.e. both arguments are never the same and won't overlap with yielded slices).
:bell: This is now entering its final comment period, as per the review above. :bell:
@Zenithsiz hi Filipe :wave:.
I don't think `Fn(&mut T, &mut T)` is a great idea, because most elements are actually passed to the predicate twice.
```text
elements:   0    1    2    3    4
calls:     (0 ,  1)
                (1 ,  2)
                     (2 ,  3)
                          (3 ,  4)
```

Each element appears first as the right argument, then as the left argument, with the exception of the first and last elements.
If you mutate an element in a previous predicate call, this might lead to a confusing pitfall where the yielded groups do not respect the predicate as expected.
For the caching use case you mentioned, I'd say that you can do this instead:

```rust
slice.iter_mut().for_each(perform_cache_operation);
// then, call `group_by`
```
> I also like `chunk_by` better, for consistency with our use of "chunks" in other APIs.

Looks like most people do agree. Do we need to wait for the FCP to finish before creating a PR to rename `group_by` to `chunk_by`?
The final comment period, with a disposition to merge, as per the review above, is now complete.
As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.
This will be merged soon.
As suggested here by the original author, I'm porting my comment to here:

```rust
fn linear_group_by_key<F, K>(&self, func: F) -> LinearGroupByKey<T, F>
where
    F: FnMut(&T) -> K,
    K: PartialEq
```

does not allow:

```rust
let strings = vec!["A".to_owned()];
strings.linear_group_by_key(|s| s.as_str());
```

In order for that to work, we need:

```rust
fn linear_group_by_key<'a, F, K>(&'a self, func: F) -> LinearGroupByKey<T, F>
where
    F: FnMut(&'a T) -> K,
    K: PartialEq
```

Since nothing moves in memory, it should be possible.
While there is (unfortunately) no `by_key` variant proposed in `std` at the moment, I'm guessing that `chunk_by` might also benefit from having lifetime information about the source slice (e.g. if the closure is storing the reference somewhere), so the same thing somewhat applies.
That would also better match what is done in `binary_search_by` (it would feel weird that it works with one and not the other).
> I'm guessing that `chunk_by`/`chunk_by_mut` might also benefit from having lifetime information about the source slice

I agree it might be useful for `chunk_by`, but unfortunately I don't think this is sound for `chunk_by_mut` (the closure could stash `&'a mut T` references that alias the `&'a mut [T]` subslices the iterator later yields).
> I agree it might be useful for `chunk_by`, but unfortunately I don't think this is sound for `chunk_by_mut`

Aha, you're right, whoops 😅. Yeah, so only `chunk_by`, of course. (updated message above)
Feature gate: `#![feature(slice_group_by)]`

This is a tracking issue for the `GroupBy` and `GroupByMut` iterators.

This feature exposes the `group_by` and `group_by_mut` methods on the slice and mutable slice types; these methods return the `GroupBy` and `GroupByMut` iterator structs respectively. Those two iterators return subslices of the original slice where a user-defined function returns `true` for two following elements of the slice.

Public API
These methods can return subslices that contain equal elements:
They can also be used to extract the sorted subslices:
Steps / History
Unresolved Questions
Should we keep the name `group_by`? Or should we reserve that name for another higher-level combinator? https://github.com/rust-lang/rfcs/pull/2477#issuecomment-753905571