proposal: slices: partition slice into two slice by predicate

leaxoy commented 2 weeks ago

Proposal Details

In some scenarios, we need to divide the slice into two sub-slices that match and do not match by predicate, so it is proposed to provide the Partition and PartitionInPlace function.

Here are implementations

Partition: Alloc new slice, and keep order, don't modify the original slice.

// Partition split the original slice into two slice.
// The elements of the first slice all satisfy the predicate.
// The elements of the second slice none satisfy the predicate.
func Partition[Slice ~[]E, E any](s Slice, f func(E) bool) (S, S) {
    if s == nil {
    return nil, nil
}
truthSlice := make([]E, 0)
falseSlice := make([]E, 0)
for _, x := range s {
    if f(x) {
        truthSlice = append(truthSlice, x)
    } else {
        falseSlice = append(falseSlice, x)
    }
}
return truthSlice, falseSlice
}

PartitionInPlace: No alloc slice, also not keep order, modify original slice.

func PartitionInPlace[E any, S ~[]E](s S, f func(E) bool) (true, false S) {
if s == nil {
    return nil, nil
}

i, t := 0, len(s)-1
for i < t {
    if !f(s[i]) {
        i++
        continue
    }

    s[i], s[t] = s[t], s[i]
    t--
}
return s[:i], s[t:]
}

Usecases

Here is a simple search in github: https://github.com/search?q=%2Ffunc%5Cs*%28%28P%7Cp%29artition%7C%28S%7Cs%29plit%29.*func%5C%28.*%5C%29+bool%5C%29%5Cs*%5C%28%2F+language%3AGo+++NOT+is%3Aarchived+NOT+is%3Afork+&type=code

It may not be an exact match, but it is enough to prove that the usage scenarios of partition are rich enough.

Also includes the following situations, which cannot be quickly retrieved by github search:

There should be other function names that do similar things
There are forms that do similar things directly instead of defining functions

Some specific examples:

gopherbot commented 2 weeks ago

Change https://go.dev/cl/585018 mentions this issue: slices: add Partition

septemhill commented 2 weeks ago

It seems could only be two parts. Why don't we use GroupBy instead ?

seankhliao commented 2 weeks ago

how common is this operation? "some scenarios" is not strong justification

leaxoy commented 2 weeks ago

how common is this operation? "some scenarios" is not strong justification

@seankhliao

Here is a simple search in github: https://github.com/search?q=%2Ffunc%5Cs*%28P%7Cp%29artition.*bool%5C%29%2F+language%3AGo+&type=code

It may not be an exact match, but it is enough to prove that the usage scenarios of partition are rich enough.

Some specific examples:

leaxoy commented 2 weeks ago

It seems could only be two parts. Why don't we use GroupBy instead ?

@septemhill

Two main reason:

Partition is convenient than GroupBy in the case of two classifications
Usually groupby will return map, so I'm not sure if it makes sense to introduce map as a return value in slices. Before that, slices and maps didn't interact with each other.

seankhliao commented 2 weeks ago

That search result contains a lot of results irrelevant to this proposal a more accurate search may be something like https://github.com/search?q=%2Ffunc%5Cs*%28P%7Cp%29artition.*func%5C%28.*%5C%29+bool%5C%29+%5C%28%2F+language%3AGo+++NOT+is%3Aarchived+NOT+is%3Afork+&type=code

leaxoy commented 2 weeks ago

In addition, I also found that the split function also has similar functions. https://github.com/search?q=%2Ffunc%5Cs*%28%28P%7Cp%29artition%7C%28S%7Cs%29plit%29.*func%5C%28.*%5C%29+bool%5C%29%5Cs*%5C%28%2F+language%3AGo+++NOT+is%3Aarchived+NOT+is%3Afork+&type=code

And there should be other function names that do similar things
And there are forms that use similar functions directly instead of defining functions

adonovan commented 2 weeks ago

Potential optimization: allocate a bit vector containing f(i) for every index; then allocate a single array of length n and copy the true elements into the bottom and the false elements into the top; finally, return two cap-limited slices of this array. This would eliminate reallocations in append and reduce the amortized allocation overhead from 12.5% (append's average waste) to 1 bit per element (1.5% for a word-sized element, less for larger elements).

[Update: on reflection, a 12.5% allocation improvement doesn't seem worth the extra complexity. Never mind.]

go101 commented 2 weeks ago

It element order is not important ([edit] and the input one is allowed to be changed), then no allocation is needed.

leaxoy commented 2 weeks ago

Potential optimization: allocate a bit vector containing f(i) for every index; then allocate a single array of length n and copy the true elements into the bottom and the false elements into the top; finally, return two cap-limited slices of this array.

@adonovan This give me an idea to partition in place and return the index of the dividing point. I have seen this practice in rust, called as PartitionInPlace.

@go101 And if I understand correctly, what you expressed can also be considered as PartitionInPlace

DeedleFake commented 2 weeks ago

I wrote a Partition() function in an internal package in a project a little while back. I wound up removing it more recently though, mostly because it just wasn't useful anymore. It seems to me that an allocating version would make more sense in x/iter, while the slices implementation should be in-place.

For the fun of it, here's a zero-allocation implementation that runs in O(n) but does not preserve ordering.

Edit: Here's a version that does preserve ordering.

Edit 2: Version 3, now with overall easier-to-read code but annoyingly repetitive condition checks.

Edit 3: I wasn't paying attention to the elements of the true slice. Their order is not preserved. Whoops. I think it's possible to preserve it by keeping track of the index of the first one and reversing the sub-slice, but that'll reduce the efficiency just slightly.

leaxoy commented 2 weeks ago

I wrote a Partition() function in an internal package in a project a little while back. I wound up removing it more recently though, mostly because it just wasn't useful anymore. It seems to me that an allocating version would make more sense in x/iter, while the slices implementation should be in-place.

In fact, I write my version of Partition in my iter package, but is use a nested function as Seq, because I didn't find an efficient way to create seq without using slice.

Below is my implementations in my packgae iter. Here is my first version implementation:

func appendSeq[E any](s Seq[E], x E) Seq[E] {
    return func(yield func(x E) bool {
        for e := range s {
            if !yield(e) {
                return
            }
        }
        yield(x)
    })
}

func Partition[E any](s Seq[E], f func(E) bool) (Seq[E], Seq[E]) {
    true:=func(yield func(E) bool) {}
    false:=func(yield func(E) bool) {}
    for x := range s {
        if f(x) {
            true = appendSeq(true, x)
        }else {
            false = appendSeq(false, x)
        }
    }
    return true, false
}

And the second implementation: Partition, it iterate over Seq twice, so it is not available for non-repeatable Seq(like: io stream).

For the fun of it, here's a zero-allocation implementation that runs in O(n) but does not preserve ordering.

I think we should reverse the version that keep order. Leave the choice to the user.

Finally, your comments are quite valuable and I will integrate these versions of implementation into the original proposal

go101 commented 2 weeks ago

Will iter become another feature which is abused in Go (other than channels)? :D

leaxoy commented 2 weeks ago

Following the above discussion, we will have two variants:

func Partition[S ~[]E, E any](s S, f func(E) bool) (S, S)

func PartitionInPlace[S ~[]E, E any](s S, f func(E) bool) (S, S)

Is there anything else that needs to be discussed? If not, I will update the PR later to reflect the latest changes.

golang / go