google / go-cmp

Package for comparing Go values in tests
BSD 3-Clause "New" or "Revised" License
4.09k stars 209 forks source link

Add generic less function for SortSlices #67

Open rogpeppe opened 6 years ago

rogpeppe commented 6 years ago

It's a bit of a pain writing a comparison function when we want slices to compare equal regardless of order. How about making SortSlices use a generic less function when it's argument is nil, so that in the common case we can just use SortSlices(nil) to check slice contents regardless of order and type?

The generic less function might be something like this, perhaps: https://play.golang.org/p/NQ1u36jKsi-

dsnet commented 6 years ago

For Bool, Int, Int8, Int16, Int32, Int64, Uint, Uint8, Uint16, Uint32, Uint64, Uintptr, the comparison is obvious. For Float32 and Float64, we have to handle NaN, but that's trivial. For Complex64 and Complex128, we compare them as if they were a struct{Real, Imag float64}. For String, Slice, Array, and Struct, we use lexicographical sorting of their elements/fields. For Ptr and Interface, we compare the object being pointed to; treat nil as less-than non-nil; if types are different, sort by type name (yuck, but deterministic). For Chan, Func, UnsafePointer, the only sensible way to sort these is by pointer. I'm fairly opposed to resorting to pointer comparison since it is highly runtime dependent. It might be better to panic on such comparisons. For Maps, I don't have any idea how to sort that.

My overall thoughts/questions:

\cc @bcmills who originally pushed for cmp to include first-class support for ordering. (cmp was originally called eq, but renamed to allow for the possibility of ordering).

neild commented 6 years ago

Panic on chan and func. I'd just convert UnsafePointer to a uintptr; it's runtime dependent but so are most uintptrs.

Defining an ordering for maps is straightforward, but I don't see the use case. Maybe panic there as well. If you want an ordering, then: Take the union of all keys in both maps, sort it, compare the value for each key, first inequality defines the ordering. An absent key is less than a present one.

jba commented 6 years ago

If the only use case is for order-independent equality, then this feels like an implementation detail. Channels are comparable, so why can't I ask if my []chan ints are equal regardless of order? I'm uncomfortable with this package saying that complex values sort lexicographically (why not by magnitude?), but I wouldn't care what it does when it compares my []complexs order-independently.

So I vote for cmpopts.IgnoreOrder or the like, and leave SortSlices as is (or remove it).

rogpeppe commented 6 years ago

I like SortSlices because it also functions as a good way of displaying differences coherently with cmp.Diff. With cmpopts.IgnoreOrder, what would cmp.Diff print when there's a difference?

For semantics, how about saying that it will sort any values that are comparable with the == operator, and panic otherwise. We can say that the order for any multi-element type (struct, array, complex) is defined lexicographically. Pointer-like values would sort by pointer value - yes, it's highly runtime-dependent, but I don't think that's a problem here because all we really need is consistent order. One other thing we could do is grok values that implement a Less method (similar to what we do with the Equal method for equality), but unfortunately that wouldn't cover time.Time.

That would fit the use cases I've seen historically. It would still be possible to define custom comparison operations if one really wants to order non-comparable values.

How does that sound?

jba commented 6 years ago

IgnoreOrder could work by sorting, so it could produce good diffs. The docs could even say as much. But if those are the only two use cases, there's no need to specify the ordering relation. Have you seen other uses?

rogpeppe commented 6 years ago

IgnoreOrder could work by sorting, so it could produce good diffs. The docs could even say as much.

If that's the case, then I don't really see that spelling it cmpopts.IgnoreOrder rather than cmpOpts.SortSlices(nil) gives much advantage, but YMMV.

I've made a PR (#71) so that we've got something concrete to talk about.

dsnet commented 6 years ago

I've commented on the PR with my concerns, which can be summarized as:

Stepping back for a moment, I think we're solving a harder problem than what is necessary. The original problem only cares about treating a slice of comparable elements as a set. Sorting them and dealing with order is a red herring.

What if we provide the following instead:

// SliceToSets returns an Option that converts all []T to a map[T]struct{},
// where T is any comparable type, which is specified by passing in a value of each type.
func SlicesToHashSets(typ ...interface{}) cmp.Options

Obviously SliceToSets is a terrible name, so let's not bikeshed over it. Saying that this only operates on comparable types, gives a clear definition that it matches the definition of comparability according to Go. Requiring the user to specify what types the operates on alleviates my concerns about composability.

There is an open question whether the transformation should be a set (no duplicates) or a multi-set (duplicates allowed). Using a map as the output implies set, not multi-set.

rogpeppe commented 6 years ago

Using a map as the output implies set, not multi-set.

Other than that, I think that something like a reasonable idea. Unfortunately, I think that's a show stopper in quite a few cases, because we often do care about the total non-unique element count. In particular we almost always care that the result does not have more elements in than we expect, regardless of how many unique elements there are.

But there's a possible workaround:

// SliceToSets returns an Option that converts all []T to map[T]int where
// the resulting map holds a key for each unique element of the slice,
// and the value is the count of those elements.

I'm not sure about the typ specification though, because it doesn't allow the specification of interface types. One possibility might be to specify that actual slice instances be passed in:

cmpopts.SliceToSets([]int{}, []string{})

Another is to lose the implicit filter and just expect the user to filter appropriately. e.g.

cmp.FilterValues(func(_, _ []string) bool { return true}, cmpopts.SliceToSets())

BTW there are a few places that I might have used GenericLess when using sort.Sort (or sort.Slice) in the past, rather than defining a custom struct comparison function. Mostly for semi-throwaway commands, but still perhaps a use case worth bearing in mind. The cmp package is not a heavy dependency.

dsnet commented 6 years ago

I feel somewhat strongly about making the helpers have some form of implicit type filter. My experience with cmpopts inside Google is that users don't give enough thought to which options apply to what types leading to some of the issues that cmp was trying to avoid in the first place. Your suggestion of passing []T seems reasonable to me.

Switching it to map[T]int makes multi-set the default, but make the unique-set case possible but harder. Ideally, we want both to be easy.

Perhaps we could go with an API like the following?

// SortComparable returns an Option that sorts all []V, where V is any type
// assignable to a comparable type T, specified by passing in the slice of that type.
// SortComparable panics if the element type is not comparable.
//
// The exact ordering is unspecified except that it is equivalent to sorting
// according to Hash(v), where Hash returns a unique hash for every value.
//
// The dedup bool controls whether duplicate entries should be collapsed.
func SortComparable(dedup bool, typs ...interface{}) cmp.Option {
    var ts []reflect.Type
    for _, typ := range typs {
        t := reflect.TypeOf(typ)
        if t.Kind() != reflect.Slice {
            panic(fmt.Sprintf("%v must be slice", t))
        }
        t = t.Elem()
        if !t.Comparable() {
            panic(fmt.Sprintf("%v must be comparable", t))
        }
        ts = append(ts, t)
    }
    return cmp.FilterPath(func(p cmp.Path) bool {
        if t := p.Last().Type(); t != nil && t.Kind() == reflect.Slice {
            for _, t2 := range ts {
                if t.Elem().AssignableTo(t2) {
                    return true
                }
            }
        }
        return false
    }, cmpopts.SortSlices(genericLess))
}

The genericLess is the same as the one in your PR, although it might be nice to add some intentional randomization such that each program instance has a different order.

BTW there are a few places that I might have used GenericLess when using sort.Sort (or sort.Slice) in the past, rather than defining a custom struct comparison function.

If there are sufficient use-cases, it might be worth investigating in the future whether to make Less and first-class feature of the cmp package itself. At this present moment, I'd prefer not to expose a generic less function as part of the public API.

rogpeppe commented 6 years ago

That code doesn't implement the dedup parameter, though it wouldn't be hard.

I actually quite like your idea of using maps rather than defining a generic less function. Here's another API possibility:

// SliceToMap converts any slice []T with the same type as any of the arguments
// to a map[T]int with a key for each element of the slice and an
// associated value holding the number of slice elements with that key.
//
// It panics if a slice element is not comparable.
func SliceToMap(sliceTypes ...interface{}) cmp.Option

// SliceToSet is like SliceToMap except that the resulting map
// is of type map[T]struct{}. That is, the elements are treated
// as a set rather than a multi-set.
func SliceToSet(sliceTypes ...interface{}) cmp.Option

Then the de-duplication mechanism is clear, there's no arbitrary ordering, and we don't need to remember the sense of the dedup bool parameter. It also makes it possible to specify frequency counts directly as a map[T]int although I'm not sure how useful that actually is.

jba commented 6 years ago

What is the point of SliceToMap except to compare slices order-independently? It's just a roundabout way of writing IgnoreOrder.

rogpeppe commented 6 years ago

What is the point of SliceToMap except to compare slices order-independently? It's just a roundabout way of writing IgnoreOrder.

To my mind the name SliceToMap makes it clear how it works, and hence how to make sense of the information printed by Diff and how it fits within the Transformer/Comparer/filter framework documented in Equal.

If we have something named IgnoreOrder, is that a Comparer? A Transformer? A filter? How does it interact when combined with other cmp options? It sounds like it should be a Comparer, but if it really is a Comparer, then the Diff results won't be very useful, so it really has to be a Transformer, but then it becomes important to document what it transforms into.

Having said that, spelling SliceToMap as IgnoreOrder and SliceToSet as IgnoreOrderUnique, say, could still work OK.

I guess it depends on whether one should name something after what it does or how it is intended to be used. I can see arguments both ways.

dsnet commented 6 years ago

The consistent style so far is:

Since SliceToMap is implemented with a cmp.Transformer, it falls in the third category. Perhaps a better name could be BucketizeComparable or BucketizeElements

In regards to two separate functions (SliceToMap and SliceToSet), this is minor, but I don't feel like this functionality is common enough and the distinction large enough to warrant two separate functions. I think a dedup bool is probably good enough. Thus, if dedup is true, then it returns map[T]struct{} else returns map[T]int.

dieseldesai commented 6 years ago

Hi gophers, I stumbled upon this thread when searching for a solution to the same problem: I want to compare equality of two slices without order. cmp.SortSlices works, but is a bit of a pain when you don't care about the actual order (as described by the OP). When comparing slices of structs with multiple fields, the less function can grow to be large and unruly.

A lesser issue is that sorting slices takes O(n log n) time, whereas just comparing set equality should be linear using a map as a multi-set:

The other thing I'd add is that objects within a slice need not be comparable (i.e. less function doesn't necessarily have to be defined), they only need to be distinguishable. I would think of this as similar to overriding the hashcode/equals methods in a java object to place it into a HashSet, but not necessarily implementing Comparable.

If I'm understanding the previous messages correctly, it looks like SliceToMap would address all of these concerns.

For now, I've written a custom method in my code to do this, but it would be nice to have some standard method. FWIW python has assertCountEqual built-in to the unittest library, which I see used pretty often in tests.

dsnet commented 6 years ago

The other thing I'd add is that objects within a slice need not be comparable (i.e. less function doesn't necessarily have to be defined), they only need to be distinguishable.

Being "comparable" is necessary for the approach that places all these objects in a map as the key.

I would think of this as similar to overriding the hashcode/equals methods in a java object to place it into a HashSet, but not necessarily implementing Comparable.

That would be interesting, but is beyond the scope of this issue. Unless go2 enables users to provide user-defined equality and hashability for maps, we shouldn't support this in cmp. The original issue is https://github.com/golang/go/issues/283, but has since been subsumed into the generics proposal https://github.com/golang/go/issues/15292.

dsnet commented 4 years ago

Now that cmp has been available for a few years and there's been more experience with it. Is this option something still worth providing?

rogpeppe commented 4 years ago

FWIW I use a generic ordering function all the time - I ended up putting it into the quicktest package with this ugly and slow, but simple hack:

cmpopts.SortSlices(func(x, y interface{}) bool {
    return pretty.Sprint(x) < pretty.Sprint(y)
}))

It would be nice to do a bit better, I think. Even the fmt package has a generic sorting algorithm for map keys now - maybe we could just copy that?

dsnet commented 4 years ago

Thanks for the reference. I'll scan all usages of cmpopts.SortSlices within google and also in the public module proxy and see how often SortSlices is applied on a type that could benefit from generic comparison.

andrewloux commented 3 years ago

Just voicing an opinion here, moving over from https://godoc.org/github.com/stretchr/testify/assert to cmp, not having an option to ignore order for slices has been a real pain-point of the move.

dsnet commented 3 years ago

I was thinking about this more and I wanted to note a complication with generic less functions.

Suppose you had:

type S struct {
    F1, F2 int
} 

var ss1, ss2 []S = ...
cmp.Equal(ss1, ss2,
    cmpopts.SortSlices(GenericLess(S{})),
    cmpopts.IgnoreFields(S{}, "F1"),
)

There's a subtle problem here. The less operation used by cmpopts.SortSlices needs to be aware of any other options passed to cmp that may affect the semantics of equality. In this specific case, there's another option that ignores field "F1" and that needs to be taken into consideration when performing a sort, otherwise this semantically won't work as expected (it might sometimes get it right, and it might not; its flaky).

To make this work, we would probably need 1) some way to plumb options down to other options, and 2) generic less as a first-class feature in cmp itself. It's possible to define a generic less (e.g., cmp.Less), but a complication arises from the presence of Equal methods and Comparer options. Such functionality provides no means to derive ordering information. If there was a cmp.Less function, this property should hold:

(!cmp.Less(x, y, opts...) && !cmp.Less(y, x, opts...)) == cmp.Equal(x, y, opts...)
dsnet commented 3 years ago

Another subtle complication: it's possible today that cmp.Equal can break the transitive property that one would normally expect for equality.

That is, if a == b and b == c, then one would expect a == c. However, this property does not hold due to the use of approximate equality. For example, let a = 0.5s; b = 0.9s; c = 1.3s and we had some approximate comparison that treated two values as equal if they were within 0.5s of each other. In such a case, a != c since they differ by 0.8s. The fact that the transitive property is violated has significance on how two sequences of unordered elements are ordered in order to compare them as multi-sets.