tidyverse / funs

Collection of low-level functions for working with vctrs
Other
34 stars 7 forks source link

Think about equality #12

Open hadley opened 7 years ago

hadley commented 7 years ago

Vectorised identical()? Vectorised all.equal()?

jennybc commented 7 years ago

I am struggling at this moment with the unpredictability of all.equal() inside map() (I wanted to use map_lgl()). It returns:

Either TRUE (NULL for attr.all.equal) or a vector of mode "character" describing the differences between target and current.

Something more like what purrr's safely() functions do would be more useful. For example, a proper logical vector and a parallel character(?) vector that holds the description of the difference.

hadley commented 7 years ago

Yeah, you need isTRUE(all.equal(x, y)).

Here, I think I was thinking more about the floating point approximation and type coercion parts of all.equal() rather than the descriptive messages.

hadley commented 6 years ago

Another insight: in expressions like x == "a" where you're comparing a vector with a scalar, I would be more performance to coerce the scalar to the same type as the vector, rather than finding a common type and coercing both. Since the pattern is almost always vector == scalar (not scalar == vector) this can probably be done with out double-dispatch.

hadley commented 5 years ago

Remember the existence of vec_proxy_equality() and the C function vctrs_equal(). But vctrs_equal() doesn't have quite the right equality semantics since we'd compare doubles with tolerance, and NA shouldn't equal NA.

Also consider this table:

Vectorised FP tolerance Works with lists NAs equal
all.equal
identical
==
%%
near

At a minimum we need a vectorised all.equal() that coerces both inputs to common type, and applies tolerance when comparing doubles

hadley commented 5 years ago

Do we also need to think about types? i.e. do we want an equal() where equal(1, 1L) is false? (Maybe with a warning to make it clear that function is only useful for comparing objects of the same type)

hadley commented 5 years ago

Also need to think about string encoding - base R takes a very strong opinion that strings are equal if they're equal when re-encoded to UTF-8.

gaborcsardi commented 5 years ago

FWIW, === in JS means "equality w/o coercion", i.e. types, attributes must match. In my mind, this is what identical() should be.

hadley commented 5 years ago

Maybe we should have vec_equal() and vec_identity() for vectorised and scalar comparison, then options for comparing NAs and FP tolerance?

hadley commented 5 years ago

Hmmm, maybe FP tolerance can stay as part of a specialised function? It's a bit hard to know because so many ideas are tangled up in all.equal(): scalar equality, explanation of differences, and FP tolerance.

gaborcsardi commented 5 years ago

Btw. it feels like identical() is not really even a vector thing, since it is not vectorized. By definition it compares objects.

hadley commented 5 years ago

Yeah, agreed, it just kind of falls out here because of lists (since they can contain any object you have to handle everything here)

It does seem like it would be nice to be able to compare two data frames ignoring FP differences, which suggests that tol should be an argument, but I don't think the interpretation should be the same as for all.equal() since that applies to the relative mean difference between two vectors.

So maybe there should be four functions?

gaborcsardi commented 5 years ago

Yeah, equivalent() can be a generic, and the double method can have a tolerance argument, similarly to all.equal().

The data.frame method can have a tolerance argument as well, and it will just forward it to columns that need it (although this seems somewhat magical).

The equivalent() methods can be vectorized by default, maybe?

So it is three functions, sg like

hadley commented 5 years ago

I don't think they need to/should be generic because vctrs calls vec_equality_proxy() which can return an atomic vector, list, or data frame that is used for comparisons internally (we need this for performance of the hashing/equality functions that power vec_unique() etc).

I'd rather make it four functions to keep it totally symmetric (and the vec_ prefix is about vector input/the vctrs package not whether or not the function is vectorised)

lionel- commented 5 years ago

About functions, it's annoying that functions with different srcrefs are not identical() just because they were parsed in different files. Does this mean it should be handled by equivalent()? But will the latter also compare environments? By reference or by contents? It seems that a good principle is that objects should be identical even after a roundtrip of serialisation / unserialisation, so reference semantics are not appropriate (but can be used for shortcircuiting the computation).

gaborcsardi commented 5 years ago

Well, if the srcrefs are not identical, then the objects are not identical, right?

But if equivalent() is a generic, then equivalent.function() can easily default to ignore the srcrefs. Yes, equivalent.environment() would compare environments and these can have arguments to fine-tune how the comparison is done. Etc. For example equivalent.function() will have a srcref = FALSE argument, and equivalent.numeric() will have a tolerance() argument.

Re serialization, I am not sure if that's a good principle. Two objects being identical should not depend on the capabilities of the serialization. E.g. xptrs cannot be serialized, but it still makes sense to ask if two xptrs are identical.

lionel- commented 5 years ago

I think they are identical for almost all purposes. Similarly a bytecoded function and a function are identical for all purposes. Some things are just implementation details. Encoding is supposed to be an implementation detail as well. The address in memory is an implementation detail.

But maybe I should reconsider in terms of equivalent. If equivalent() does compare environments this will be a pretty big departure from all.equal(). By the way there's a naming problem, the most useful function for end users and perhaps even developers has the longer name ;)

lionel- commented 5 years ago

E.g. xptrs cannot be serialized, but it still makes sense to ask if two xptrs are identical.

I wouldn't discard the principle on this basis, developers can use more specialised functions to inspect properties of xptrs.

Edit: I think what I was getting at with the serialisation thing is that equality (or in R terms, identity) properties transcend the session and should survive it.

gaborcsardi commented 5 years ago

If you want an identical() that means "mostly identical", that only works for things that can be serialized, that's fine with me. :) (Although base::identical() is already this function, no?)

I often wish there was a really_identical() operator, though, that works for all R objects, including Xptrs, environments, ALTREP objects, etc. and tests for basically bit-wise equality.

lionel- commented 5 years ago

identical() is mostly that function but has some quirks. You're right it might not be worth it to introduce a new one just to fix it. Also R core has been discussing a replacement for some time so we might get base::identical2 once they reach a consensus.

I think bit-for-bit equality is useful but is also very specialised. It doesn't feel like a vctrs function. It's more of a "I know what I'm doing" function. Will it have the same semantics on different R implementations?

hadley commented 5 years ago

I don't think the principle of equal values/hashes after serialisation round-trip applies here. It's definitely a useful principle, but I think it's a bit too strict for vctrs needs (which are always either within a vector or between two vectors of the same type).

hadley commented 5 years ago

One more idea for naming: I could have obj_equal(), vec_equal(), obj_identical(), and vec_identical()?

A similar distinction is lobstr::obj_addr() vs lobstr::obj_addrs(), but I don't think plurals would work here.