r-lib / vctrs

Generic programming with typed R vectors
https://vctrs.r-lib.org
Other
289 stars 66 forks source link

`nomatch` parameter for `vec_match()` #568

Open lionel- opened 5 years ago

lionel- commented 5 years ago

Useful for this idiom that takes advantage of 0 indices being ignored:

idx <- match(c("foo", "a", "bar", "z"), letters, nomatch = 0)
letters[-idx]
lionel- commented 5 years ago

I even wonder if 0 is not a better default. What idioms are facilitated with NA mismatches?

lionel- commented 5 years ago

Actually the pattern above doesn't work robustly because if idx is a single 0, the negation doesn't work (0 and -0 are the same) and we end up with an empty vector. It works for positive selections though.

lionel- commented 5 years ago

This is the same issue with letters[-idx] when idx is a vector of positions of variable length that might be empty.

In tidyselect we have a similar issue, if the first selection is negative we start from the whole set. We can't know just from the value if it is negative because the first selection input might evaluate to an empty vector. So we also look symbolically for a - call.

It may be helpful to provide vec_index_is_empty() that would return TRUE if length is zero or if all elements are 0.

lionel- commented 5 years ago

It may be helpful to provide vec_index_is_empty() that would return TRUE if length is zero or if all elements are 0.

It would also be helpful to have vec_index_invert():

vec_index_invert <- function(x) {
  if (vec_index_is_empty(x) {
    TRUE
  } else {
    -x
  }
}

letters[vec_index_invert(idx)]
lionel- commented 5 years ago

Also look into equivalent of incomparables (always returns no-match value), and into a wildcard parameter. The wildcard would always match. For data frames, these parameters could be specified via a named list.

x <- tibble::tibble(
  value = 1L,
  names = "bar"
)
y <- tibble::tibble(
  value = c(1L, 1L, 1L),
  names = c("foo", "", "bar")
)

vctrs::vec_match(x, y)
#> 3L

vctrs::vec_match(x, y, wildcard = list(names = ""))
#> 2L
DavisVaughan commented 4 years ago

It would also be nice to have na_behavior = "propagate". It feels like this is the same as na_matches = "never" from #718. Combined with nomatch = -1 we could tell an NA apart from a nomatch, like in this example:

library(vctrs)

vec_match(
  c("x", "y", NA, "z"),
  c("x", "y")
)
# [1] 1  2 NA NA

vec_match(
  c("x", "y", NA, "z"),
  c("x", "y"),
  na_behavior = "propagate",
  no_match = -1
)
# [1] 1  2  NA  -1

I think I would've been able to use this when casting a character to a factor. I need to know if all the unique character values are in the set of levels in the factor I'm casting to, but NA values are okay. So I would vec_match(chr, levels, na_behavior = "propagate", no_match = -1) and then loop through the result looking for -1 values, erroring if I find any.

I also think a nice extension might be vec_all_in(needles, haystack, na_behavior = "propagate") to do this in a more memory efficient way

lionel- commented 4 years ago

I like this interface. But we should come up with something consistent with vec_equal().

Maybe na = c("propagate", "equal", "error")?

vec_match(NA, 1, na = "equal"))
vec_equal(NA, 1, na = "error"))

This would provide additional error checking for the case where NA is unexpected.

WDYT @DavisVaughan @hadley?

DavisVaughan commented 4 years ago

Unrelated to your comment @lionel- , but I also realized recently that incomparables from match() doesn't work with NA values like I want it to in the point above. It gives the nomatch value, when I really want to propagate

match(
  c("x", "y", NA, "z"),
  c("x", "y"),
  nomatch = -1,
  incomparables = NA_character_
)
#> [1]  1  2 -1 -1
DavisVaughan commented 4 years ago

I think you'd have to default to na = "equal" as the first argument there. It is the current behavior right?

lionel- commented 4 years ago

I think the default is to propagate?

DavisVaughan commented 4 years ago

I think I like:

vec_match(
  needles,
  haystack,
  ...,
  na = c("equal", "propagate", "error"),
  no_match = NULL,
  unmatchable = NULL
)

I think my main question is: What does na = "propagate", unmatchable = NA, no_match = -1 do? Of na and unmatchable, who wins? I guess I'd expect the unmatchable to win, since that was set explicitly, so NA values would result in -1

lionel- commented 4 years ago

oh I see what you mean now, they have different defaults:

vctrs::vec_match(NA, NA)
#> [1] 1

vctrs::vec_equal(NA, NA)
#> [1] NA

What does na = "propagate", unmatchable = NA, no_match = -1 do? Of na and unmatchable, who wins? I guess I'd expect the unmatchable to win, since that was set explicitly, so NA values would result in -1

hmm to me it's na that is set explicitly, and unmatchable might come from a variable possibly containing NA. So if na = "error", I would expect NA values to cause an error even if unmatchable. And if na = "propagate", you're sure to get the NA back as well.

Another way to see it is that you can only match values that can compare equal. So "unmatchable" does not apply if na is not "equal".

DavisVaughan commented 4 years ago

Another way to see it is that you can only match values that can compare equal

Oh this makes sense, you're right

hadley commented 4 years ago

Just to be clear vec_match(NA, NA) and vec_equal(NA, NA) should return different values because base R handles NAs different in matching and equality.

I don't particularly like the proposal for na = "propagate" because I don't think this is about propagation; it's about whether or not NA matches or equals NA. I would prefer vec_match(x, y, na_equal = FALSE). I don't see the need for na = "error" at this time.

I'd also prefer not to introduce an unmatchable argument at this time; you can work around in a straightforward way (by doing the matching yourself), and I'd prefer not to introduce this complexity in such a fundamental function until we have multiple indicators of its usefulness.

I think we could default no_match to NA and require that it always be a length one integer.

lionel- commented 4 years ago

I always found the na_equal argument of vec_equal() confusing to think about, it makes me pause every time I try to use it. I think it's even more confusing with vec_match().

Thinking of it as the alternative between "propagating missingness" or "comparing equal" makes it much more intuitive to me. This is why I prefer na = c("propagate", "equal").

Also I think it's useful to have the "error" alternative for type checking.

hadley commented 4 years ago

Given that it exists invec_equal(), it makes sense to me, and we have no current need for errors, I would really rather keep this as simple as possible for now. That will let us resolve the dplyr issue and we can come back to a richer API in the future.

lionel- commented 4 years ago

I think what doesn't make sense to me is this:

vec_equal(NA, NA, na_equal = FALSE)
#> [1] NA

If na doesn't compare equal, it should be false, not NA. The behaviour when na_equal is FALSE is really that they propagate, not that they don't compare equal. This is an important distinction because it changes the return type of the function. The function is clearer when the behaviour has a name.

I would really rather keep this as simple as possible for now. That will let us resolve the dplyr issue and we can come back to a richer API in the future.

Ok, no worries. I thought it might make sense to update the API before introducing it in vec-match.

hadley commented 4 years ago

Oh interesting, because that does make sense to me 😄 To me na_equal = TRUE, implies that NA == NA is TRUE; na_equal = FALSE doesn't imply anything about that value. (I'm not explaining this very well, I'll try again later)

lionel- commented 4 years ago

na_equal = FALSE doesn't imply anything about that value

This makes sense, but then it means that it falls back to the default behaviour, which is to "propagate" NA. I think naming this behaviour and the return type it implies clarifies the function.

DavisVaughan commented 4 years ago

If na doesn't compare equal, it should be false, not NA

FWIW this also confuses me every time I use vec_equal()

hadley commented 4 years ago

Let's come back to that again in the future. Fortunately there are few times that you don't need the default so I don't think a somewhat confusing argument spec is too harmful here.