Open lionel- opened 5 years ago
I even wonder if 0
is not a better default. What idioms are facilitated with NA
mismatches?
Actually the pattern above doesn't work robustly because if idx
is a single 0
, the negation doesn't work (0 and -0 are the same) and we end up with an empty vector. It works for positive selections though.
This is the same issue with letters[-idx]
when idx
is a vector of positions of variable length that might be empty.
In tidyselect we have a similar issue, if the first selection is negative we start from the whole set. We can't know just from the value if it is negative because the first selection input might evaluate to an empty vector. So we also look symbolically for a -
call.
It may be helpful to provide vec_index_is_empty()
that would return TRUE
if length is zero or if all elements are 0
.
It may be helpful to provide vec_index_is_empty() that would return TRUE if length is zero or if all elements are 0.
It would also be helpful to have vec_index_invert()
:
vec_index_invert <- function(x) {
if (vec_index_is_empty(x) {
TRUE
} else {
-x
}
}
letters[vec_index_invert(idx)]
Also look into equivalent of incomparables
(always returns no-match value), and into a wildcard
parameter. The wildcard would always match. For data frames, these parameters could be specified via a named list.
x <- tibble::tibble(
value = 1L,
names = "bar"
)
y <- tibble::tibble(
value = c(1L, 1L, 1L),
names = c("foo", "", "bar")
)
vctrs::vec_match(x, y)
#> 3L
vctrs::vec_match(x, y, wildcard = list(names = ""))
#> 2L
It would also be nice to have na_behavior = "propagate"
. It feels like this is the same as na_matches = "never"
from #718. Combined with nomatch = -1
we could tell an NA
apart from a nomatch
, like in this example:
library(vctrs)
vec_match(
c("x", "y", NA, "z"),
c("x", "y")
)
# [1] 1 2 NA NA
vec_match(
c("x", "y", NA, "z"),
c("x", "y"),
na_behavior = "propagate",
no_match = -1
)
# [1] 1 2 NA -1
I think I would've been able to use this when casting a character to a factor. I need to know if all the unique character values are in the set of levels
in the factor I'm casting to, but NA
values are okay. So I would vec_match(chr, levels, na_behavior = "propagate", no_match = -1)
and then loop through the result looking for -1
values, erroring if I find any.
I also think a nice extension might be vec_all_in(needles, haystack, na_behavior = "propagate")
to do this in a more memory efficient way
I like this interface. But we should come up with something consistent with vec_equal()
.
Maybe na = c("propagate", "equal", "error")
?
vec_match(NA, 1, na = "equal"))
vec_equal(NA, 1, na = "error"))
This would provide additional error checking for the case where NA
is unexpected.
WDYT @DavisVaughan @hadley?
Unrelated to your comment @lionel- , but I also realized recently that incomparables
from match()
doesn't work with NA values like I want it to in the point above. It gives the nomatch
value, when I really want to propagate
match(
c("x", "y", NA, "z"),
c("x", "y"),
nomatch = -1,
incomparables = NA_character_
)
#> [1] 1 2 -1 -1
I think you'd have to default to na = "equal"
as the first argument there. It is the current behavior right?
I think the default is to propagate?
I think I like:
vec_match(
needles,
haystack,
...,
na = c("equal", "propagate", "error"),
no_match = NULL,
unmatchable = NULL
)
vec_cast_common(needles, haystack, no_match, unmatchable)
no_match
has to be size 1 if not nullunmatchable
can be any size. If needles
is a data frame, you could provide a data frame here where each row is an unmatchable rowI think my main question is: What does na = "propagate", unmatchable = NA, no_match = -1
do? Of na
and unmatchable
, who wins? I guess I'd expect the unmatchable
to win, since that was set explicitly, so NA
values would result in -1
oh I see what you mean now, they have different defaults:
vctrs::vec_match(NA, NA)
#> [1] 1
vctrs::vec_equal(NA, NA)
#> [1] NA
What does na = "propagate", unmatchable = NA, no_match = -1 do? Of na and unmatchable, who wins? I guess I'd expect the unmatchable to win, since that was set explicitly, so NA values would result in -1
hmm to me it's na
that is set explicitly, and unmatchable
might come from a variable possibly containing NA
. So if na = "error"
, I would expect NA
values to cause an error even if unmatchable. And if na = "propagate"
, you're sure to get the NA
back as well.
Another way to see it is that you can only match values that can compare equal. So "unmatchable" does not apply if na
is not "equal"
.
Another way to see it is that you can only match values that can compare equal
Oh this makes sense, you're right
Just to be clear vec_match(NA, NA)
and vec_equal(NA, NA)
should return different values because base R handles NAs different in matching and equality.
I don't particularly like the proposal for na = "propagate"
because I don't think this is about propagation; it's about whether or not NA
matches or equals NA
. I would prefer vec_match(x, y, na_equal = FALSE)
. I don't see the need for na = "error"
at this time.
I'd also prefer not to introduce an unmatchable
argument at this time; you can work around in a straightforward way (by doing the matching yourself), and I'd prefer not to introduce this complexity in such a fundamental function until we have multiple indicators of its usefulness.
I think we could default no_match
to NA
and require that it always be a length one integer.
I always found the na_equal
argument of vec_equal()
confusing to think about, it makes me pause every time I try to use it. I think it's even more confusing with vec_match()
.
Thinking of it as the alternative between "propagating missingness" or "comparing equal" makes it much more intuitive to me. This is why I prefer na = c("propagate", "equal")
.
Also I think it's useful to have the "error"
alternative for type checking.
Given that it exists invec_equal()
, it makes sense to me, and we have no current need for errors, I would really rather keep this as simple as possible for now. That will let us resolve the dplyr issue and we can come back to a richer API in the future.
I think what doesn't make sense to me is this:
vec_equal(NA, NA, na_equal = FALSE)
#> [1] NA
If na
doesn't compare equal, it should be false, not NA
. The behaviour when na_equal
is FALSE is really that they propagate, not that they don't compare equal. This is an important distinction because it changes the return type of the function. The function is clearer when the behaviour has a name.
I would really rather keep this as simple as possible for now. That will let us resolve the dplyr issue and we can come back to a richer API in the future.
Ok, no worries. I thought it might make sense to update the API before introducing it in vec-match.
Oh interesting, because that does make sense to me 😄 To me na_equal = TRUE
, implies that NA == NA
is TRUE; na_equal = FALSE
doesn't imply anything about that value. (I'm not explaining this very well, I'll try again later)
na_equal = FALSE doesn't imply anything about that value
This makes sense, but then it means that it falls back to the default behaviour, which is to "propagate" NA
. I think naming this behaviour and the return type it implies clarifies the function.
If na doesn't compare equal, it should be false, not NA
FWIW this also confuses me every time I use vec_equal()
Let's come back to that again in the future. Fortunately there are few times that you don't need the default so I don't think a somewhat confusing argument spec is too harmful here.
Useful for this idiom that takes advantage of
0
indices being ignored: