Closed khusmann closed 2 months ago
@lionel- @DavisVaughan
Ok, I know I passionately advocated for this yesterday at conf, but after starting thinking about this again after a long break, I'm embarrassingly reconsidering the need for this (for my use case, at least).
I realized that for my application, it may actually be more theoretically consistent for comparisons with missing values (including missing reasons) to result in missing values. In base R, we have:
1 == NA => NA NA == NA => NA
similarly:
1 + NA => NA 1 - NA => NA (etc).
In interlacer, I say "operations on interlaced values are applied to the value channel". This way, when adding two interlaced vectors, only the values are added in the vector and missing values get automatically "masked out". This relies on:
1 + na("reason") => NA 1 - na("reason") => NA
Therefore, equality should also only apply to the value channel, just like for missing values without reasons:
1 == na("reason") => NA na("reason1") == na("reason2") => NA
But what I was asking for in this PR was the ability to compare missing reasons:
na("reason1") == na("reason2") => false na("reason1") == na("reason1") => true
But this is actually a different type of comparison, and breaks the interlacer rule if we allow it! (that is, breaks the rule that operations apply on values, masking out missingness). Instead, we should think of it as a new kind of comparison that compares values AND missing values (rather than comparing values and masks out missing). So I think what I actually need is to define a new form of the ==
and !=
operators that compare both the value and missingness of a value, and always return a boolean (never missing). Something like:
na("reason1") %==% na("reason2") => false na("reason1") %==% na("reason1") => true 1 %==% na("reason") => false
This operator would also make sense with unlabelled missingness (raw NA
values):
1 %==% NA => false NA %==% NA => true na("reason1") %==% NA => false
Anyway, tldr I need to think about this a little more… but I think we'll probably be able to close the PR! (if you have any theory input/perspective on this I'd really appreciate it but no pressure; I know y'all are super busy)
Sorry for the flip flop on this!
The latest version of interlacer (v0.4.0) now has the behavior I describe above, which so far seems to be working well... I'm closing this issue now!
I think it makes sense that all the vctrs ops operate on the underlying value, but then you could have special interlacer_equals()
operations or something like that (possibly an infix op like you showed) that "know" about the different kinds of missingness as well
It's kind of like how in ivs you can do vctrs::vec_set_difference()
and it uses the vctrs definition of sets, but there is also ivs::iv_set_difference()
that uses the interval definition of sets
It's kind of like how in ivs you can do vctrs::vec_set_difference() and it uses the vctrs definition of sets, but there is also ivs::iv_set_difference() that uses the interval definition of sets
Good parallel! (and ivs looks super cool btw)
The other nice part of the infix %==%
operator is it is I'm finding that even without interlaced types the definition makes sense and is actually pretty useful. It can be used to determine if the values of two columns are identical in values and missing pattern, even with unlabeled missingness:
library(interlacer)
a <- c(1, 2, NA, 3)
b <- c(1, 2, NA, NA)
a == b
#> [1] TRUE TRUE NA NA
a %==% b
#> [1] TRUE TRUE TRUE FALSE
# This becomes handy if I want to see if the vectors are identical
# in values and in missing pattern:
all(a %==% b)
#> [1] FALSE
all(a %==% a)
#> [1] TRUE
# By contrast, if I naively use base `==` I get `NA`
all(a == b)
#> [1] NA
all(a == a)
#> [1] NA
# To get the same behavior as all(a %==% b) requires a bit of gymnastics:
all(ifelse(is.na(a == b), is.na(a) & is.na(b), a == b))
#> [1] FALSE
Created on 2024-09-05 with reprex v2.0.2
As far as I'm aware, there aren't any other convenient ways to make this basic kind of value & missingness pattern comparison in base or rlang or vctrs... Right?
Sure, there's all.equal()
and identical()
, but 1. they operate on the full vector & don't have vectorized versions and 2. the former is for "near" equality, and the latter additionally compares attributes (rather than just values).
vec_equal(na_equal = TRUE)
offers some of this behavior
a <- c(1, 2, NA, 3)
b <- c(1, 2, NA, NA)
a == b
#> [1] TRUE TRUE NA NA
vctrs::vec_equal(a, b)
#> [1] TRUE TRUE NA NA
vctrs::vec_equal(a, b, na_equal = TRUE)
#> [1] TRUE TRUE TRUE FALSE
Created on 2024-09-05 with reprex v2.0.2
Ah nice! So I can simplify / generalize the definition of x %==% y
to be:
`%==%` <- function(x, y) {
args <- vec_cast_common(x, y)
vec_equal(vec_proxy(args[[1]]), vec_proxy(args[[2]]), na_equal = TRUE)
}
(Note vec_proxy()
is used here instead of vec_proxy_equal()
because vec_proxy_equal()
returns a vctr designed for value comparison with ==
, not for deep structural comparison as we want here)
With this definition the operator becomes meaningful for all vctrs record types as a "record equal" (or "deep equal"?) operator, as opposed to the usual "value equal". Cool.
For ivs, for example, this would allow you to compare intervals with one or more bounds missing. Interval vectors essentially can have 3 types of structural missingness: missing start, missing end, and missing both. So you'd have:
iv(5, 5) %==% iv(5, 5) # TRUE
iv(5, NA) %==% iv(5, NA) # TRUE
iv(NA, 5) %==% iv(NA, 5) # TRUE
iv(NA, NA) %==% iv(NA, NA) # TRUE
iv(5, NA) %==% iv(NA, NA) # FALSE
# Etc.
Compared to behavior of regular ==
:
iv(5, 5) == iv(5, 5) # TRUE
iv(5, NA) == iv(5, NA) # NA
iv(NA, 5) == iv(NA, 5) # NA
iv(NA, NA) == iv(NA, NA) # NA
iv(5, NA) == iv(NA, NA) # NA
# Etc.
This makes sense because when we test iv(5, NA) == iv(5, NA)
, we're asking "were these the same measured interval?" and the correct answer is "we don't know, because we forgot what the upper bound was... so we can't say either way". Whereas when we ask iv(5, NA) %==% iv(5, NA)
, we're asking "are these record entries equivalent?" and the answer is "yes, because they both have the same lower bound and are both missing the upper bound".
Of course right now in ivs when either the start or end value of the input is NA you force both to be NA (to avoid needing to deal with the complexity of incomplete intervals, which is totally valid). Still, it's interesting to think about applications of this operator outside of interlacer... If this has more broad application, it might be worth considering for inclusion into vctrs.
Anyway, thanks for thinking this thru with me! I'll continue to stew, and pls don't hesitate to tag me if you see any more applications or parallels on anything else you're working on...
vec_proxy_missing()
adds a new proxy that allows developers to extend theis.na()
,na.omit()
etc. behavior of their vctrs (as I describe in (#1925). It's useful/needed when vctrs contain special values (or data structures) that represent different types of missingness in their data. In these cases, you need a separate proxy forequal
andmissing
. It also lets you redefine the meaning ofcomplete
.I wanted to take an experimental shot at the implementation here because it didn't seem to hard, and I wanted to play around with the resulting API. I'm extremely satisfied with the result -- this is exactly what I need for implementing a generic vctrs-backed Result<> type in the library I'm working on, and I think could have similar applications elsewhere (like
haven::labelled_spss
, or declared)I've added a few tests and documentation updates, but it's only a first-pass. I would be happy to do a deeper & more thorough dive if this is considered for inclusion into vctrs.
Thank you for your consideration!