moodymudskipper / inops

Infix Operators for Detection, Subsetting and Replacement
GNU General Public License v3.0
40 stars 0 forks source link

inconsistent way of dealing with factors in `%in{}%` #19

Closed moodymudskipper closed 4 years ago

moodymudskipper commented 4 years ago
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# pure characters: everything works as it should
"a" == "a"                  # TRUE
"a" == list("a")            # TRUE
list("a") == "a"            # TRUE

list("a") == list("a")      # fails
list("a") %in% list("a")    # TRUE

"a" %in{}% "a"              # TRUE
"a" %in{}% list("a")        # TRUE
list("a") %in{}% "a"        # TRUE

list("a") %in{}% list("a")  # TRUE

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# factors on the right: we observe inconsistencies
"a" == factor("a")                # TRUE
"a" == list(factor("a"))          # FALSE
list("a") == factor("a")          # TRUE

list("a") == list(factor("a"))    # fails
list("a") %in% list(factor("a"))  # FALSE

"a" %in{}% factor("a")             # TRUE
"a" %in{}%  list(factor("a"))      # TRUE, should be FALSE for consistency
list("a") %in{}% factor("a")       # NA + warning, should be TRUE
list("a") %in{}% list(factor("a")) # NA + warning, should be FALSE ?

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# factors on the left: everything works as it should
factor("a") == "a"                  # TRUE
factor("a") == list("a")            # TRUE
list(factor("a")) == "a"            # FALSE

list(factor("a")) == list("a")      # fails
list(factor("a")) %in% list("a")    # FALSE

factor("a") %in{}% "a"              # TRUE
factor("a") %in{}% list("a")        # TRUE
list(factor("a")) %in{}% "a"        # FALSE

list(factor("a")) %in{}% list("a")  # FALSE

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# factors on both sides
factor("a") == factor("a")          # TRUE
factor("a") == list(factor("a"))    # FALSE
list(factor("a")) == factor("a")    # FALSE

list(factor("a")) == list(factor("a"))   # fails
list(factor("a")) %in% list(factor("a")) # TRUE

factor("a") %in{}% factor("a")       # TRUE
factor("a") %in{}% list(factor("a")) # TRUE, should be FALSE for consistency
list(factor("a")) %in{}% factor("a") # FALSE

list(factor("a")) %in{}% list(factor("a"))  # FALSE, should be TRUE
moodymudskipper commented 4 years ago

This passes all tests, is simpler, and works better with examples above, but is slower :

library(rangeops)
#> 
#> Attaching package: 'rangeops'
#> The following object is masked from 'package:base':
#> 
#>     <<-
`%in{}2%` <- function(x, table){
  if(is.language(table)) return(x == table)
  Reduce(`|`, lapply(table, `==`, x))
}

set.seed(42)
x_ <- sample(1000,100)
table_ <- sample(1000,100)
table2_ <- as.list(table_)
x2_ <- as.list(x_)
intersect(x_, table_)
#>  [1] 410 930 989 299 348 504 945 262 669 630
microbenchmark::microbenchmark(
  x_ %in% table_,
  x_ %in{}% table_,
  x_ %in{}2% table_
)
#> Unit: microseconds
#>               expr   min     lq    mean median     uq    max neval cld
#>     x_ %in% table_   2.1   2.65   3.786    3.0   3.60   18.5   100  a 
#>   x_ %in{}% table_   5.2   6.40  13.069    7.7   8.70  403.0   100  a 
#>  x_ %in{}2% table_ 111.5 129.70 184.133  142.0 180.95 2633.8   100   b

microbenchmark::microbenchmark(
  x2_ %in% table2_,
  x2_ %in{}% table2_,
  x2_ %in{}2% table2_
)
#> Unit: microseconds
#>                 expr   min     lq    mean median     uq   max neval cld
#>     x2_ %in% table2_ 162.6 168.80 184.088 171.00 174.90 477.6   100   b
#>   x2_ %in{}% table2_   7.4   8.50  11.936   9.30  10.85  99.4   100  a 
#>  x2_ %in{}2% table2_ 164.0 167.15 191.750 175.25 203.40 392.5   100   b

microbenchmark::microbenchmark(
  x_ %in% table2_,
  x_ %in{}% table2_,
  x_ %in{}2% table2_
)
#> Unit: microseconds
#>                expr   min     lq    mean median     uq   max neval cld
#>     x_ %in% table2_  95.0  96.55 100.216  97.80  99.45 206.1   100  b 
#>   x_ %in{}% table2_   6.1   6.95   8.618   7.55   8.10  42.6   100 a  
#>  x_ %in{}2% table2_ 104.1 105.50 113.780 107.35 110.35 267.8   100   c

microbenchmark::microbenchmark(
  x2_ %in% table_,
  x2_ %in{}% table_,
  x2_ %in{}2% table_
)
#> Unit: microseconds
#>                expr   min     lq    mean median     uq   max neval cld
#>     x2_ %in% table_ 100.2 101.50 143.410 102.40 103.80 531.0   100  b 
#>   x2_ %in{}% table_   6.2   7.05  11.874   8.10   8.60  68.7   100 a  
#>  x2_ %in{}2% table_ 166.6 169.15 240.365 171.15 174.35 916.7   100   c

Created on 2019-10-31 by the reprex package (v0.3.0)

In favor of new implementation:

In favor of previous :

moodymudskipper commented 4 years ago

note :

"a" %in{}% list("b", factor("a")) is FALSE but "a" %in{}% list(factor("a")) is TRUE, this cannot be acceptable.

with %in{}2% they are both TRUE, while "a" == list(factor("a")) is FALSE, but considering %in{}% should be seen as applying == on ELEMENTS of the rhs I believe this is OK.

I think the short and slow version might be the way to go to have something robust, and if speed becomes an issue we can design a bunch of unit tests (and benchmarks) and propose something better.

moodymudskipper commented 4 years ago

went with simple version with https://github.com/moodymudskipper/rangeops/commit/35e983be11b02c4616f1fd042bed52e784703caa