markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
450 stars 32 forks source link

Unexpected NA handling with arrange.() #541

Closed mattsams89 closed 2 years ago

mattsams89 commented 2 years ago

Apologies if I missed it somewhere in the closed issues for the coming 0.8.1 update, but arrange.() is placing NAs at the beginning of vectors instead of at the end following the 0.8.0 update.

.df <-
  tidytable(group = c(rep("C-suite", 3), 
                      rep("Director", 3)), 
            employee = c("Bob", NA, "Jill", "James", NA, "Jackie"), 
            years = c(1, 3, NA, NA, 5, 7))

arrange.(.df, group, employee)
arrange.(.df, group, years)

dplyr::arrange(.df, group, employee)
dplyr::arrange(.df, group, years)
markfairbanks commented 2 years ago

Thanks for catching this. In v0.8.0 I updated arrange.() to use setorder() whenever possible because it was faster. But for some reason data.table has different defaults between df[order()] and setorder() when it comes to how NAs are sorted. df[order()] puts NAs at the end whereas setorder() puts them at the beginning.

pacman::p_load(data.table)

df <- data.table(x = c("a", "b", NA))

df[order(x)]
#>         x
#>    <char>
#> 1:      a
#> 2:      b
#> 3:   <NA>

setorder(copy(df), x)[]
#>         x
#>    <char>
#> 1:   <NA>
#> 2:      a
#> 3:      b

It doesn't really make sense to me why they would have different defaults between their own ordering functions 🤷‍♂️

It's an easy fix though - looks like setorder() has an na.last argument that you can set to TRUE.

setorder(copy(df), x, na.last = TRUE)[]
#>         x
#>    <char>
#> 1:      a
#> 2:      b
#> 3:   <NA>

And the benchmarks to show that the setorder() method is faster.

pacman::p_load(tidytable, data.table, stringi)

chr <- stri_rand_strings(100, 4)

data_size <- 10000000

df <- data.table(x = sample(chr, data_size, TRUE),
                 y = sample(chr, data_size, TRUE),
                 z = sample(1:100, data_size, TRUE))

bench::mark(
  order = df[order(x, y, z)],
  setorder = setorder(copy(df), x, y, z, na.last = TRUE)[],
  iterations = 11
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 order         635ms    694ms      1.39     231MB     1.14
#> 2 setorder      335ms    403ms      2.54     315MB     2.54
markfairbanks commented 2 years ago

All fixed - thanks again for catching this.

pacman::p_load(tidytable)

df <- tidytable(x = c("b", NA, "a"))

df %>%
  arrange.(x)
#> # A tidytable: 3 × 1
#>   x    
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 <NA>

df %>%
  dplyr::arrange(x)
#> # A tidytable: 3 × 1
#>   x    
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 <NA>