Closed mattsams89 closed 2 years ago
Thanks for catching this. In v0.8.0 I updated arrange.()
to use setorder()
whenever possible because it was faster. But for some reason data.table
has different defaults between df[order()]
and setorder()
when it comes to how NA
s are sorted. df[order()]
puts NA
s at the end whereas setorder()
puts them at the beginning.
pacman::p_load(data.table)
df <- data.table(x = c("a", "b", NA))
df[order(x)]
#> x
#> <char>
#> 1: a
#> 2: b
#> 3: <NA>
setorder(copy(df), x)[]
#> x
#> <char>
#> 1: <NA>
#> 2: a
#> 3: b
It doesn't really make sense to me why they would have different defaults between their own ordering functions 🤷♂️
It's an easy fix though - looks like setorder()
has an na.last
argument that you can set to TRUE
.
setorder(copy(df), x, na.last = TRUE)[]
#> x
#> <char>
#> 1: a
#> 2: b
#> 3: <NA>
And the benchmarks to show that the setorder()
method is faster.
pacman::p_load(tidytable, data.table, stringi)
chr <- stri_rand_strings(100, 4)
data_size <- 10000000
df <- data.table(x = sample(chr, data_size, TRUE),
y = sample(chr, data_size, TRUE),
z = sample(1:100, data_size, TRUE))
bench::mark(
order = df[order(x, y, z)],
setorder = setorder(copy(df), x, y, z, na.last = TRUE)[],
iterations = 11
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 order 635ms 694ms 1.39 231MB 1.14
#> 2 setorder 335ms 403ms 2.54 315MB 2.54
All fixed - thanks again for catching this.
pacman::p_load(tidytable)
df <- tidytable(x = c("b", NA, "a"))
df %>%
arrange.(x)
#> # A tidytable: 3 × 1
#> x
#> <chr>
#> 1 a
#> 2 b
#> 3 <NA>
df %>%
dplyr::arrange(x)
#> # A tidytable: 3 × 1
#> x
#> <chr>
#> 1 a
#> 2 b
#> 3 <NA>
Apologies if I missed it somewhere in the closed issues for the coming 0.8.1 update, but
arrange.()
is placing NAs at the beginning of vectors instead of at the end following the 0.8.0 update.