Closed · dibbles21 closed this issue 1 year ago
Unfortunately I can't add this feature to tidytable.

data.table leaves it up to the user to be careful about when to use copy() (which creates a "deep" copy) and when to modify-by-reference (using things like setkey() or setorder()). In tidytable I take care of all of that in the background and take "shallow" copies wherever possible to keep the operations fast. On the other hand, data.table never uses shallow copies. Some functions in tidytable are faster than the data.table versions because of this, and some data.table versions are faster than the tidytable versions, but on the whole tidytable is generally as fast as data.table. The benchmarks in https://github.com/markfairbanks/tidytable/issues/312#issuecomment-1137433541 show this for some core functions.
General performance aside - since I use shallow copies as much as possible, using setorder()/setkey() on a tidytable can cause modify-by-reference in places a user wouldn't expect it. Here's an example.

In tidytable I use a shallow copy in select() so that it's a very fast transformation. In this example most users would expect df2 to remain unaltered. Note how df2 is altered by setorder(df1, x).
library(tidytable, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
df1 <- tidytable(x = c("b", "a", "c"), y = c(2, 1, 3))
df2 <- df1 %>%
select(x, y)
setorder(df1, x)
df2
#> # A tidytable: 3 × 2
#> x y
#> <chr> <dbl>
#> 1 a 1
#> 2 b 2
#> 3 c 3
However, for joins specifically - are the joins faster when you include the time of the setkey() step? From what I can tell, including setkey() makes the joins take about the same amount of time.

I could be wrong here, but this is my attempt at testing it. It's a little tough to test since setkey() modifies-by-reference, so I take a copy of the datasets and subtract the copy time from the benchmarks to get an adjusted time. (At the end you'll see I benchmark copying both data frames and subtract that from the dt_with_setkey result.)
library(tidytable, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
# Benchmark the given expressions and return a one-row table of median times (ms)
bench_mark <- function(..., .fn = NULL) {
bench::mark(...,
check = FALSE, iterations = 30,
memory = FALSE, time_unit = "ms") %>%
suppressWarnings() %>%
mutate(expression = as.character(expression),
median = round(median, 1),
func_tested = .env$.fn) %>%
select(func_tested, expression, median) %>%
pivot_wider(names_from = expression, values_from = median)
}
random_strings <- c("OoVt", "wCbu", "cXxX", "jdFu", "MCRx",
"ukhz", "ikce", "PHyu", "jpBY", "nLQM")
test_data <- function(.size) {
tidytable(a = sample(1:20, .size, TRUE),
b = sample(1:20, .size, TRUE),
c = sample(random_strings, .size, TRUE),
d = sample(random_strings, .size, TRUE))
}
left_df <- test_data(1000000)
right_df <- left_df %>%
distinct(c, d) %>%
mutate(e = row_number())
keyed_left_df <- setkeyv(copy(left_df), c("c", "d"))
keyed_right_df <- setkeyv(copy(right_df), c("c", "d"))
# data.table join including the setkey() step. copy() first so setkeyv()
# doesn't alter the caller's tables by reference; the copy cost is timed
# separately and subtracted out of the final benchmark.
dt_join <- function(left, right, on) {
  left <- copy(left)
  setkeyv(left, on)
  right <- copy(right)
  setkeyv(right, on)
  right[left, allow.cartesian = TRUE]
}
left_join_marks <- bench_mark(
copy_left = copy(left_df),
copy_right = copy(right_df),
tidytable = left_join(left_df, right_df, by = c("c", "d")),
dt_with_setkey = dt_join(left_df, right_df, c("c", "d")),
dt_pre_keyed = keyed_right_df[keyed_left_df, allow.cartesian = TRUE],
.fn = "left_join"
)
left_join_marks %>%
mutate(dt_with_setkey = dt_with_setkey - copy_left - copy_right) %>%
select(-starts_with("copy"))
#> # A tidytable: 1 × 4
#> func_tested tidytable dt_with_setkey dt_pre_keyed
#> <chr> <dbl> <dbl> <dbl>
#> 1 left_join 54.1 61 34.1
Also - glad to hear tidytable is working out for you 😄
Hi Mark,
Thank you for your time in reviewing this and putting together the benchmarks. It's helpful to read the explanation of why this can't be achieved.

I have adapted your code above to better match my use case (larger data, more join columns). I find that tidytable tends to be around 15% slower than dt_with_setkey in my case - I don't know if you can replicate that? I reduced the number of iterations because of the extra time needed to run each one.
library(tidytable, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
bench_mark <- function(..., .fn = NULL) {
bench::mark(...,
check = FALSE, iterations = 5,
memory = FALSE, time_unit = "ms") %>%
suppressWarnings() %>%
mutate(expression = as.character(expression),
median = round(median, 1),
func_tested = .env$.fn) %>%
select(func_tested, expression, median) %>%
pivot_wider(names_from = expression, values_from = median)
}
test_data <- function(.size) {
tidytable(a = sample(1:20, .size, TRUE),
b = sample(1:20, .size, TRUE),
c = sample(1:20, .size, TRUE),
d = sample(1:20, .size, TRUE),
e = sample(1:20, .size, TRUE))
}
left_df <- test_data(2e7)
right_df <- left_df |>
slice(1:1e6) |>
mutate(f = 1)
keyed_left_df <- setkeyv(copy(left_df), c("a", "b", "c", "d", "e"))
keyed_right_df <- setkeyv(copy(right_df), c("a", "b", "c", "d", "e"))
dt_join <- function(left, right, on) {
left <- copy(left)
setkeyv(left, on)
right <- copy(right)
setkeyv(right, on)
right[left, allow.cartesian = TRUE]
}
left_join_marks <- bench_mark(
copy_left = copy(left_df),
copy_right = copy(right_df),
tidytable = left_join(left_df, right_df, by = c("a", "b", "c", "d", "e")),
dt_with_setkey = dt_join(left_df, right_df, c("a", "b", "c", "d", "e")),
dt_pre_keyed = keyed_right_df[keyed_left_df, allow.cartesian = TRUE],
.fn = "left_join"
)
left_join_marks %>%
mutate(dt_with_setkey = dt_with_setkey - copy_left - copy_right) %>%
select(-starts_with("copy"))
#> # A tidytable: 1 x 4
#> func_tested dt_pre_keyed dt_with_setkey tidytable
#> <chr> <dbl> <dbl> <dbl>
#> 1 left_join         1427.          2481.      2881
Thank you again :)
Dan
Interestingly, when I run the code with one iteration in a fresh R session (i.e., after restarting R), tidytable is around 35% slower than dt_with_setkey. If I then run the same code again in the same session, the difference shrinks to around 10-15%. I've tried this a few times to confirm it wasn't random. Could it be that tidytable is making better use of caching? I've noticed this with my analysis pipeline too, which takes around 2 minutes in a fresh session and then around 1.7 minutes if run again in the same session.

I realise that running with one iteration is not good benchmarking, but it more accurately reflects my real use, and I'm interested in what might be causing this.
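One way to compare a cold start against a warm session without restarting R by hand might be something like the sketch below (assuming the callr package is installed; the toy data here is just a stand-in, not my real pipeline):

one_run <- function() {
  library(tidytable, warn.conflicts = FALSE)
  left_df <- tidytable(key = sample(1:20, 1e6, TRUE), val = rnorm(1e6))
  right_df <- tidytable(key = 1:20, label = letters[1:20])
  # return only the elapsed wall-clock time of the join
  system.time(left_join(left_df, right_df, by = "key"))[["elapsed"]]
}
callr::r(one_run)  # one iteration in a fresh R process (cold start)
one_run()          # one iteration in the current, already-warm session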
Interesting, I'm not sure why the fresh session is causing things to be slower.
As for the join timings - I think this is a situation where tidytable will just be slower than data.table in some situations. Thanks for the discussion though - if you run into any other features you want added let me know 😄
Thank you Mark!
Hi Mark,
Firstly, thank you for your excellent package which I use regularly 😊.
Is there a way to sort a tidytable using 'setkey' or 'setkeyv' from {data.table}? I'm finding my joins much quicker using {data.table} with 'setkeyv' than with {tidytable}.
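Roughly the pattern I have in mind (a minimal sketch with made-up columns and sizes):

library(data.table, warn.conflicts = FALSE)
left_dt  <- data.table(id = sample(1:100, 1e5, TRUE), x = rnorm(1e5))
right_dt <- data.table(id = 1:100, y = rnorm(100))
setkeyv(left_dt, "id")       # sorts and keys by reference
setkeyv(right_dt, "id")
joined <- right_dt[left_dt]  # keyed join: one row per row of left_dt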
Many thanks,
Dan