Closed · dibbles21 closed this issue 1 year ago
Unfortunately I can't add this feature to tidytable.

data.table leaves it up to the user to be careful about when to use copy() (which creates a "deep" copy) and when to modify-by-reference (using things like setkey() or setorder()). In tidytable I take care of all of that in the background and take "shallow" copies wherever possible to keep the operations fast. On the other hand, data.table never uses shallow copies. Some functions in tidytable are faster than the data.table versions because of this, and some data.table versions are faster than the tidytable versions, but on the whole tidytable is generally as fast as data.table. The benchmarks in https://github.com/markfairbanks/tidytable/issues/312#issuecomment-1137433541 show this for some core functions.
General performance aside - since I use shallow copies as much as possible, using setorder()/setkey() on a tidytable can cause modify-by-reference in places a user wouldn't expect it. Here's an example.

In tidytable I use a shallow copy in select() so that it's a very fast transformation. In this example most users would expect df2 to remain unaltered. Note how df2 is altered by setorder(df1, x).
library(tidytable, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
df1 <- tidytable(x = c("b", "a", "c"), y = c(2, 1, 3))
df2 <- df1 %>%
select(x, y)
setorder(df1, x)
df2
#> # A tidytable: 3 × 2
#> x y
#> <chr> <dbl>
#> 1 a 1
#> 2 b 2
#> 3 c 3
However, for joins specifically - are the joins faster when you include the time of the setkey() step? From what I can tell, including setkey() makes the joins take about the same amount of time.

I could be wrong here, but this is my attempt at testing it. It's a little tough to test since setkey() modifies-by-reference, so I take a copy of the datasets and subtract the copy time from the benchmarks to get an adjusted time. (At the end you'll see I benchmark copying both data frames and subtract that from the dt_with_setkey result.)
library(tidytable, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
# Benchmark the given expressions and return a one-row table of median times (ms)
bench_mark <- function(..., .fn = NULL) {
bench::mark(...,
check = FALSE, iterations = 30,
memory = FALSE, time_unit = "ms") %>%
suppressWarnings() %>%
mutate(expression = as.character(expression),
median = round(median, 1),
func_tested = .env$.fn) %>%
select(func_tested, expression, median) %>%
pivot_wider(names_from = expression, values_from = median)
}
random_strings <- c("OoVt", "wCbu", "cXxX", "jdFu", "MCRx",
"ukhz", "ikce", "PHyu", "jpBY", "nLQM")
test_data <- function(.size) {
tidytable(a = sample(1:20, .size, TRUE),
b = sample(1:20, .size, TRUE),
c = sample(random_strings, .size, TRUE),
d = sample(random_strings, .size, TRUE))
}
left_df <- test_data(1000000)
right_df <- left_df %>%
distinct(c, d) %>%
mutate(e = row_number())
keyed_left_df <- setkeyv(copy(left_df), c("c", "d"))
keyed_right_df <- setkeyv(copy(right_df), c("c", "d"))
# data.table join including the setkey() step. copy() first so setkeyv()
# doesn't alter the caller's tables by reference; the copy cost is timed
# separately and subtracted out of the final benchmark.
dt_join <- function(left, right, on) {
  left <- copy(left)
  setkeyv(left, on)
  right <- copy(right)
  setkeyv(right, on)
  right[left, allow.cartesian = TRUE]
}
left_join_marks <- bench_mark(
copy_left = copy(left_df),
copy_right = copy(right_df),
tidytable = left_join(left_df, right_df, by = c("c", "d")),
dt_with_setkey = dt_join(left_df, right_df, c("c", "d")),
dt_pre_keyed = keyed_right_df[keyed_left_df, allow.cartesian = TRUE],
.fn = "left_join"
)
left_join_marks %>%
mutate(dt_with_setkey = dt_with_setkey - copy_left - copy_right) %>%
select(-starts_with("copy"))
#> # A tidytable: 1 × 4
#> func_tested tidytable dt_with_setkey dt_pre_keyed
#> <chr> <dbl> <dbl> <dbl>
#> 1 left_join 54.1 61 34.1
Also - glad to hear tidytable is working out for you 😄
Hi Mark,
Thank you for your time in reviewing this and putting together the benchmarks. It's helpful to read the explanation of why this can't be achieved.

I have adapted your code above to better match my use case (larger data, more join columns). I find that tidytable tends to be around 15% slower than dt_with_setkey in my case - I don't know if you can replicate that? I reduced the number of iterations because of the extra time needed to run each one.
library(tidytable, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)
bench_mark <- function(..., .fn = NULL) {
bench::mark(...,
check = FALSE, iterations = 5,
memory = FALSE, time_unit = "ms") %>%
suppressWarnings() %>%
mutate(expression = as.character(expression),
median = round(median, 1),
func_tested = .env$.fn) %>%
select(func_tested, expression, median) %>%
pivot_wider(names_from = expression, values_from = median)
}
test_data <- function(.size) {
tidytable(a = sample(1:20, .size, TRUE),
b = sample(1:20, .size, TRUE),
c = sample(1:20, .size, TRUE),
d = sample(1:20, .size, TRUE),
e = sample(1:20, .size, TRUE))
}
left_df <- test_data(2e7)
right_df <- left_df |>
slice(1:1e6) |>
mutate(f = 1)
keyed_left_df <- setkeyv(copy(left_df), c("a", "b", "c", "d", "e"))
keyed_right_df <- setkeyv(copy(right_df), c("a", "b", "c", "d", "e"))
dt_join <- function(left, right, on) {
left <- copy(left)
setkeyv(left, on)
right <- copy(right)
setkeyv(right, on)
right[left, allow.cartesian = TRUE]
}
left_join_marks <- bench_mark(
copy_left = copy(left_df),
copy_right = copy(right_df),
tidytable = left_join(left_df, right_df, by = c("a", "b", "c", "d", "e")),
dt_with_setkey = dt_join(left_df, right_df, c("a", "b", "c", "d", "e")),
dt_pre_keyed = keyed_right_df[keyed_left_df, allow.cartesian = TRUE],
.fn = "left_join"
)
left_join_marks %>%
mutate(dt_with_setkey = dt_with_setkey - copy_left - copy_right) %>%
select(-starts_with("copy"))
#> # A tidytable: 1 x 4
#> func_tested dt_pre_keyed dt_with_setkey tidytable
#> <chr> <dbl> <dbl> <dbl>
#> 1 left_join         1427.          2481.      2881
Thank you again :)
Dan
Interestingly, when I run the code with one iteration in a fresh R session (i.e., after restarting R), tidytable is around 35% slower than dt_with_setkey. If I then run the same code again in the same session, the difference shrinks to around 10-15%. I've tried this a few times to confirm it wasn't random. Could it be that tidytable is making better use of caching? I've noticed this with my analysis pipeline too, which takes around 2 minutes in a fresh session and then around 1.7 minutes if run again in the same session.

I realise that running with one iteration is not good benchmarking, but it more accurately reflects my real use, and I'm interested in what might be causing this.
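One way to compare a cold start against a warm session without restarting R by hand might be something like the sketch below (assuming the callr package is installed; the toy data here is just a stand-in, not my real pipeline):

one_run <- function() {
  library(tidytable, warn.conflicts = FALSE)
  left_df <- tidytable(key = sample(1:20, 1e6, TRUE), val = rnorm(1e6))
  right_df <- tidytable(key = 1:20, label = letters[1:20])
  # return only the elapsed wall-clock time of the join
  system.time(left_join(left_df, right_df, by = "key"))[["elapsed"]]
}
callr::r(one_run)  # one iteration in a fresh R process (cold start)
one_run()          # one iteration in the current, already-warm session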
Interesting, I'm not sure why the fresh session is causing things to be slower.
As for the join timings - I think this is a situation where tidytable will just be slower than data.table in some situations. Thanks for the discussion though - if you run into any other features you want added let me know 😄
Thank you Mark!
Hi Mark,
Firstly, thank you for your excellent package which I use regularly 😊.
Is there a way to sort a tidytable using 'setkey' or 'setkeyv' from {data.table}? I'm finding my joins much quicker using {data.table} with 'setkeyv' than with {tidytable}.
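Roughly the pattern I have in mind (a minimal sketch with made-up columns and sizes):

library(data.table, warn.conflicts = FALSE)
left_dt  <- data.table(id = sample(1:100, 1e5, TRUE), x = rnorm(1e5))
right_dt <- data.table(id = 1:100, y = rnorm(100))
setkeyv(left_dt, "id")       # sorts and keys by reference
setkeyv(right_dt, "id")
joined <- right_dt[left_dt]  # keyed join: one row per row of left_dt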
Many thanks,
Dan