Closed Kimozu closed 2 months ago
tidytable
is not designed to work with modify-by-reference like data.table
does. It instead utilizes shallow copies like dplyr
does when possible.
Though tidytable
will be slower than using modify-by-reference to add columns it isn't nearly as expensive as using copy()
on a dataset in normal data.table
.
Hope this helps - if you have any questions let me know.
library(data.table)
library(tidytable)
random_strings <- c('OoVt', 'wCbu', 'cXxX', 'jdFu', 'MCRx',
'ukhz', 'ikce', 'PHyu', 'jpBY', 'nLQM')
test_data <- function(.size) {
tidytable(a = sample(1:20, .size, TRUE),
b = sample(1:20, .size, TRUE),
c = sample(random_strings, .size, TRUE),
d = sample(random_strings, .size, TRUE))
}
tidytable_df <- test_data(10000000)
datatable_df <- as.data.table(tidytable_df)
bench::mark(
tidytable_mutate = tidytable_df |> mutate(e = 1),
tidytable_dt = tidytable_df |> dt(, e := 1),
data.table_copy = copy(datatable_df)[, e := 1],
data.table_by_ref = datatable_df[, e := 1],
check = FALSE, iterations = 30
) |>
select(expression, min, median, mem_alloc)
#> # A tidytable: 4 × 4
#> expression min median mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <bch:byt>
#> 1 tidytable_mutate 6.22ms 6.95ms 78.9MB
#> 2 tidytable_dt 6.38ms 6.56ms 76.5MB
#> 3 data.table_copy 23.58ms 25.62ms 305.2MB
#> 4 data.table_by_ref 1.24ms 1.26ms 76.3MB
Thank you very much for the clarification!
I tried adding tidytable_df[, e := 1]
to the benchmark (see below), and got comparable speed results as with data.table
. Is there any reason to avoid mixing data.table
syntax with tidytable
in this way rather than via piping and dt()
function?
library(data.table)
library(tidytable)
random_strings <- c('OoVt', 'wCbu', 'cXxX', 'jdFu', 'MCRx',
'ukhz', 'ikce', 'PHyu', 'jpBY', 'nLQM')
test_data <- function(.size) {
tidytable(a = sample(1:20, .size, TRUE),
b = sample(1:20, .size, TRUE),
c = sample(random_strings, .size, TRUE),
d = sample(random_strings, .size, TRUE))
}
tidytable_df <- test_data(10000000)
datatable_df <- as.data.table(tidytable_df)
dataframe_df = as.data.frame(tidytable_df)
bench::mark(
tidytable_mutate = tidytable_df |> mutate(e = 1),
tidytable_dt = tidytable_df |> dt(, e := 1),
tidytable_dt_direct = tidytable_df[, e := 1],
data.table_copy = copy(datatable_df)[, e := 1],
data.table_by_ref = datatable_df[, e := 1],
check = FALSE, iterations = 30
) |>
select(expression, min, median, mem_alloc)
#> # A tidytable: 4 × 4
#> expression min median mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <bch:byt>
#> 1 tidytable_mutate 15.36ms 18.79ms 76.3MB
#> 2 tidytable_dt 24.34ms 26.94ms 76.3MB
#> 3 tidytable_dt_direct 5.89ms 6.09ms 76.3MB
#> 4 data.table_copy 79.1ms 88.72ms 305.2MB
#> 5 data.table_by_ref 5.77ms 6ms 76.3MB
When using data.table
the entire workflow depends on you (the user) knowing when you are making deep copies and when you aren't. tidytable
uses shallow copies where necessary and deep copies where necessary to work like dplyr
. So you may see some unexpected issues of modify-by-reference affecting "upstream" datasets in tidytable
. dt()
prevents modify-by-reference without taking a deep copy to help the user not have to think about deep vs. shallow.
Also in general the time savings on a mutate()
call are pretty small. If we take the above example and adjust it to e = a + b
(a calculated field) instead of e = 1
(a length 1 literal) the time savings of modify-by-reference virtually disappear.
library(data.table)
library(tidytable)
random_strings <- c('OoVt', 'wCbu', 'cXxX', 'jdFu', 'MCRx',
'ukhz', 'ikce', 'PHyu', 'jpBY', 'nLQM')
test_data <- function(.size) {
tidytable(a = sample(1:20, .size, TRUE),
b = sample(1:20, .size, TRUE),
c = sample(random_strings, .size, TRUE),
d = sample(random_strings, .size, TRUE))
}
tidytable_df <- test_data(10000000)
datatable_df <- as.data.table(tidytable_df)
bench::mark(
tidytable_mutate = tidytable_df |> mutate(e = a + b),
tidytable_dt = tidytable_df |> dt(, e := a + b),
data.table_copy = copy(datatable_df)[, e := a + b],
data.table_by_ref = datatable_df[, e := a + b],
check = FALSE, iterations = 30
) |>
select(expression, min, median, mem_alloc)
#> # A tidytable: 4 × 4
#> expression min median mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <bch:byt>
#> 1 tidytable_mutate 24.8ms 25.1ms 40.9MB
#> 2 tidytable_dt 24.6ms 25.1ms 38.3MB
#> 3 data.table_copy 44.6ms 45.6ms 267.1MB
#> 4 data.table_by_ref 24.3ms 24.8ms 38.2MB
I have a function taking in a data.table and modifying it in place. However, when I pass a tidytable to the function, it is not modified and I receive the following warning message:
"Warning message: In `[.data.table`(d1, , `:=`(a, 1)) : Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved."
However, when passing a data.table, all works as intended and the data.table is modified in place.
See below: