markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
450 stars 32 forks source link

Invalid internal reference when modifying in place within a function #823

Closed Kimozu closed 2 months ago

Kimozu commented 2 months ago

I have a function taking in a data.table and modifying it in place. However, when I pass a tidytable to the function, it is not modified and I receive the following warning message:

"Warning message: In `[.data.table`(d1, , `:=`(a, 1)) : Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved."

However, when passing a data.table, all works as intended and the data.table is modified in place.

See below:

library(tidytable)

f = function(d1) {
  d1[, a := 1]
}

data1 <- tidytable(id = 1:100)

#Here I receive the warning:
f(data1)

#The column a has not been added
"a" %in% names(data1) #FALSE

data1 <- data.table(id = 1:100)

f(data1)

#The column a is added when using data.table
"a" %in% names(data1) #TRUE
markfairbanks commented 2 months ago

tidytable is not designed to work with modify-by-reference like data.table does. It instead utilizes shallow copies like dplyr does when possible.

Though tidytable will be slower than using modify-by-reference to add columns it isn't nearly as expensive as using copy() on a dataset in normal data.table.

Hope this helps - if you have any questions let me know.

library(data.table)
library(tidytable)

random_strings <- c('OoVt', 'wCbu', 'cXxX', 'jdFu', 'MCRx',
                    'ukhz', 'ikce', 'PHyu', 'jpBY', 'nLQM')

test_data <- function(.size) {
  tidytable(a = sample(1:20, .size, TRUE),
            b = sample(1:20, .size, TRUE),
            c = sample(random_strings, .size, TRUE),
            d = sample(random_strings, .size, TRUE))
}

tidytable_df <- test_data(10000000)
datatable_df <- as.data.table(tidytable_df)

bench::mark(
  tidytable_mutate = tidytable_df |> mutate(e = 1),
  tidytable_dt = tidytable_df |> dt(, e := 1),
  data.table_copy = copy(datatable_df)[, e := 1],
  data.table_by_ref = datatable_df[, e := 1],
  check = FALSE, iterations = 30
) |>
  select(expression, min, median, mem_alloc)
#> # A tidytable: 4 × 4
#>   expression             min   median mem_alloc
#>   <bch:expr>        <bch:tm> <bch:tm> <bch:byt>
#> 1 tidytable_mutate    6.22ms   6.95ms    78.9MB
#> 2 tidytable_dt        6.38ms   6.56ms    76.5MB
#> 3 data.table_copy    23.58ms  25.62ms   305.2MB
#> 4 data.table_by_ref   1.24ms   1.26ms    76.3MB
Kimozu commented 2 months ago

Thank you very much for the clarification!

I tried adding tidytable_df[, e := 1] to the benchmark (see below), and got comparable speed results as with data.table. Is there any reason to avoid mixing data.table syntax with tidytable in this way rather than via piping and dt() function?

library(data.table)
library(tidytable)

random_strings <- c('OoVt', 'wCbu', 'cXxX', 'jdFu', 'MCRx',
                    'ukhz', 'ikce', 'PHyu', 'jpBY', 'nLQM')

test_data <- function(.size) {
  tidytable(a = sample(1:20, .size, TRUE),
            b = sample(1:20, .size, TRUE),
            c = sample(random_strings, .size, TRUE),
            d = sample(random_strings, .size, TRUE))
}

tidytable_df <- test_data(10000000)
datatable_df <- as.data.table(tidytable_df)
dataframe_df = as.data.frame(tidytable_df)

bench::mark(
  tidytable_mutate = tidytable_df |> mutate(e = 1),
  tidytable_dt = tidytable_df |> dt(, e := 1),
  tidytable_dt_direct = tidytable_df[, e := 1],
  data.table_copy = copy(datatable_df)[, e := 1],
  data.table_by_ref = datatable_df[, e := 1],
  check = FALSE, iterations = 30
) |>
  select(expression, min, median, mem_alloc)
#> # A tidytable: 4 × 4
#>   expression             min   median mem_alloc
#>   <bch:expr>        <bch:tm> <bch:tm> <bch:byt>
#> 1 tidytable_mutate     15.36ms  18.79ms    76.3MB
#> 2 tidytable_dt         24.34ms  26.94ms    76.3MB
#> 3 tidytable_dt_direct   5.89ms   6.09ms    76.3MB
#> 4 data.table_copy       79.1ms  88.72ms   305.2MB
#> 5 data.table_by_ref     5.77ms      6ms    76.3MB
markfairbanks commented 2 months ago

When using data.table the entire workflow depends on you (the user) knowing when you are making deep copies and when you aren't. tidytable uses shallow copies where necessary and deep copies where necessary to work like dplyr. So you may see some unexpected issues of modify-by-reference affecting "upstream" datasets in tidytable. dt() prevents modify-by-reference without taking a deep copy to help the user not have to think about deep vs. shallow.

Also in general the time savings on a mutate() call are pretty small. If we take the above example and adjust it to e = a + b (a calculated field) instead of e = 1 (a length 1 literal) the time savings of modify-by-reference virtually disappear.

library(data.table)
library(tidytable)

random_strings <- c('OoVt', 'wCbu', 'cXxX', 'jdFu', 'MCRx',
                    'ukhz', 'ikce', 'PHyu', 'jpBY', 'nLQM')

test_data <- function(.size) {
  tidytable(a = sample(1:20, .size, TRUE),
            b = sample(1:20, .size, TRUE),
            c = sample(random_strings, .size, TRUE),
            d = sample(random_strings, .size, TRUE))
}

tidytable_df <- test_data(10000000)
datatable_df <- as.data.table(tidytable_df)

bench::mark(
  tidytable_mutate = tidytable_df |> mutate(e = a + b),
  tidytable_dt = tidytable_df |> dt(, e := a + b),
  data.table_copy = copy(datatable_df)[, e := a + b],
  data.table_by_ref = datatable_df[, e := a + b],
  check = FALSE, iterations = 30
) |>
  select(expression, min, median, mem_alloc)
#> # A tidytable: 4 × 4
#>   expression             min   median mem_alloc
#>   <bch:expr>        <bch:tm> <bch:tm> <bch:byt>
#> 1 tidytable_mutate    24.8ms   25.1ms    40.9MB
#> 2 tidytable_dt        24.6ms   25.1ms    38.3MB
#> 3 data.table_copy     44.6ms   45.6ms   267.1MB
#> 4 data.table_by_ref   24.3ms   24.8ms    38.2MB