tidyverse / tibble

A modern re-imagining of the data frame
https://tibble.tidyverse.org/
Other
682 stars 130 forks source link

Issue with memory allocations when using as_tibble() #1605

Open danwwilson opened 1 month ago

danwwilson commented 1 month ago

I am writing a simple function to return the first row of a dataset using data.table. I don't want to assume all inputs will be a data.table so was conducting some testing with tibbles and came across the following bug.

When using as_tibble() (from ‘3.2.1’) to define a tibble I assume it must keep some reference to the original data.frame. The reprex below shows some strange reordering that happens when the original data.frame is modified and then a function is called on the tibble created from as_tibble().

I am trying to understand why this might occur. I would have thought that tbl <- as_tibble(df) would have assigned an item to a new memory location and no longer referenced df. If we change the order of execution to run `select_first(tbl, keyby = "c") before any others it returns the desired output.

library(data.table)
library(tibble)

select_first <- function(dt, by, keyby) {
  setDT(dt)

  if (!missing(keyby)) {
    data.table::setkeyv(dt, keyby)
  }

  if (missing(by)) {
    return(dt[1])
  }

  dt[dt[, .I[1], by]$V1]
}

df <- data.frame(
  a = 1:5,
  b = LETTERS[1:5],
  c = 5:1,
  d = c("a", "a", "b", "b", "b")
)
df
#   a b c d
# 1 1 A 5 a
# 2 2 B 4 a
# 3 3 C 3 b
# 4 4 D 2 b
# 5 5 E 1 b

tbl <- tibble::as_tibble(df)
tbl
# # A tibble: 5 × 4
#       a b         c d
#   <int> <chr> <int> <chr>
# 1     1 A         5 a
# 2     2 B         4 a
# 3     3 C         3 b
# 4     4 D         2 b
# 5     5 E         1 b

dt <- data.table::as.data.table(df)
dt
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     1      A     5      a
# 2:     2      B     4      a
# 3:     3      C     3      b
# 4:     4      D     2      b
# 5:     5      E     1      b

df # format remains the same as when first created
#   a b c d
# 1 1 A 5 a
# 2 2 B 4 a
# 3 3 C 3 b
# 4 4 D 2 b
# 5 5 E 1 b

select_first(df, keyby = "c") # result as expected
# Key: <c>
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     5      E     1      b

tbl # order of columns `b` and `d` have changed
# A tibble: 5 × 4
#       a b         c d
#   <int> <chr> <int> <chr>
# 1     1 E         5 b
# 2     2 D         4 b
# 3     3 C         3 b
# 4     4 B         2 a
# 5     5 A         1 a

select_first(tbl, keyby = "c") # results are not as expected
# Key: <c>
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     5      A     1      a

dt # format remains the same as when first created
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     1      A     5      a
# 2:     2      B     4      a
# 3:     3      C     3      b
# 4:     4      D     2      b
# 5:     5      E     1      b

select_first(dt, keyby = "c") # result as expected
# Key: <c>
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     5      E     1      b
TimTaylor commented 2 weeks ago

@danwwilson - this is not to do with tibble but with the fact that you should avoid using setDT programmatically (i.e use as.data.table() instead) or exercise great care in non-interactive settings. To illustrate:

as_bob <- function(x) {
    x <- as.data.frame(x)
    class(x) <- c("bob", class(x))
    x
}

df <- data.frame(
    a = 1:5,
    b = LETTERS[1:5],
    c = 5:1,
    d = c("a", "a", "b", "b", "b")
)

(bob <- as_bob(df))
#>   a b c d
#> 1 1 A 5 a
#> 2 2 B 4 a
#> 3 3 C 3 b
#> 4 4 D 2 b
#> 5 5 E 1 b
select_first(df, keyby = "c")
#> Key: <c>
#>        a      b     c      d
#>    <int> <char> <int> <char>
#> 1:     5      E     1      b
bob
#>   a b c d
#> 1 1 E 5 b
#> 2 2 D 4 b
#> 3 3 C 3 b
#> 4 4 B 2 a
#> 5 5 A 1 a

The onus would be on data.table to change the behaviour of setDT() (should they wish to) but that would likely remove it's efficient nature. You could make your function force a copy() but in that case you might as well call as.data.table() (at this point the only real difference in overhead would be the method dispatch).

Hope that makes sense.