tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

rle() returns NULL #434

Closed camnesia closed 1 year ago

camnesia commented 1 year ago

I recently updated the dtplyr package, and the code snippet below, which uses rle(), no longer works. The val column should contain 19 and 4, but both values are NULL instead.

library(dplyr)
library(tidyr)
library(dtplyr)

data <- tibble(name = c('a','b'),
             string = c('0000000000000000000','0000hu000000')) %>%
  lazy_dt() %>%
  mutate(val = sapply(string, function(x) rle(strsplit(x, '')[[1]])$lengths[1])) %>%
  collect()

markfairbanks commented 1 year ago

Root cause: rle() returns an rle object (more or less a list) with elements lengths and values. Because lengths() is also a function in the base environment, dt_squash() prepends .. to the name. It assumes lengths is a variable in the global environment rather than an element to be extracted from the rle object.
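To make the name clash concrete, here is a minimal base R sketch with no dtplyr involved (the variable name runs is only for illustration). It shows that lengths is both an element of the rle object and a function in base, and that a naive walk over the expression reports it as a free variable.

```r
# rle() returns a list-like object with two elements: lengths and values.
runs <- rle(strsplit("0000hu000000", "")[[1]])
names(unclass(runs))  # "lengths" "values"
runs$lengths[1]       # 4, the length of the first run of identical characters

# `lengths` is also a function in base R, so a translator that asks
# "does this name exist outside the data?" gets TRUE.
exists("lengths", envir = baseenv(), inherits = FALSE)
is.function(base::lengths)

# A naive symbol walk does not distinguish the right-hand side of `$`
# from an ordinary variable reference.
all.vars(quote(runs$lengths))  # "runs" "lengths"
```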

Another example: say we want to add a column from another data frame to our lazy_dt:

library(dplyr)
library(dtplyr)

df <- tibble(length = 1)

tibble(x = 1:3, y = c("a", "a", "b")) %>%
  lazy_dt() %>%
  mutate(new = df$length)
#> Warning: Unknown or uninitialised column: `..length`.
#> Warning in `[.data.table`(copy(`_DT1`), , `:=`(new = ..df$..length)): Column
#> 'new' does not exist to remove
#> Source: local data table [3 x 2]
#> Call:   copy(`_DT1`)[, `:=`(new = ..df$..length)]
#> 
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 a    
#> 3     3 b    
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
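
One way to picture the distinction the translation is missing is the sketch below. It is a hypothetical helper, not dtplyr's implementation: prefix_free_symbols() and its columns argument are made up for illustration, and it simplifies by prefixing every non-column symbol without checking any environment. It also does not handle function definitions or calls with missing arguments. The point is only that the name after `$` is left as written.

```r
# Hypothetical helper, not dtplyr's implementation: prefix every symbol
# that is not a column with "..", but leave the name after `$` untouched.
prefix_free_symbols <- function(expr, columns) {
  if (is.symbol(expr)) {
    name <- as.character(expr)
    if (name %in% columns) expr else as.symbol(paste0("..", name))
  } else if (is.call(expr) && identical(expr[[1]], quote(`$`))) {
    # Only the object being subset is a variable reference.
    expr[[2]] <- prefix_free_symbols(expr[[2]], columns)
    expr
  } else if (is.call(expr)) {
    # Keep the function name and recurse into the arguments.
    args <- lapply(as.list(expr)[-1], prefix_free_symbols, columns = columns)
    as.call(c(list(expr[[1]]), args))
  } else {
    expr  # constants pass through unchanged
  }
}

deparse(prefix_free_symbols(quote(df$length), columns = c("x", "y")))
# "..df$length"
deparse(prefix_free_symbols(quote(x + df$length), columns = c("x", "y")))
# "x + ..df$length"
```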

markfairbanks commented 1 year ago

The problem gets even more complicated when the pipeline runs inside a function, since dt_squash_call() then tries to run eval() on the symbols it finds:

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

fn <- function() {
  df <- tibble(length = 1)

  tibble(x = 1:3, y = c("a", "a", "b")) %>%
    lazy_dt() %>%
    mutate(new = df$length)
}

fn()
#> Error in `$`(structure(list(length = 1), class = c("tbl_df", "tbl", "data.frame": invalid subscript type 'builtin'
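
The error message itself can be reproduced in base R. This sketch assumes, based on the message above, that by the time `$` is called the symbol length has already been replaced by the base function it evaluates to. A plain list stands in for the tibble, and bad_call and good_call are illustrative names.

```r
# Build the call `$`(<list>, <function>) directly: call() evaluates its
# arguments, so the second argument is the builtin length() itself,
# not the symbol `length`.
bad_call <- call("$", list(length = 1), length)
typeof(length)  # "builtin"

# `$` only accepts a name or a string as its second argument.
msg <- tryCatch(eval(bad_call), error = function(e) conditionMessage(e))
msg  # "invalid subscript type 'builtin'"

# With the symbol left unevaluated, the extraction works as intended.
good_call <- call("$", list(length = 1), quote(length))
eval(good_call)  # 1
```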