tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

filter error on data with lubridate intervals #475

Closed by jakeybob 3 months ago

jakeybob commented 3 months ago

Hi -- I've encountered an issue where dtplyr seems to fail when filtering data that has a lubridate::interval() column. I first saw this on a tibble of ~50 columns of various data types (including several lubridate date/time types), and dropping the single interval() column fixed it, so the problem does seem to be specific to interval data.

I've submitted it here (rather than as a lubridate issue) because the error occurs even when the filter condition only involves another column (here an integer column) and never touches the interval column.

It's easy enough to work around, but figured I'd raise an issue as the behaviour seems unexpected. Any thoughts appreciated! :smiley:

library(dplyr)
library(dtplyr)
library(lubridate)

# dummy data
df <- tibble(a = 1:3) |> 
  mutate(interval = interval(start = ymd("2024-01-01") - days(a), end = ymd("2024-01-01"))) 

# expected filter result using dplyr
df |> 
  filter(a == max(a))

# dtplyr filter result throws error
df |> 
  dtplyr::lazy_dt() |> 
  filter(a == max(a))

# dtplyr filter result (also throws error -- so nothing to do with max())
df |> 
  dtplyr::lazy_dt() |> 
  filter(a == 3)

# Error in `[<-`:
# ! Assigned data `map(.subset(x, unname), vectbl_set_names, NULL)` must be compatible with existing
#   data.
# ✖ Existing data has 1 row.
# ✖ Element 2 of assigned data has 3 rows.
# ℹ Row updates require a list value. Do you need `list()` or `as.list()`?
# Caused by error in `vectbl_recycle_rhs_rows()`:
# ! Can't recycle input of size 3 to size 1.

# dtplyr filter works when dropping lubridate::interval col
df |> 
  select(-interval) |> 
  dtplyr::lazy_dt() |> 
  filter(a == max(a))
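
A possible variant of the drop-the-column workaround, for cases where the interval is still needed after filtering: store the interval's endpoints as two plain date-time columns before calling lazy_dt(), then rebuild the interval once the result is back in a tibble. This is only a sketch on the dummy data; the helper column names interval_start and interval_end are made up for illustration.

```r
library(dplyr)
library(dtplyr)
library(lubridate)

# same dummy data as above
df <- tibble(a = 1:3) |>
  mutate(interval = interval(start = ymd("2024-01-01") - days(a), end = ymd("2024-01-01")))

res <- df |>
  # keep the endpoints as ordinary POSIXct columns (hypothetical names)
  mutate(
    interval_start = int_start(interval),
    interval_end = int_end(interval)
  ) |>
  # drop the Interval column before handing the data to data.table
  select(-interval) |>
  lazy_dt() |>
  filter(a == max(a)) |>
  # collect back into a tibble, then rebuild the interval
  as_tibble() |>
  mutate(interval = interval(interval_start, interval_end)) |>
  select(-interval_start, -interval_end)

res
```

The idea is that the Interval column never reaches data.table, so the filter only ever sees column types it handles.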
sessionInfo()
─ Session info────────────────────────────────────────
 setting  value
 version  R version 4.4.0 (2024-04-24)
 os       macOS Sonoma 14.5
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/London
 date     2024-07-15
 pandoc   2.12 @ /Users/xxx/opt/anaconda3/bin/pandoc
─ Packages───────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.4.0)
 data.table    1.15.4  2024-03-30 [1] CRAN (R 4.4.0)
 dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
 dtplyr      * 1.3.1   2023-03-22 [1] CRAN (R 4.4.0)
 fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
 glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
 lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.4.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.4.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
 tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
 tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
 timechange    0.3.0   2024-01-18 [1] CRAN (R 4.4.0)
 utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
 vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
 withr         3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
eutwt commented 3 months ago

lubridate Period and Interval objects, and similar "multi-column" structures, are not supported by data.table, as described in https://github.com/Rdatatable/data.table/issues/4415. I don't think there's anything we can do on the dtplyr end.

Notice the length of the "start" slot when subsetting a data frame (length 1) vs when subsetting a data.table (still length 3, even though ".Data" has length 1). Subsetting the data.table itself (rather than just a column) produces an error.

suppressPackageStartupMessages({
library(lubridate)
library(data.table)
library(dplyr)
})

df <- tibble(a = 1:3) |> 
  mutate(interval = interval(start = ymd("2024-01-01") - days(a), end = ymd("2024-01-01"))) 
dt <- as.data.table(df)

str(df[3, 'interval', drop = TRUE])
#> Formal class 'Interval' [package "lubridate"] with 3 slots
#>   ..@ .Data: num 259200
#>   ..@ start: POSIXct[1:1], format: "2023-12-29"
#>   ..@ tzone: chr "UTC"
str(dt[3, interval])
#> Formal class 'Interval' [package "lubridate"] with 3 slots
#>   ..@ .Data: num 259200
#>   ..@ start: POSIXct[1:3], format: "2023-12-31" "2023-12-30" ...
#>   ..@ tzone: chr "UTC"
dt[3]
#> Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent

Created on 2024-07-21 with reprex v2.0.2
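
Since the str() output above shows Interval is a formal (S4) class, one rough way to spot such columns before converting is to check each column with isS4(). This is a heuristic sketch, not something dtplyr or data.table provides: the helper name s4_columns is hypothetical, and not every S4 column is necessarily a problem for data.table.

```r
library(tibble)
library(lubridate)

# hypothetical helper: names of the columns that are S4 objects
s4_columns <- function(data) {
  names(data)[vapply(data, isS4, logical(1))]
}

df <- tibble(
  a = 1:3,
  interval = interval(start = ymd("2024-01-01") - days(a), end = ymd("2024-01-01"))
)

# columns worth dropping or converting before as.data.table() / lazy_dt()
s4_columns(df)
```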

jakeybob commented 3 months ago

OK, thanks, appreciate the reply -- I wasn't aware of the underlying workings and multi-col structures etc; good to know!