tidyverse / dtplyr

Data table backend for dplyr
https://dtplyr.tidyverse.org

Implementation of bind_rows #437

Closed aarongraybill closed 1 year ago

aarongraybill commented 1 year ago

I am trying to duplicate all of the rows of a dataset, but I believe there is no way for me to do this in dtplyr.

In dplyr I could write:

library(dplyr)

df <- cars
bind_rows(df, df)

However, running the same on a dtplyr object gives:

library(dtplyr)

df <- lazy_dt(cars)
bind_rows(df, df)
Error in `bind_rows()`:
! Argument 1 must be a data frame or a named atomic vector.
Run `rlang::last_trace()` to see where the error occurred.

The best I could come up with are these two hacky workarounds:

bind_rows(collect(df), collect(df)) %>% lazy_dt()
full_join(df %>% mutate(temp = 1), df %>% mutate(temp = 2)) %>% select(-temp)

I would love a dtplyr-native implementation of bind_rows, but I recognize this might be much more complicated when someone is trying to use bind_rows for two non-identical lazy_dts. Thanks!

markfairbanks commented 1 year ago

Unfortunately, the short answer is that bind_rows() can't be implemented in dtplyr.

As for your specific case of duplicating your data frame, this workaround should work. (Let me know if it doesn't - my laptop died over the weekend and I'm still in the process of getting a new one!)

lazy_dt(df) %>%
  slice(1:n(), 1:n())

The more in-depth answer is that dtplyr implements S3 methods for dplyr generics (like mutate(), filter(), etc.) that work specifically on "lazy_dt" objects. dplyr::bind_rows() isn't an S3 generic, so no S3 method can be registered for it, and we can't build a dtplyr version that works on lazy_dt objects.
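
To make the generic vs. non-generic distinction concrete, here is a minimal base R sketch. The names my_verb, my_bind, and the class "toy_lazy" are hypothetical stand-ins invented for illustration. They are not part of dplyr or dtplyr, and the sketch does not reproduce how bind_rows() is actually written. It only shows why a method can be added for one kind of function and not the other.

```r
# Hypothetical names throughout: my_verb, my_bind, and "toy_lazy" are
# illustrative only and do not exist in dplyr or dtplyr.

# An S3 generic: UseMethod() dispatches on the class of the first argument.
my_verb <- function(x, ...) UseMethod("my_verb")
my_verb.default <- function(x, ...) "default method"

# A plain function: it never calls UseMethod(), so it cannot dispatch.
my_bind <- function(...) {
  inputs <- list(...)
  ok <- vapply(inputs, is.data.frame, logical(1))
  if (!all(ok)) stop("every argument must be a data frame")
  do.call(rbind, inputs)
}

# A toy object that wraps a data frame but is not itself a data frame.
obj <- structure(list(parent = cars), class = "toy_lazy")

# Another package can extend the generic by defining a method for its class.
my_verb.toy_lazy <- function(x, ...) "toy_lazy method"
verb_result <- my_verb(obj)  # dispatches to my_verb.toy_lazy

# Defining my_bind.toy_lazy changes nothing: my_bind() never looks for it.
my_bind.toy_lazy <- function(...) "never called"
bind_result <- tryCatch(
  my_bind(obj, obj),
  error = function(e) conditionMessage(e)
)

print(verb_result)
print(bind_result)
```

In this sketch the my_verb() call reaches the "toy_lazy" method, while my_bind() rejects the same object in its own input check, which is analogous to the error in the original report.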

I also remember seeing a discussion at some point about turning bind_rows() into a generic, and the conclusion was that it wasn't possible (I can't remember the exact reasons).

If you have any questions let me know!