tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
469 stars 169 forks source link

Filter by a column in another table #1476

Closed Jiefei-Wang closed 3 months ago

Jiefei-Wang commented 3 months ago

This is an amazing package that helps me a lot in my work! I'm not sure if this is just a limitation in my backend database(I'm using Oracle SQL) or a feature that hasn't been developed. Suppose I have a database with two tables A and B where both tables have an id column, what I want to do is to filter the rows in table A where the id is presented in B. Here is a minimum example in dplyr:

library(dplyr)
A <- data.frame(id=1:5, value1 = runif(5))
B <- data.frame(id=c(3, 3, 5))
A |> filter(id %in% B$id)

This, however, cannot be easily converted to dbplyr if both A and B are tables in a database. Subsetting the object B by column name will give a NULL value(e.g. B$id = NULL) and therefore lead to an error. As a workaround, I have to preprocess the table B to create distinct ids, then perform an inner join. It is still fine, but not that elegant. I wonder if there is any better way to do the filter.

fh-mthomson commented 3 months ago

Would the following work (translating A and B to tables on Oracle)?

library(dplyr)
A <- data.frame(id=1:5, value1 = runif(5))
B <- data.frame(id=c(3, 3, 5))

A %>% 
  semi_join(B, by = "id")

As you mentioned, using x$id to identify a column is intentionally not supported for lazy tibbles, per the error message:

table <- dbplyr::lazy_frame(i = 1, con = dbplyr::simulate_dbi())
table$id
#> Error in `table$id`:
#> ! The `$` method of <tbl_lazy> is for internal use only.
#> ℹ Use `dplyr::pull()` to get the values in a column.

Created on 2024-03-15 with reprex v2.1.0

hadley commented 3 months ago

Yeah, a semi join is what you want here. Thanks for the explanation @fh-mthomson!

Jiefei-Wang commented 3 months ago

Hi @fh-mthomson , Thanks for your great explanation! I'm just curious about the reason that the $ operation is not supported. I understand collecting a column from the table is costly, but is it possible to return a delayed vector? What is the major concern about it? Thanks.

fh-mthomson commented 3 months ago

The x object is a "lazy tibble" which is constructured as a list object, rather than a data frame object. Technically, x$column_name only makes sense for data frames, but not the lazy version. However, you can still readily the values via dplyr::pull(table, id).

See more technical detail in #1056

The same can be said for $ -- even if it doesn't return a column, I would guess that's what the vast majority of people who wind up using $ on a tbl_lazy object will have intended.

hadley commented 3 months ago

To add to @fh-mthomson's explanation, I don't think there's an obvious way to translate id %in% B$id to SQL unless you did something like id %in% ('a', 'b', 'c', ...). And that's unnecessarily expensive because you'd have to download that entire vector into R and convert it to SQL and then send it back to the database. With the semi join, all the data stays in one place (the database).