Closed Jiefei-Wang closed 3 months ago
Would the following work (translating A
and B
to tables on Oracle)?
library(dplyr)
A <- data.frame(id=1:5, value1 = runif(5))
B <- data.frame(id=c(3, 3, 5))
A %>%
semi_join(B, by = "id")
As you mentioned, using x$id
to identify a column is intentionally not supported for lazy tibbles, per the error message:
table <- dbplyr::lazy_frame(i = 1, con = dbplyr::simulate_dbi())
table$id
#> Error in `table$id`:
#> ! The `$` method of <tbl_lazy> is for internal use only.
#> ℹ Use `dplyr::pull()` to get the values in a column.
Created on 2024-03-15 with reprex v2.1.0
Yeah, a semi join is what you want here. Thanks for the explanation @fh-mthomson!
Hi @fh-mthomson , Thanks for your great explanation! I'm just curious about the reason that the $
operation is not supported. I understand collecting a column from the table is costly, but is it possible to return a delayed vector? What is the major concern about it? Thanks.
The x
object is a "lazy tibble" which is constructured as a list object, rather than a data frame object. Technically, x$column_name
only makes sense for data frames, but not the lazy version. However, you can still readily the values via dplyr::pull(table, id)
.
See more technical detail in #1056
The same can be said for $ -- even if it doesn't return a column, I would guess that's what the vast majority of people who wind up using $ on a tbl_lazy object will have intended.
To add to @fh-mthomson's explanation, I don't think there's an obvious way to translate id %in% B$id
to SQL unless you did something like id %in% ('a', 'b', 'c', ...)
. And that's unnecessarily expensive because you'd have to download that entire vector into R and convert it to SQL and then send it back to the database. With the semi join, all the data stays in one place (the database).
This is an amazing package that helps me a lot in my work! I'm not sure if this is just a limitation in my backend database(I'm using Oracle SQL) or a feature that hasn't been developed. Suppose I have a database with two tables
A
andB
where both tables have anid
column, what I want to do is to filter the rows in tableA
where the id is presented inB
. Here is a minimum example in dplyr:This, however, cannot be easily converted to
dbplyr
if bothA
andB
are tables in a database. Subsetting the objectB
by column name will give aNULL
value(e.g.B$id = NULL
) and therefore lead to an error. As a workaround, I have to preprocess the tableB
to create distinctid
s, then perform an inner join. It is still fine, but not that elegant. I wonder if there is any better way to do the filter.