Closed by fh-afrachioni 6 months ago
Does devtools::install_github("tidyverse/dbplyr#1475") solve your issue? It seems to work for your reprex and my local tests.
That fixes it!
> devtools::install_github("tidyverse/dbplyr#1475")
> # ... as above ...
> join_result
<SQL>
SELECT `df`.*
FROM `df`
WHERE EXISTS (
SELECT 1 FROM (
SELECT `a`
FROM `df`
GROUP BY `a`
HAVING (COUNT(*) > 3.0)
) AS `RHS`
WHERE (`df`.`a` = `RHS`.`a`)
)
I really appreciate the quick update here, and am hopeful this will make it into the upcoming release!
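For anyone who wants to see what the HAVING clause changes without installing the PR, here is a minimal sketch that runs the SQL above directly through DBI against an in-memory SQLite database. The contents of `df` are hypothetical (the original reprex data is not shown in this thread), and the second query is only my assumption of what a translation that drops HAVING would look like.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical data: group "x" has 4 rows, group "y" has 2.
df <- data.frame(a = c("x", "x", "x", "x", "y", "y"), b = 1:6)
dbWriteTable(con, "df", df)

# The semi-join query as generated with the PR installed (copied from above).
with_having <- dbGetQuery(con, "
  SELECT `df`.*
  FROM `df`
  WHERE EXISTS (
    SELECT 1 FROM (
      SELECT `a`
      FROM `df`
      GROUP BY `a`
      HAVING (COUNT(*) > 3.0)
    ) AS `RHS`
    WHERE (`df`.`a` = `RHS`.`a`)
  )
")

# Assumed shape of the faulty query: same statement with HAVING removed.
without_having <- dbGetQuery(con, "
  SELECT `df`.*
  FROM `df`
  WHERE EXISTS (
    SELECT 1 FROM (
      SELECT `a`
      FROM `df`
      GROUP BY `a`
    ) AS `RHS`
    WHERE (`df`.`a` = `RHS`.`a`)
  )
")

nrow(with_having)    # 4: only the rows of group "x"
nrow(without_having) # 6: every row, because every group matches

dbDisconnect(con)
```

With the HAVING clause in place only groups with more than three rows survive the semi-join; without it the EXISTS condition matches every row of `df`.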
Adding that (1) the behavior isn't backend-specific and (2) the impact is that semi_join() and inner_join() can have alarmingly different results:
library(dplyr, warn.conflicts = FALSE)
table <- tibble::tibble(a = c("x", "y"), b = c(1, 2))
con <- DBI::dbConnect(RSQLite::SQLite())
copy_to(con, table, "table")
tbl_lazy <- tbl(con, "table")
# only `y` meets condition:
filtered <- tbl_lazy %>%
  group_by(a) %>%
  summarize(include = all(b > 1)) %>%
  filter(include) %>%
  select(a)
# correct: only `y` returned
tbl_lazy %>% inner_join(filtered, by = "a")
#> # Source: SQL [1 x 2]
#> # Database: sqlite 3.45.0 []
#> a b
#> <chr> <dbl>
#> 1 y 2
# incorrect: both returned
tbl_lazy %>% semi_join(filtered, by = "a")
#> # Source: SQL [2 x 2]
#> # Database: sqlite 3.45.0 []
#> a b
#> <chr> <dbl>
#> 1 x 1
#> 2 y 2
Created on 2024-03-15 with reprex v2.1.0
Confirmed that the PR resolves the issue:
tbl_lazy %>% inner_join(filtered, by = "a")
#> # Source: SQL [1 x 2]
#> # Database: sqlite 3.45.0 []
#> a b
#> <chr> <dbl>
#> 1 y 2
# now correct: only `y` returned
tbl_lazy %>% semi_join(filtered, by = "a")
#> # Source: SQL [1 x 2]
#> # Database: sqlite 3.45.0 []
#> a b
#> <chr> <dbl>
#> 1 y 2
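If it helps with testing, the comparison above could be turned into a small regression check. This is only a sketch that reuses the reprex from this thread; the names `inner_keys`, `semi_keys`, and `same_keys` are mine, and it relies on the assumption that, for this data, a semi-join and an inner join on `a` should keep the same set of keys.

```r
library(dplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, tibble::tibble(a = c("x", "y"), b = c(1, 2)), "table")
tbl_lazy <- tbl(con, "table")

# Same filtered table as in the reprex: only `y` meets the condition.
filtered <- tbl_lazy %>%
  group_by(a) %>%
  summarize(include = all(b > 1)) %>%
  filter(include) %>%
  select(a)

# Pull the join keys into R and normalise them for comparison.
inner_keys <- tbl_lazy %>%
  inner_join(filtered, by = "a") %>%
  pull(a) %>%
  unique() %>%
  sort()

semi_keys <- tbl_lazy %>%
  semi_join(filtered, by = "a") %>%
  pull(a) %>%
  unique() %>%
  sort()

# TRUE when both joins keep the same keys.
same_keys <- identical(inner_keys, semi_keys)
same_keys

DBI::dbDisconnect(con)
```

Going by the outputs posted above, I would expect `same_keys` to be FALSE on the 2.4.0 release and TRUE with the PR installed, but I have only checked that against the results in this thread.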
@hadley @mgirlich would it be possible to include #1475 in the next release?
Greetings, dbplyr friends. I'd like to report an issue we're experiencing with Snowflake translations, related to filter() criteria dropped from lazy tables when semi_join()ed. This seems specifically limited to filter() applied to columns that result from summarize() and are not selected for inclusion in the result.
Here's a small example reproducing the behavior:
If we inspect the query produced for sim_2_transformed, it looks correct:
However, the join drops the HAVING criterion, leading to an incorrect result; join_result is
Notably, if I remove select(a) from the definition of sim_2_transformed, the HAVING clause is included as expected:
Thanks for the continuing support of Snowflake backends; I'm happy to help with any testing that might be valuable.
Release version of dbplyr, 2.4.0. cc @fh-mthomson