tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
474 stars 173 forks source link

Snowflake translation error: dropped filter with `anti_join` #1474

Closed fh-afrachioni closed 6 months ago

fh-afrachioni commented 7 months ago

Greetings, dbplyr friends. I'd like to report an issue we're experiencing with Snowflake translations, related to filter() criteria dropped from lazy tables when semi_join()ed. This seems specifically limited to filter() applied to columns which result from summarize(), that are not selected for inclusion in the result.

Here's a small example, reproducing the behavior:

sim_1 <- dbplyr::lazy_frame(a = 5, b = 5, con = dbplyr::simulate_snowflake())
sim_2 <- dbplyr::lazy_frame(a = 5, b = 5, con = dbplyr::simulate_snowflake())

sim_2_transformed <- sim_2 %>% 
  group_by(a) %>% 
  summarize(group_count = n()) %>% 
  filter(group_count > 3) %>% 
  select(a)

join_result <- semi_join(sim_1, sim_2_transformed, by = 'a')

If we inspect the query produced for sim_2_transformed, it looks correct:

> sim_2_transformed
<SQL>
SELECT `a`
FROM `df`
GROUP BY `a`
HAVING (COUNT(*) > 3.0)

However, the join drops the HAVING criterion leading to an incorrect result; join_result is

> join_result
<SQL>
SELECT `df_LHS`.*
FROM `df` AS `df_LHS`
WHERE EXISTS (
  SELECT 1 FROM `df` AS `df_RHS`
  WHERE (`df_LHS`.`a` = `df_RHS`.`a`)
)

Notably, if I remove select(a) from the definition of sim_2_transformed, the HAVING clause is included as expected:

> join_result_without_final_select
<SQL>
SELECT `df`.*
FROM `df`
WHERE EXISTS (
  SELECT 1 FROM (
  SELECT `a`, COUNT(*) AS `group_count`
  FROM `df`
  GROUP BY `a`
  HAVING (COUNT(*) > 3.0)
) AS `RHS`
  WHERE (`df`.`a` = `RHS`.`a`)
)

Thanks for the continuing support of Snowflake backends; I'm happy to help with any testing that might be valuable.

Release version of dbplyr, 2.4.0. cc @fh-mthomson

ejneer commented 7 months ago

Does devtools::install_github("tidyverse/dbplyr#1475") solve your issue? It seems to work for your reprex / my local tests.

fh-afrachioni commented 7 months ago

That fixes it!

> devtools::install_github("tidyverse/dbplyr#1475")
> # ... as above ...
> join_result
<SQL>
SELECT `df`.*
FROM `df`
WHERE EXISTS (
  SELECT 1 FROM (
  SELECT `a`
  FROM `df`
  GROUP BY `a`
  HAVING (COUNT(*) > 3.0)
) AS `RHS`
  WHERE (`df`.`a` = `RHS`.`a`)
)

I really appreciate the quick update here, and am hopeful this will make it into the upcoming release!

fh-mthomson commented 6 months ago

Adding that (1) the behavior isn't backend-specific and (2) the impact is that semi_join() and inner_join() can have alarmingly-different results:

library(dplyr, warn.conflicts = FALSE)

table <- tibble::tibble(a = c("x", "y"), b = c(1, 2))

con <- DBI::dbConnect(RSQLite::SQLite())
copy_to(con, table, "table")
tbl_lazy <- tbl(con, "table")

# only `y` meets condition:
filtered <- tbl_lazy %>% 
  group_by(a) %>% 
  summarize(include = all(b > 1)) %>% 
  filter(include) %>% 
  select(a)

# correct: only `y` returned
tbl_lazy %>% inner_join(filtered, by = "a")
#> # Source:   SQL [1 x 2]
#> # Database: sqlite 3.45.0 []
#>   a         b
#>   <chr> <dbl>
#> 1 y         2

# incorrect: both returned
tbl_lazy %>% semi_join(filtered, by = "a")
#> # Source:   SQL [2 x 2]
#> # Database: sqlite 3.45.0 []
#>   a         b
#>   <chr> <dbl>
#> 1 x         1
#> 2 y         2

Created on 2024-03-15 with reprex v2.1.0

Confirmed that the PR resolves the issue:

tbl_lazy %>% inner_join(filtered, by = "a")
#> # Source:   SQL [1 x 2]
#> # Database: sqlite 3.45.0 []
#>   a         b
#>   <chr> <dbl>
#> 1 y         2

# incorrect: both returned
tbl_lazy %>% semi_join(filtered, by = "a")
#> # Source:   SQL [1 x 2]
#> # Database: sqlite 3.45.0 []
#>   a         b
#>   <chr> <dbl>
#> 1 y         2

@hadley @mgirlich would it be possible to include #1475 in the next release?