tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
477 stars 173 forks source link

Renamed columns are not being handled correctly in `join_by()` translation #1524

Open francisbarton opened 3 months ago

francisbarton commented 3 months ago

I discovered a strange error in my work, it seems like a bug? It probably looks like a pretty contrived situation in the reprex below, but that is what I had to do to reproduce what I found.

For some reason, the column name y$rhs_val is treated differently when being compared to the right argument of between(), to when compared to the left argument.

In the former case it retains its renamed name, in the latter case it reverts to its original name (VALUE). I would expect the "new" column name to be used identically on both halves of the between step.

library(dplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

tbl_a <- tibble::tribble(
  ~ID, ~LEFT, ~RIGHT,
  "A", 1, 3,
  "B", 2, 4,
  "C", 3, 5
)

tbl_b <- tibble::tribble(
  ~ID, ~VALUE,
  "A", 4,
  "B", 4,
  "C", 4
) 

dplyr::copy_to(con, tbl_a)
dplyr::copy_to(con, tbl_b)

tbl_b2 <- dplyr::tbl(con, "tbl_b") |>
  janitor::clean_names() |>
  dplyr::rename(rhs_val = "value")

dplyr::tbl(con, "tbl_a") |>
  janitor::clean_names() |>
  dplyr::semi_join(
    tbl_b2,
    join_by(
      "id",
      between(y$rhs_val, x$left, x$right)
    )
  )
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! no such column: tbl_b.rhs_val

dplyr::tbl(con, "tbl_a") |>
  janitor::clean_names() |>
  dplyr::semi_join(
    tbl_b2,
    join_by(
      "id",
      between(y$rhs_val, x$left, x$right)
    )
  ) |>
  dplyr::show_query()
#> <SQL>
#> SELECT `ID` AS `id`, `LEFT` AS `left`, `RIGHT` AS `right`
#> FROM `tbl_a`
#> WHERE EXISTS (
#>   SELECT 1 FROM `tbl_b`
#>   WHERE
#>     (`tbl_a`.`ID` = `tbl_b`.`ID`) AND
#>     (`tbl_a`.`LEFT` <= `tbl_b`.`VALUE`) AND
#>     (`tbl_a`.`RIGHT` >= `tbl_b`.`rhs_val`)
#> )

Created on 2024-07-16 with reprex v2.1.1

Needless to say, if you collect() both tables before attempting the semi_join, it works fine. So it seems a dbplyr issue specifically, as the show_query() code seems to confirm.

I have dbplyr v2.5.0 (CRAN)


(PS I was expecting the SQL translation here to use WHERE ... rhs_val BETWEEN LEFT AND RIGHT instead of the two-step comparison using <= and >= - but maybe BETWEEN ... AND ... doesn't work across all SQL implementations?)

davidfoord1 commented 3 months ago

To add on:

#> ! no such column: tbl_b.rhs_val

The rename() behaviour has been dropped by between() - there is no longer a

`VALUE` AS `rhs_val`

As in

show_query(tbl_b2)
#> <SQL>
#> SELECT `ID` AS `id`, `VALUE` AS `rhs_val`
#> FROM `tbl_b`

Trying with column-based joins the rename is in the final SELECT, but not applied to the table being joined to, so you still get the no such column error, as well as the mismatch between left and right.

col_query <- dplyr::tbl(con, "tbl_a") |>
  janitor::clean_names() |>
  dplyr::left_join(  # left instead of semi join
    tbl_b2,
    join_by(
      "id",
      between(y$rhs_val, x$left, x$right)
    )
  )

col_query
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! no such column: tbl_b.rhs_val
#> Run `rlang::last_trace()` to see where the error occurred.

show_query(col_query)
#> <SQL>
#> SELECT
#>   `tbl_a`.`ID` AS `id`,
#>   `LEFT` AS `left`,
#>   `RIGHT` AS `right`,
#>   `VALUE` AS `rhs_val`
#> FROM `tbl_a`
#> LEFT JOIN `tbl_b`
#>   ON (
#>     `tbl_a`.`ID` = `tbl_b`.`ID` AND
#>     `tbl_a`.`LEFT` <= `tbl_b`.`VALUE` AND
#>     `tbl_a`.`RIGHT` >= `tbl_b`.`rhs_val`
#>   )
francisbarton commented 3 months ago

I've just discovered https://dbplyr.tidyverse.org/articles/reprex.html and I'm going to try to supply a tighter, smaller, lower-level reprex of the issue if I can home in on it.

francisbarton commented 3 months ago

between() doesn't seem to be implemented for MSSQL in the same way as it is for several other DBs simulated in dbplyr:

library(dbplyr)
translate_sql(between(x, 2L, 4L), con = simulate_sqlite())
#> <SQL> `x` BETWEEN 2 AND 4
translate_sql(between(x, 2L, 4L), con = simulate_postgres())
#> <SQL> `x` BETWEEN 2 AND 4
translate_sql(between(x, 2L, 4L), con = simulate_mssql())
#> Error in if (context$clause == "WHERE") {: argument is of length zero

Created on 2024-07-17 with reprex v2.1.1

francisbarton commented 3 months ago

Another reprex - sorry, I'm sorting of using this issue as a place to work through this.

it doesn't make a difference if you use simulate_mssql() or simulate_sqlite():

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

v1 <- list(id = LETTERS[1:5], x = 1:5, y = 5:1)
v2 <- list(id = LETTERS[1:5], w = 2:6)
lf1 <- lazy_frame(!!!v1, con = simulate_mssql())
lf2 <- lazy_frame(!!!v2, con = simulate_mssql()) |>
  rename(z = "w")

lf1 |>
  semi_join(lf2, join_by("id", between(y$z, x$x, x$y)))
#> <SQL>
#> SELECT `df_LHS`.*
#> FROM `df` AS `df_LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM `df` AS `df_RHS`
#>   WHERE
#>     (`df_LHS`.`id` = `df_RHS`.`id`) AND
#>     (`df_LHS`.`x` <= `df_RHS`.`w`) AND
#>     (`df_LHS`.`y` >= `df_RHS`.`z`)
#> )

lf1 <- lazy_frame(!!!v1, con = simulate_sqlite())
lf2 <- lazy_frame(!!!v2, con = simulate_sqlite()) |>
  rename(z = "w")

lf1 |>
  semi_join(lf2, join_by("id", between(y$z, x$x, x$y)))
#> <SQL>
#> SELECT `df_LHS`.*
#> FROM `df` AS `df_LHS`
#> WHERE EXISTS (
#>   SELECT 1 FROM `df` AS `df_RHS`
#>   WHERE
#>     (`df_LHS`.`id` = `df_RHS`.`id`) AND
#>     (`df_LHS`.`x` <= `df_RHS`.`w`) AND
#>     (`df_LHS`.`y` >= `df_RHS`.`z`)
#> )

Created on 2024-07-17 with reprex v2.1.1

francisbarton commented 3 months ago

There is a test that seems to relate to this here - with a comment to say that the query generated is wrong, although all the tests there pass.

If I amend the example situation in that test so that the last condition with lf4 is a join_by() criterion with the second part involving a comparison with < (forget about all the stuff about between() above), then we get:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

# from https://github.com/tidyverse/dbplyr/blob/860bd6adf988a2b039485030d234eba05ee46a3c/tests/testthat/test-verb-joins.R#L656-L681
lf1 <- lazy_frame(x = 1, a = 2, .name = "lf1")
lf2 <- lazy_frame(x = 1, b = 3, .name = "lf2")
lf3 <- lazy_frame(x = 1, c = 4, .name = "lf3")
lf4 <- lazy_frame(x = 1, a = 5, .name = "lf4") |>
  rename(a4 = "a")

lf1 |>
  inner_join(lf2, by = "x") |>
  inner_join(lf3, by = "x") |>
  inner_join(lf4, by = join_by("x", a < a4))
#> <SQL>
#> SELECT `lf1`.*, `b`, `c`, `lf4`.`a` AS `a4`
#> FROM `lf1`
#> INNER JOIN `lf2`
#>   ON (`lf1`.`x` = `lf2`.`x`)
#> INNER JOIN `lf3`
#>   ON (`lf1`.`x` = `lf3`.`x`)
#> INNER JOIN `lf4`
#>   ON (`lf1`.`x` = `lf4`.`x` AND `lf1`.`a` < `lf4`.`a`)

Created on 2024-07-18 with reprex v2.1.1

Suggesting that the rename is being kept in the top-level SELECT line in an attempt to avoid having a nested query, but a rename like this probably needs to be operationalised sooner, in the final INNER JOIN, something like this? :

SELECT `lf1`.*, `b`, `c`, `a4`
FROM `lf1`
INNER JOIN `lf2`
  ON (`lf1`.`x` = `lf2`.`x`)
INNER JOIN `lf3`
  ON (`lf1`.`x` = `lf3`.`x`)
INNER JOIN (SELECT `x`, `a` AS `a4` FROM `lf4`) AS `lf4`
  ON (`lf1`.`x` = `lf4`.`x` AND `lf1`.`a` < `lf4`.`a4`)
francisbarton commented 3 months ago

Here's a really small reprex showing the issue with rename() - I think this behaviour is incorrect? In real life this SQL will cause an error saying that lf2.x is not found.

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

lf1 <- lazy_frame(x = seq(2), con = simulate_sqlite(), .name = "lf1")
lf2 <- lazy_frame(x = 1, con = simulate_sqlite(), .name = "lf2") |>
  rename(y = "x")

lf1 |>
  left_join(lf2, by = c(x = "y"), keep = TRUE)
#> <SQL>
#> SELECT `lf1`.`x` AS `x`, `lf2`.`x` AS `y`
#> FROM `lf1`
#> LEFT JOIN `lf2`
#>   ON (`lf1`.`x` = `lf2`.`x`)

Created on 2024-07-18 with reprex v2.1.1

The behaviour in my first post above with the left and right sides of between() being treated differently is more bizarre, but at root I am guessing it is something to do with where renames are actioned?