tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
474 stars 173 forks source link

postition of head in pipeline no longer leading to different sql #1489

Open edward-burn opened 6 months ago

edward-burn commented 6 months ago

@hadley @mgirlich I notice that with the current version of dbplyr if I put head() in my pipeline it will return the same sql whether it was before or after a filter. I imagine this relates to the sql optimisation that was introduced in version 2.3.0. I think in this case though the older behaviour was more desirable. I'm not sure this was an intentional change or would be considered a bug?

Current behaviour

library(dbplyr)
#> Warning: package 'dbplyr' was built under R version 4.2.3
packageVersion("dbplyr")
#> [1] '2.5.0'
lf_postgres <- lazy_frame(a = TRUE, b = 1, c = 2, d = "zzzyzzz", 
                          con = simulate_postgres())
lf_postgres |> 
  head(1) |> 
  dplyr::filter(d == "zzzyzzz") |> 
  dplyr::show_query()
#> <SQL>
#> SELECT `df`.*
#> FROM `df`
#> WHERE (`d` = 'zzzyzzz')
#> LIMIT 1

lf_postgres |> 
  dplyr::filter(d == "zzzyzzz") |> 
  head(1) |> 
  dplyr::show_query()
#> <SQL>
#> SELECT `df`.*
#> FROM `df`
#> WHERE (`d` = 'zzzyzzz')
#> LIMIT 1

Created on 2024-04-04 with reprex v2.0.2

Old behaviour

# remotes::install_version("dbplyr", version = "2.2.0", repos = "http://cran.us.r-project.org")
library(dbplyr)
packageVersion("dbplyr")
#> [1] '2.2.0'
lf_postgres <- lazy_frame(a = TRUE, b = 1, c = 2, d = "zzzyzzz", 
                          con = simulate_postgres())
lf_postgres |> 
  head(1) |> 
  dplyr::filter(d == "zzzyzzz") |> 
  dplyr::show_query()
#> <SQL>
#> SELECT *
#> FROM (
#>   SELECT *
#>   FROM `df`
#>   LIMIT 1
#> ) `q01`
#> WHERE (`d` = 'zzzyzzz')

lf_postgres |> 
  dplyr::filter(d == "zzzyzzz") |> 
  head(1) |> 
  dplyr::show_query()
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`d` = 'zzzyzzz')
#> LIMIT 1

Created on 2024-04-04 with reprex v2.0.2

hadley commented 6 months ago

Hmmmm, this is probably technically a bug but it's not as straightforward as it might appear because LIMIT IIRC is generally not respected inside of subqueries.

edward-burn commented 6 months ago

Interesting. I suppose throwing a warning if head is included in a complex query might be appropriate?

The desired behaviour (well at least what I was thinking was desired) can be achieved using a filter with row_number. This made me thing that maybe I could use slice_head but then this is not supported - I'm not sure if it would be of interest to support it using something similar to how row_number is working below?

options(conflicts.policy = list(warn = FALSE))
library(dplyr)
library(dbplyr)
library(duckdb)
#> Loading required package: DBI

con <- DBI::dbConnect(duckdb(), path = ":memory:")

test_data <- data.frame(
  person = c(1L, 2L),
  date_1 = as.Date("2001-01-01")
)

db_test_data <- copy_to(con, test_data, overwrite = TRUE)

# head with unexpected behaviour
db_test_data |> 
  head(1) |> 
  dplyr::filter(person == 2L)
#> # Source:   SQL [1 x 2]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.4.0/:memory:]
#>   person date_1    
#>    <int> <date>    
#> 1      2 2001-01-01

db_test_data |> 
  dplyr::filter(person == 2L) |> 
  head(1) 
#> # Source:   SQL [1 x 2]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.4.0/:memory:]
#>   person date_1    
#>    <int> <date>    
#> 1      2 2001-01-01

# filter with row_number provides expected behaviour
db_test_data |> 
  dplyr::filter(row_number() <= 1) |> 
  dplyr::filter(person == 2L)
#> # Source:   SQL [0 x 2]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.4.0/:memory:]
#> # ℹ 2 variables: person <int>, date_1 <date>

db_test_data |> 
  dplyr::filter(row_number() <= 1) |> 
  dplyr::filter(person == 2L) |> 
  dplyr::show_query()
#> <SQL>
#> SELECT person, date_1
#> FROM (
#>   SELECT test_data.*, ROW_NUMBER() OVER () AS col01
#>   FROM test_data
#> ) q01
#> WHERE (col01 <= 1.0) AND (person = 2)

db_test_data |> 
  dplyr::filter(person == 2L) |> 
  dplyr::filter(row_number() <= 1)
#> # Source:   SQL [1 x 2]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.4.0/:memory:]
#>   person date_1    
#>    <int> <date>    
#> 1      2 2001-01-01

db_test_data |> 
  dplyr::filter(person == 2L) |> 
  dplyr::filter(row_number() <= 1) |> 
  dplyr::show_query()
#> <SQL>
#> SELECT person, date_1
#> FROM (
#>   SELECT test_data.*, ROW_NUMBER() OVER () AS col01
#>   FROM test_data
#>   WHERE (person = 2)
#> ) q01
#> WHERE (col01 <= 1.0)

# slice_head not supported, but could be?
db_test_data |> 
  dplyr::filter(row_number() <= 1) |> 
  dplyr::slice_head()
#> Error in `dplyr::slice_head()`:
#> ! `slice_head()` is not supported on database backends.
#> ℹ Please use `slice_min()` instead.

Created on 2024-04-05 with reprex v2.1.0

hadley commented 5 months ago

I'm pretty sure the reason that doesn't work is support for ROW_NUMBER() without an explicit column isn't widespread? My memory is not great here, but I'm pretty sure I would've used that idea if worked broadly.