tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
474 stars 173 forks source link

Oracle generates wrong SQL for `head()` #1436

Closed nilescbn closed 7 months ago

nilescbn commented 9 months ago

I've got an issue that may be dbplyr related but I'm not 100% sure. My apologies if it's not.

In brief, I'm a big fan of using dplyr with databases. So, thank you for your work. In terms of workflow, I typically just print the output of a query to a RStudio notebook interactively before settling on what I want and bringin the data into R using collect(). Last week, I was getting results that just didn't add up. After some troubleshooting, I figured out that the output was not printing correctly.

In terms of a reproducible example, access to the database is limited. So I use some screenshots to demonstrate the behavior. This behavior happens whether I connect to the Oracle database using odbc or ROracle.

The ft object is a connection created using tbl(). The screenshot shows the output of this query.

ft |> distinct(MANAGEMENT_GROUP_CODE)

image

I thought it might have something to do with the RStudio R Markdown notebook but the same thing happens if I print to the Console, as this next screenshot shows.

image

Only 1 row shows up. As the next screenshot, shows there are 8 unique codes and they correctly display if I use collect().

ft |> distinct(MANAGEMENT_GROUP_CODE) |> collect()

image

It's not just a single row that shows up. I've seen it, for example, display 11 rows when it should have shown 15.

I'm guessing the issue is related to dplyr because if I use a SQL chunk to run the query, the output prints correctly (and, as I understand it, that works via DBI). It seems limited to Oracle. It does not happen when I query our SQL Server or Postgres databases. I suspected the tbl_dbi object (and it's Oracle versions) because the behavior goes away when collect creates the tbl_df.

It could be an issue with RStudio because the query output prints correctly without collect() if I use the R GUI console instead.

I downloaded the developmemt version of dbplyr (2.4.0.9000) but that didn't fix it. I'm using R version 4.3.2 and RStudio 2023.12.0+369 "Ocean Storm" on Windows 10 (although I only updated to this version this morning and the issue existed under the prior version I was using too--that version was a few months old at least).

Okay, thank you for the help. I'm standing by to provide additional information if helpful.

hadley commented 8 months ago

Would you mind having a go producing a minimal reprex (reproducible example) with the reprex package? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you! If you've never heard of a reprex before, start by reading about the reprex package, including the advice further down the page.

That'll help us figure out if the problem is in RStudio or somewhere higher up the stack.

nilescbn commented 8 months ago

Thank you, for your attention to this Hadley.

With all the great resources you and colleagues have put out there, I do get the basics of a reprex and the package. However, the instructions say to us "the smallest, simplest, built-in data possible." And I haven't figured out how to do that with a database. I hate to bother with you more, but I'm of course willing to try anything. I'm just not connecting the dots on how to do it...

Here's a generic example representing a simple query of the type I show in the screenshots in my original post.

library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "dsn name",  uid = "your username", pwd = "your password")

con |>  tbl("database_table_name")  |> distinct(column_name)

My thought was that it has something to do with Oracle databases and "tbl_OraConnection" and "tbl_Oracle" objects. So I was assuming that running a simple query like that on a different Oracle database might show the same behavior.

Thank you again, much appreciated. I'll keep figuring out how to do a better reprex, but if anyone has tips, those would be great to hear.

My session info. ``` R version 4.3.2 (2023-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045) Matrix products: default locale: [1] LC_COLLATE=English_United States.utf8 [2] LC_CTYPE=English_United States.utf8 [3] LC_MONETARY=English_United States.utf8 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.utf8 time zone: America/Los_Angeles tzcode source: internal attached base packages: [1] stats graphics grDevices utils datasets methods [7] base other attached packages: [1] dbplyr_2.4.0.9000 dplyr_1.1.4 DBI_1.2.0 loaded via a namespace (and not attached): [1] vctrs_0.6.5 cli_3.6.2 knitr_1.45 [4] rlang_1.1.3 xfun_0.41 purrr_1.0.2 [7] generics_0.1.3 assertthat_0.2.1 glue_1.7.0 [10] bit_4.0.5 hms_1.1.3 fansi_1.0.6 [13] tibble_3.2.1 lifecycle_1.0.4 odbc_1.4.1 [16] compiler_4.3.2 fs_1.6.3 blob_1.2.4 [19] Rcpp_1.0.12 pkgconfig_2.0.3 rstudioapi_0.15.0 [22] keyring_1.3.2 R6_2.5.1 reprex_2.1.0 [25] tidyselect_1.2.0 utf8_1.2.4 pillar_1.9.0 [28] magrittr_2.0.3 withr_2.5.2 tools_4.3.2 [31] bit64_4.0.5 datapasta_3.1.0 ```
hadley commented 8 months ago

Yes, running that example you suggest with reprex would be perfect. You then would just edit the reprex before posting to remove your password from the code you put on github.

nilescbn commented 8 months ago

Okay, thank you. I imagine I might not have quite got all the way there in doing it correctly, please let me know if not.

But here's what reprex() puts on the clipboard (with me masking sensitive info) for the case where the wrong info is returned. As you'll see, to make it even more concise I used summarize() with n_distinct() instead of just distinct().

library(DBI)
library(dplyr, warn.conflicts = FALSE )
library(dbplyr, warn.conflicts = FALSE)

con <- dbConnect(odbc::odbc(), 
                 dsn = [...], 
                 uid = [...}
                 pwd = [...])

con |> 
  tbl(in_schema([...], [...]) |> 
  summarize(n_distinct(DISTRICT_CODE)) 
#> # Source:   SQL [1 x 1]
#> # Database: Oracle 19.00.0000[***masked***]
#>   `n_distinct(DISTRICT_CODE)`
#>                         <dbl>
#> 1                           2

Created on 2024-01-16 with reprex v2.1.0

And then here's an example of how adding collect() returns the correct info.

library(DBI)
library(dplyr, warn.conflicts = FALSE )
library(dbplyr, warn.conflicts = FALSE)

con <- dbConnect(odbc::odbc(), 
                 dsn = [...], 
                 uid = [...}
                 pwd = [...])

con |> 
  tbl(in_schema([...], [...])  |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  collect()
#> # A tibble: 1 × 1
#>   `n_distinct(DISTRICT_CODE)`
#>                         <dbl>
#> 1                          12

Created on 2024-01-16 with reprex v2.1.0

So it tells me there are only two distinct values without including collect() when there are truly 12.

I don't think I added in my previous posts that I load dbplyr because of the in_schema() function (but don't expect that it's very relevant).

hadley commented 8 months ago

I think I'm getting a glimmer of what's going wrong here. Could you please run this code and send me the output?

my_tbl <- con |> tbl(in_schema([...], [...]) 

my_tbl |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  head(6)

my_tbl  |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  head(6) |> 
  show_query()

I think the problem might be that print() automatically calls head() in order to download only a small amount of data. In most databases we use LIMIT to compute that, and LIMIT is applied after the summaries have been computed. But Oracle doesn't support LIMIT so we instead do (e.g.) ROWNUM <= 6 and I'm guessing that's applied before the summaries are computed.

It looks like we need to always use ROWNUM in a subquery, as recommended by https://www.techonthenet.com/oracle/functions/rownum.php.

nilescbn commented 8 months ago

Okay, thank you Hadley. Here's the copy/paste output from reprex() from the code you shared (with certain entries masked with [...] again).

my_tbl |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  head(6)
#> # Source:   SQL [1 x 1]
#> # Database: Oracle 19.00.0000[...]
#>   `n_distinct(DISTRICT_CODE)`
#>                         <dbl>
#> 1                           1

my_tbl  |> 
  summarize(n_distinct(DISTRICT_CODE)) |> 
  head(6) |> 
  show_query()
#> <SQL>
#> SELECT COUNT(DISTINCT "DISTRICT_CODE") AS "n_distinct(DISTRICT_CODE)"
#> FROM "[...]"."[...]"
#> WHERE (ROWNUM <= 6)

Created on 2024-01-17 with reprex v2.1.0

Following that link you sent on subqueries, I can confirm that running this SQL chunk returns the correct result (and also, that query from show_query() produces the wrong count).

SELECT *
FROM (SELECT COUNT(DISTINCT "DISTRICT_CODE") AS "n_distinct(DISTRICT_CODE)"
      FROM "RECFIN_MARTS"."COMPREHENSIVE_REC_CATCH_EST")
WHERE ROWNUM <= 6;

The odd thing is that I'm 99.9% sure that this behaviour is new. I've been using dplyr with this database pretty much as soon as Oracle was supported.

hadley commented 8 months ago

Oh yeah, it looks like it was introduced in #1292, so we made need to revert that change.

Thanks for helping me debug this problem!