tidyverse / dbplyr

Database (DBI) backend for dplyr
https://dbplyr.tidyverse.org
Other
472 stars 170 forks source link

Add <JDBCConnection> backend #1359

Open nviraj opened 12 months ago

nviraj commented 12 months ago

Hi,

I am using “pool” with “RJDBC” and “dbplyr” packages to connect to Athena tables. However I see the following warning.

Warning: <JDBCConnection> uses an old dbplyr interface
ℹ Please install a newer version of the package or contact the maintainer

This has been slightly tricky to pinpoint. Initially I thought it had something to do with RJDBC but it does not seem to be the case, as seen here. I subsequently reported the issue to Pool, but it does not seem to be originating from there as well as I could replicate the issue using a DBI connection as well as Pool.

It was easier to generate a reprex using PostgreSQL Docker and JDBC, and the issue does not seem to be database specific. Please let me know if I am doing something incorrectly or need to report this elsewhere.

Steps to reproduce:

1. Create docker image to connect to using the official image: docker run --network=host --name postgresdb_reprex -e POSTGRES_PASSWORD=mysecretpassword -e POSTGRES_USER=postgres -d postgres

2. Download JAR file from here Alternatively you can use the attached Project which contains the script and the JAR file for Windows. pool_issue.zip

3. Run reprex code reprex.R contains the actual code. Running through reprex::reprex(input = "reprex.R", venue = "gh", session_info = TRUE)

Reprex Content:

library(DBI)
#> Warning: package 'DBI' was built under R version 4.3.1
library(pool)
#> Warning: package 'pool' was built under R version 4.3.1
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.1
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(RJDBC)
#> Warning: package 'RJDBC' was built under R version 4.3.1
#> Loading required package: rJava

# Connection details for docker image of postgres
connection_url <- "jdbc:postgresql://localhost:5432/postgres"
username <- "postgres"
password <- "mysecretpassword"
# Where is the JAR file stored
# JDBC downloaded through https://jdbc.postgresql.org/download/
jdbc_path <- "postgresql-42.6.0.jar"

# Create a JDBC driver
driver <-
  RJDBC::JDBC(
    driverClass = "org.postgresql.Driver",
    classPath = jdbc_path,
    identifier.quote = "'"
  )

# Establish connection through DBI
conn <- DBI::dbConnect(driver, connection_url,
  user = username,
  password = password
)
# Write table to database using DBI
if (!DBI::dbExistsTable(conn, "mtcars")) {
  DBI::dbWriteTable(conn, "mtcars", mtcars)
} else {
  base::message("Table already exists!")
}

# Create a Pool connection
con <- pool::dbPool(
  drv = driver, url = connection_url,
  user = username,
  password = password
)

# Connect to table
mtcars_db <- dplyr::tbl(con, "mtcars")
#> Warning: <JDBCConnection> uses an old dbplyr interface
#> ℹ Please install a newer version of the package or contact the maintainer
#> This warning is displayed once every 8 hours.

# Run sample query
mtcars_db %>%
  dplyr::filter(cyl == 8) %>%
  utils::head()
#> # Source:   SQL [6 x 11]
#> # Database: JDBCConnection
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#> 2  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#> 3  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#> 4  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#> 5  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
#> 6  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4

Created on 2023-09-04 with reprex v2.0.2

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.0 (2023-04-21 ucrt) #> os Windows 11 x64 (build 22621) #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate English_India.utf8 #> ctype English_India.utf8 #> tz Asia/Calcutta #> date 2023-09-04 #> pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> ! package * version date (UTC) lib source #> blob 1.2.4 2023-03-17 [1] CRAN (R 4.3.1) #> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.1) #> DBI * 1.1.3 2022-06-18 [1] CRAN (R 4.3.1) #> dbplyr 2.3.3 2023-07-07 [1] CRAN (R 4.3.1) #> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1) #> dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.1) #> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.1) #> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.1) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.1) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.1) #> htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.1) #> knitr 1.43 2023-05-25 [1] CRAN (R 4.3.1) #> later 1.3.1 2023-05-02 [1] CRAN (R 4.3.1) #> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.1) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1) #> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.1) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.1) #> pool * 1.0.1 2023-02-21 [1] CRAN (R 4.3.1) #> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.1) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.0) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.1) #> Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.1) #> D rJava * 1.0-6 2021-12-10 [1] CRAN (R 4.3.0) #> RJDBC * 0.2-10 2022-03-24 [1] CRAN (R 4.3.1) #> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.1) #> rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1) #> rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.1) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.1) #> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1) #> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.1) #> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.1) #> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.1) #> vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.1) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.1) #> xfun 0.40 2023-08-09 [1] CRAN (R 4.3.1) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0) #> #> [1] C:/Users/viraj/AppData/Local/R/win-library/4.3 #> [2] C:/Program Files/R/R-4.3.0/library #> #> D ── DLL MD5 mismatch, broken installation. #> #> ────────────────────────────────────────────────────────────────────────────── ```

4. Stop and remove Docker image

docker stop postgresdb_reprex
docker rm -f postgresdb_reprex

Please let me know if you need any further details. Thanks!

shearerpmm commented 11 months ago

I did some digging. The warning is referring entirely to an issue in dbplyr itself. It essentially saying "the developers of dbplyr haven't yet written a backend for JDBC that is up to their own current standards".

Details

mgirlich commented 11 months ago

Currently, dbplyr doesn't support JDBC but only ODBC. @shearerpmm @nviraj does any of you maybe want to create a PR that adds JDBC support?

shearerpmm commented 11 months ago

I actually would really like to. I'm not sure how big of a lift it is for me, my experience in developing R packages is pretty limited.

Despite not having a specific backend, I've found that dbplyr already works fantastically well over JDBC. So far the only issue I've found is due to a problem in RJDBC, not dbplyr.

There are some places where the generated SQL doesn't work, but I'm not sure that can be solved with a JDBC backend, because JDBC is just a way to connect to a variety of databases with different SQL dialects. In my recent experience, I find SQL generation glitches can be easily routed around by using sql() and manually coding anything dialect-specific.

From this perspective, I'm not sure there a JDBC backend has anything of value to do other than silence this warning. And maybe warn the user that compute() currently doesn't work due to the RJDBC issue mentioned above.

hadley commented 10 months ago

@shearerpmm I think those are still valuable contributions 😄

shearerpmm commented 10 months ago

The little I'm able to do is worth my time. IMHO dbplyr is the best tool available today for building complex data pipelines in SQL databases. I know about the alternatives and the competition is not close.

juniperlsimonis commented 7 months ago

Big +1 on this request and dbplyr being invaluable

I've had to jump over to RJDBC to handle testing an enterprise db migration from an old DB2 to a new SQL Server. i would have used odbc connections, but the recent issues with odbc and db2 meant we were stuck using R 4.1.1 and thus hit memory limits etc (dealing with many tables that have tens of millions of entries and a dozen rows)

there are definitely still a lot of db2 databases out there and I imagine that other folks have had to do similar transitions if they wanted to update R itself, so would expect that there's likely a pick-up in need compared to a few years ago when you could use any old ODBC to connect to a db2 so wouldn't hit this problem.

so far i've been able to generally just hot-swap the source between the db2 and sqlserver but i've just run into this issue of not having a "proper" backend as i've been trying to push more compute towards the database. in particular, i'm trying to do things like inlining tables to avoid collecting the massive dataset just to do a join (rather, i'd like to inline the table and then join it before running a collect statement), and this is what isn't working without the backend. (getting the same warning as OP posted).

i'm not really adept at the backend coding, although i would be up for learning to support this, but i definitely have multiple use cases and would be happy to test things in the wild or provide reprexes if either would be helpful

hadley commented 7 months ago

@juniperlsimonis what are the problems with odbc and DB2? Could you file an issue in https://github.com/r-dbi/odbc?

juniperlsimonis commented 7 months ago

@hadley see ibm's support page there have been multiple issues in odbc that were filed like this one . the solutions folks have been able to implement haven't worked on my end given enterprise limitations.

hadley commented 6 months ago

To get rJava working on my arm64 mac:

brew install openjdk
# follow sudo ln advice
sudo R CMD javareconf
hadley commented 6 months ago

I don't think there's much dbplyr can do with RJDBC since it doesn't appear to expose a way to figure out the underlying database type from the driver/connection:

library(RJDBC)
#> Loading required package: DBI
#> Loading required package: rJava

# Download from https://github.com/xerial/sqlite-jdbc
class_path <- c(
  "~/Downloads/sqlite-jdbc-3.45.1.0.jar",
  "~/Downloads/slf4j-api-1.7.36.jar"
)
con <- DBI::dbConnect(
  JDBC("org.sqlite.JDBC", class_path),
  url = "jdbc:sqlite::memory:"
)

dbGetInfo(con)
#> list()

Created on 2024-02-22 with reprex v2.1.0

shearerpmm commented 6 months ago

Since the user generally knows, could they have the option to tell dbplyr what the database is when the connection is JDBC? I thought all it really affects is the SQL dialect used?

hadley commented 6 months ago

@shearerpmm I think we could in principle do that, but the idea that the connection defines the database type is deeply intwined in the guts of dbplyr, so fixing that would be quite a lot of work.

Yes, it only affects the SQL dialect used but given that dbplyr is all about converting R code to SQL, that's really important!