ropensci / ckanr

R client for the CKAN API
https://docs.ropensci.org/ckanr
Other
99 stars 38 forks source link

CKANConnection uses an old dbplyr interface #187

Open fjuniorr opened 1 year ago

fjuniorr commented 1 year ago

Here is the message when we connect to a CKAN site with DataStore enabled:

library("ckanr"); library("dplyr")
ckan <- src_ckan("https://dados.mg.gov.br/")
#> url: https://dados.mg.gov.br/

resource <- tbl(src = ckan, name = "72d031e9-2753-469a-acfa-2d67417a2f49")
#> Warning: <CKANConnection> uses an old dbplyr interface
#> ℹ Please install a newer version of the package or contact the maintainer
#> This warning is displayed once every 8 hours.

According to Hadley this should be a simple fix and the details are available at dbplyr 2.0.0 backend API • dbplyr.

Created on 2022-10-21 with reprex v2.0.2

Session Info and Traceback ``` r sessionInfo() #> R version 4.2.1 (2022-06-23) #> Platform: x86_64-apple-darwin17.0 (64-bit) #> Running under: macOS Big Sur ... 10.16 #> #> Matrix products: default #> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib #> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib #> #> locale: #> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 #> #> attached base packages: #> [1] stats graphics grDevices utils datasets methods base #> #> other attached packages: #> [1] dplyr_1.0.10 ckanr_0.6.0 DBI_1.1.3 #> #> loaded via a namespace (and not attached): #> [1] Rcpp_1.0.9 compiler_4.2.1 pillar_1.8.1 dbplyr_2.2.1 #> [5] highr_0.9 tools_4.2.1 digest_0.6.29 jsonlite_1.8.1 #> [9] evaluate_0.16 lifecycle_1.0.2 tibble_3.1.8 pkgconfig_2.0.3 #> [13] rlang_1.0.6 reprex_2.0.2 cli_3.4.1 rstudioapi_0.14 #> [17] crul_1.3 curl_4.3.2 yaml_2.3.5 xfun_0.33 #> [21] fastmap_1.1.0 withr_2.5.0 stringr_1.4.1 knitr_1.40 #> [25] generics_0.1.3 fs_1.5.2 vctrs_0.4.2 triebeard_0.3.0 #> [29] tidyselect_1.1.2 glue_1.6.2 httpcode_0.3.0 R6_2.5.1 #> [33] fansi_1.0.3 rmarkdown_2.16 purrr_0.3.4 magrittr_2.0.3 #> [37] urltools_1.7.3 htmltools_0.5.3 assertthat_0.2.1 utf8_1.2.2 #> [41] stringi_1.7.8 ``` Created on 2022-10-21 with [reprex v2.0.2](https://reprex.tidyverse.org)
rdenham commented 3 months ago

With the release of dbplyr 2.5.0 this is now pretty urgent. Any plans to address this?

florianm commented 3 months ago

@rdenham thanks for the ping! I've taken a quick stab at upgrading the dbplyr generics. Feel free to try out my branch (see PR) if you have time, any feedback appreciated.

rdenham commented 3 months ago

Thanks for your prompt reply @florianm !

The problem I have is with the dplyr interface, just like in the example at the start of this issue:

library("ckanr"); library("dplyr")
ckan <- src_ckan("https://dados.mg.gov.br/")

This will throw a slightly misleading error (see this issue) but basically, the dbplyr function src_sql is now fully deprecated, so I can't currently get the dplyr interface working.

My example url, in case it's useful, is:

ckan <- src_ckan("https://www.data.qld.gov.au/")

Please let me know if I've missed something.

rdenham commented 3 months ago

Just having a look around, it seems like https://github.com/r-dbi/bigrquery might provide me with some idea on how to upgrade this part to work with DBI; there are some similar ideas in there.

florianm commented 3 months ago

I have missed that one, I initially went by the dbplyr upgrade guide. Sure enough, https://dbplyr.tidyverse.org/reference/src_sql.html says "Deprecated: please use directly use a DBIConnection object instead." and it's used in ckanr::src_ckan.

So we need to upgrade src_ckan to use https://dbi.r-dbi.org/reference/DBIConnection-class.html in R/dplyr.R:

src_ckan <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {
    stop("Please install dplyr", call. = FALSE)
  }
  drv <- new("CKANDriver")
  con <- dbConnect(drv, url = url)
  info <- dbGetInfo(con)
  src_sql("ckan", con, info = info) ## This line needs to be upgraded
}

Any ideas welcome! I'll set up a local CKAN test instance if I get a chance later. Greetings from Western Australia! (was formerly involved in data.wa.gov.au)

rdenham commented 3 months ago

Thanks, I'll spend a little bit of time on it later tonight. Shouldn't be hard, just I need to work out how it all fits together first :-).

Greetings from Western Australia!

And the same to you from Queensland!

rdenham commented 3 months ago

If I get this right, we should drop the approach of:

my_ckan <- src_ckan("http://demo.ckan.org")

and instead use something like:

con <- dbConnect(ckanr::ckan(), url = "https://www.data.qld.gov.au/")
tbl(con, "587f65ae-6675-4b8e-bac5-606ce7f4446a")

where ckanr::ckan() is just something like:

ckan <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {
    stop("Please install dplyr", call. = FALSE)
  }
  drv <- new("CKANDriver")
}

Then the nomal dbi stuff should work, and the dplyr interface will now just follow.

Here's my test:

## DBI approach
sql = 'select "Density class", "Full" from "587f65ae-6675-4b8e-bac5-606ce7f4446a" limit 5'
dbGetQuery(con, sql)

## dplyr approach
tbl(con, "587f65ae-6675-4b8e-bac5-606ce7f4446a") %>% 
  select(`Density class`, Full) 

I do get a Warning: Translator is missing window variants of the following aggregate functions: paste. Not sure how serious this is, and I think can be addressed using dbplyr::sql_paste or something.

florianm commented 3 months ago

Thanks for that! I can get the connection to work but am running into various errors with the examples. Pending a closer look, this seems like we'd need to implement/change some of the generics and update examples and vignettes.

Examples of errors I'm seeing:

# Minimal working updated src_ckan
src_ckan <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {stop("Please install dplyr", call. = FALSE)}
  drv <- new("CKANDriver")
  dbConnect(drv, url = url)
}

my_ckan <- src_ckan("https://www.data.qld.gov.au/")

tbl2 <- tbl(src = my_ckan, name = "587f65ae-6675-4b8e-bac5-606ce7f4446a")
Error in tbl.DBIConnection(src = my_ckan, name = "587f65ae-6675-4b8e-bac5-606ce7f4446a") : 
  argument "from" is missing, with no default

> tbl1 <- tbl(src = my_ckan, from = "test", name = tbl_list[1])
Error in ds_search_sql(as.character(statement), url = conn@url, as = "table") : 
  {"help": "https://www.data.qld.gov.au/api/3/action/help_show?name=datastore_search_sql", "error": {"query": ["(psycopg2.errors.UndefinedTable) relation \"test\" does not exist\nLINE 2: FROM \"test\"\n             ^\n\n[SQL: SELECT * FROM (SELECT *\nFROM \"test\"\nWHERE (0 = 1)) AS blah LIMIT 32001 ;]\n(Background on this error at: https://sqlalche.me/e/14/f405)"], "info": {"statement": ["SELECT * FROM (SELECT *\nFROM \"test\"\nWHERE (0 = 1)) AS blah LIMIT 32001 ;"], "params": [{}], "orig": ["relation \"test\" does not exist\nLINE 2: FROM \"test\"\n             ^\n"]}, "__type": "Validation Error"}, "success": false}

I'll keep looking into this as bandwidth allows.

rdenham commented 3 months ago

my thoughts were to drop src_ckan, and just have something like ckan as the way to connect to the 'database', just like you would in other drivers for other DBI::dbConnect() examples. So instead of src_ckan, I suggest:

ckan <- function() {
  drv <- new("CKANDriver")
}

Then do the connection like any other DBI connection, ie

con <- dbConnect(ckanr::ckan(), url = "https://www.data.qld.gov.au/")

The src_ckan function is no longer needed as you can get the tbl directly from a DBI connection now. Like the help for dbplyr::src_sql says, please use directly use a DBIConnection object instead.

This evening I'll try to go through the dplyr.R code in ckanr to try to clarify, and drop the deprecated components. I'll share that and you can see if that's the approach you'd like to take.

florianm commented 3 months ago

I'm not a maintainer and only speaking as a contributor here, so this is just my opinion. I like the approach of having just one function to create a connection. I assume we will have to adjust all examples, as the errors I'm seeing indicate that the DBI connection works differently to the R4 object returned by ckanr v0.7.0 (dbplyr v1). I may be wrong about this. So the function signature could be

function_name_here <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {stop("Please install dplyr", call. = FALSE)}
  drv <- new("CKANDriver")
  DBI::dbConnect(drv, url = url)
}

If we retain the function name src_ckan:

Pro your approach:

# Perfectly valid and idiomatic usage of DBI
con <- DBI::dbConnect(ckanr::ckan(), url = "https://www.data.qld.gov.au/")

# Less typing, easier to memorize
con <- ckanr::ckan("https://www.data.qld.gov.au/")

Pedantry aside, I can create a valid DCI connection to the QLD CKAN, but DBI::tbl throws the error that it requires a "from" parameter. What could I be doing wrong?

src_ckan <- function(url) {
  if (!requireNamespace("dplyr", quietly = TRUE)) {
    stop("Please install dplyr", call. = FALSE)
  }
  # drv <- new("CKANDriver")
  dbConnect(ckanr::ckan(), url = url)
}

#' @export
ckan <- function() {
  drv <- new("CKANDriver")
  drv
}
> my_ckan <- src_ckan("https://www.data.qld.gov.au/")
> tbl_list <- DBI::dbListTables(my_ckan, limit=5)

# The parameter "from" seems to be required but no example in ckanr shows it being used. Odd.
> tbl1 <- tbl(src = my_ckan, name = tbl_list[1])
Error in tbl.DBIConnection(src = my_ckan, name = tbl_list[1]) : 
  argument "from" is missing, with no default

# Let's try a "from".
> tbl1 <- tbl(src = my_ckan, from = "test", name = tbl_list[1])
Error in ds_search_sql(as.character(statement), url = conn@url, as = "table") : 
  {"help": "https://www.data.qld.gov.au/api/3/action/help_show?name=datastore_search_sql", "error": {"query": ["(psycopg2.errors.UndefinedTable) relation \"test\" does not exist\nLINE 2: FROM \"test\"\n             ^\n\n[SQL: SELECT * FROM (SELECT *\nFROM \"test\"\nWHERE (0 = 1)) AS blah LIMIT 32001 ;]\n(Background on this error at: https://sqlalche.me/e/14/f405)"], "info": {"statement": ["SELECT * FROM (SELECT *\nFROM \"test\"\nWHERE (0 = 1)) AS blah LIMIT 32001 ;"], "params": [{}], "orig": ["relation \"test\" does not exist\nLINE 2: FROM \"test\"\n             ^\n"]}, "__type": "Validation Error"}, "success": false}

# Using a valid table name as value for "from"
> tbl1 <- tbl(src = my_ckan, from = tbl_list[1], name = tbl_list[1])
Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
Using SQL: _id
Using SQL: _full_text
Using SQL: Motor Accident Insurance Commission
Caused by error in `curl::curl_fetch_memory()`:
! URL rejected: Malformed input to a URL function
rdenham commented 3 months ago

Obviously, I'd defer to you and the package authors, but my reasoning for the

con <- DBI::dbConnect(ckanr::ckan(), url = "https://www.data.qld.gov.au/")

style is that it follows the standard approach of connecting to a backend, like

con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

and similarly for SQLite, duckdb etc. So consistent and therefore intuitive for users. I think this would also solve your difficulty in using DBI::tbl, but I can't tell exactly why it doesn't work in your example. Maybe you'd like to check my fork? For me, that should be able to be used like:

library(dbplyr)
library(dplyr)
library(ckanr)

drv = ckan()
dbGetInfo(drv)
summary(drv)
dbUnloadDriver(drv) # not sure what this should do

con <- dbConnect(drv, url = "https://www.data.qld.gov.au/")

dbDisconnect(con) # not sure what this should do 

dbGetInfo(con) # shows url

sql = 'select "Density class", "Full" from "587f65ae-6675-4b8e-bac5-606ce7f4446a" limit 5'
rs <- dbSendQuery(con, sql)
dbFetch(rs)
dbGetQuery(con, sql)

dbListTables(con, limit =5)

res = dbReadTable(con, '587f65ae-6675-4b8e-bac5-606ce7f4446a')
class(res)

# doesn't work at the moment
#dbExistsTable(con, '587f65ae-6675-4b8e-bac5-606ce7f4446a')

sql = 'select "Density class", "Full" from "587f65ae-6675-4b8e-bac5-606ce7f4446a" limit 5'
rs <- dbSendQuery(con, sql)

dbFetch(rs)

dbListFields(rs)

## dplyr interface

tab1 <- tbl(con, "587f65ae-6675-4b8e-bac5-606ce7f4446a")

# check translations
# explain won't wok
try(
tab1 %>% 
  summarise(sdf = sd(Full)) %>% 
  explain())

# but show query does
tab1 %>% 
  summarise(sdf = sd(Full)) %>% 
  show_query()

# paste 
tab1 %>% 
  mutate(x = paste(`_id`, `Monitoring period`)) %>% 
  show_query()

# count
tab1 %>% 
  summarise(nobs=n()) %>% 
  show_query()

tab1 %>% 
  summarise(xc=cor(Full)) %>% 
  show_query()

# not sure that's what we want
tab1 %>% 
  summarise(xp=paste(`Density class`)) %>% 
  show_query()

tab1 %>% 
  mutate(x = paste(`_id`, `Monitoring period`)) %>% 
  select(x)

tab1 %>% 
  select(-`_full_text`)

I think that most of the code in dplyr.R is no longer needed, so I've simplified that a lot, but you might like to check that a bit more thoroughly.

I notice that the vignette doesn't have much in it on the dplyr methods, so it shouldn't be hard to update, and perhaps expand on this. I'm happy to do this if you like.

florianm commented 3 months ago

This looks great, I'll try it out!

IMHO it would be beneficial to replace ckanr wrappers with idiomatic DBI / dplyr / dbplyr code and examples (vignette?). The only purpose of ckanr wrappers would be to handle CKAN datastore API behaviour or to simplify code.