r-dbi / DBI

A database interface (DBI) definition for communication between R and RDBMSs
https://dbi.r-dbi.org
GNU Lesser General Public License v2.1

Multi-statement support #402

Open mmuurr opened 1 year ago

mmuurr commented 1 year ago

I created an issue (quite a while back!) specifically asking about this with RPostgres, where libpq has functionality to handle multi-statement SQL being sent to the DB server. The recommendation was to bubble this up to DBI, but I also found a duplicate DBI issue with essentially the same use case: SQL scripts holding multiple statements that we'd like to relay (perhaps after string interpolation) to the DB via R/DBI. The recommendation there was to bubble this out to projects like squr or dbr, but neither of those projects seems to be active (the last commits were 3 and 2 years ago, respectively, and neither is available on CRAN).

The general pattern of:

readr::read_file("some-sql-script-template.sql") %>%
  glue::glue_sql(.con = my_con) %>%  ## maybe we do some interpolation here
  DBI::dbMultiExecute(my_con, .)  ## imaginary DBI method (conn-first, like dbExecute())

... still seems very appealing to me. I've written wrappers around this that try to robustly split the SQL script into separate statements and loop over them, but I can't help wondering whether this should be available within DBI, especially for the DB back-ends that natively support multi-statement SQL.
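
For illustration, the kind of wrapper I mean looks roughly like this. It's a minimal sketch (db_execute_script() is just a name I'm making up here) that naively splits on semicolons and will break if a semicolon appears inside a string literal or a function body:

## Rough sketch of a manual wrapper: split a script on semicolons, then loop.
## The naive split breaks on semicolons inside string literals or $$ bodies.
db_execute_script <- function(con, sql_script) {
  statements <- strsplit(sql_script, ";", fixed = TRUE)[[1]]
  statements <- trimws(statements)
  statements <- statements[nzchar(statements)]  ## drop empty fragments
  for (statement in statements) {
    DBI::dbExecute(con, statement)
  }
  invisible(length(statements))
}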

DB back-ends differ in what multi-statement execution returns to the client and how it should be handled (PostgreSQL, for example, simply sends back the standard return of the final statement; MySQL, on the other hand, wants to send back multiple resultsets). For this reason, I think a new, optionally-supported execution method like DBI::dbMultiExecute() could be used, with return-value semantics that align with the traditional DBI::dbExecute() method: simply the last status value sent back by the DB. The risk seems to be poor interpretation of that status, e.g. if there's a failure mid-way, on which statement did the full execution fail? But this also seems (to me) like a standard 'here-be-dragons' situation in programming, and anyone using multi-statement SQL with these databases outside of R is already accepting this risk. (Experienced folks should probably be wrapping those scripts in transactions anyhow, one way or another.)
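
To illustrate that last point, here's a minimal sketch of the transaction wrapping I have in mind (run_script_in_transaction() is a hypothetical helper; the statements are assumed to be already split):

## Sketch only: run pre-split statements inside a single transaction so a
## mid-script failure rolls everything back.
run_script_in_transaction <- function(con, statements) {
  DBI::dbWithTransaction(con, {
    for (statement in statements) {
      DBI::dbExecute(con, statement)
    }
  })
}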

This might wind up being a "won't support" feature, but I keep finding myself wanting to adopt the example pattern shown above, and I imagine there are plenty of other folks out there with SQL scripts who want to do the same without needing to tackle some amount of script-parsing on their own, with or without an existing R package to help. (FWIW, I've been using {reticulate} with the sqlparse Python package to help with this, but it seems like a heavyweight solution to a problem that shouldn't exist -- at least for me, because I use PostgreSQL/libpq most of the time :-)
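
Concretely, that {reticulate} workaround looks something like this (a sketch; it assumes the sqlparse Python package is installed in the active Python environment and reuses my_con from above):

## Sketch of the {reticulate} + sqlparse workaround: let Python split the
## script into statements, then loop over the pieces in R.
sqlparse <- reticulate::import("sqlparse")
statements <- sqlparse$split(readr::read_file("some-sql-script-template.sql"))
for (statement in statements) {
  DBI::dbExecute(my_con, statement)
}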

Curious about your thoughts, but regardless of them, thanks much for the continued DBI work (it's highly-valued)!

krlmlr commented 1 year ago

Thanks. I'm torn on this one. I see two components:

- splitting a script that contains multiple statements into individual statements
- executing those statements, one after another

The former seems hard and somewhat dependent on the SQL dialect; the latter is easy to roll manually with a simple loop.

If we want to solve the first component in DBI or in some backends, we can also help with the second. I'm not convinced about going the other way around.

AFAICT, libpq requires multiple queries to be separated by semicolons: https://www.postgresql.org/docs/current/libpq-exec.html. (And we'd need to use PQexec(), but splitting by semicolon is fairly easy with what DBI offers today, assuming a semicolon can't appear inside a valid single SQL query.) What's the situation with other drivers?

mmuurr commented 1 year ago

I also think splitting a text stream by some delimiter is quite difficult, as it requires re-implementing what some of the native SQL servers/drivers already do. For example, with libpq, my instinct is to let PQexec() handle the multi-statement SQL and not have DBI pre-process it in any way.

For SQL systems that don't have native multi-statement support, I think DBI should likewise not permit it. This is the main reason I'd propose not changing the semantics (or any code) of dbSendStatement() or dbExecute(), and instead introducing new multi-statement versions (e.g. dbMultiSendStatement()) and letting the individual drivers opt in: a driver implements the new generics only when its backend natively supports multi-statement execution, and the default simply signals an error.
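
To make that shape concrete, a minimal sketch (entirely hypothetical -- neither generic exists in DBI today):

library(DBI)

## Hypothetical generic mirroring dbSendStatement(); not part of DBI.
setGeneric("dbMultiSendStatement", function(conn, statement, ...) {
  standardGeneric("dbMultiSendStatement")
})

## Default method: backends without native multi-statement support refuse,
## rather than DBI trying to split the SQL itself.
setMethod("dbMultiSendStatement", "DBIConnection", function(conn, statement, ...) {
  stop("This backend does not support multi-statement execution.")
})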

The main difference I see between PostgreSQL's and MySQL's multi-statement support is what comes back to the client: PostgreSQL returns only the result of the final statement, while MySQL returns one resultset per statement.

These can then consistently be represented as a single list (of length-one in PostgreSQL's case) of DBIResult objects. This might make working with methods like dbHasCompleted(), dbGetRowsAffected(), etc. harder, though, so perhaps one needs to make *Multi* versions of nearly all objects and methods?
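
Usage against that representation might then look roughly like this (again hypothetical, building on the dbMultiSendStatement() sketch above and assuming it returns a list of DBIResult objects):

## Hypothetical usage: one DBIResult per statement (length one for PostgreSQL,
## one per resultset for MySQL).
results <- dbMultiSendStatement(my_con, sql_script)
rows_affected <- sapply(results, DBI::dbGetRowsAffected)
invisible(lapply(results, DBI::dbClearResult))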

I see the complexity of integrating this with the existing DBI API, but at the same time I see how easily it could be done with PostgreSQL in particular, and a very compelling use case in templated SQL scripts (with either glue::glue_sql() or native value interpolation), so I'm also torn. That use case is compelling enough that I'm already doing some manual splitting of my scripts, and it feels fragile and unnecessary, since libpq can do this better on its own. (FWIW, splitting those scripts into separate files doesn't really make sense either, as some complex SQL scripts can contain hundreds of individual statements, often building temporary tables, creating-then-later-dropping functions, etc.)

mlin commented 1 year ago

Adding a +1 for this. SQLite supports semicolon-separated statements through sqlite3_exec(). I agree it should be supported only when the underlying DB handles it natively like that; I wouldn't want to see R code trying to split the SQL itself!