Closed davidchall closed 1 year ago
@davidchall, have you had luck playing with and/or using the above code chunk? I have a project where parallelization (with future
/promises
) is a must-have, and I'd really like to be more parallel-friendly with something like your code above. (I have not tested it, I'm asking about lessons-learned since then so that I don't have to learn them, too :-)
It seems to be working! I have successfully used an external pool (with RMariaDB
backend) with minSize = 0
and idleTimeout = 0
in a future
.
Just in case, I use the following option to make sure I fail fast if a db connection is passed to the future (which happens when idleTimeout
> 0).
# Fail early if a non-exportable object (db con) is passed to future
options(future.globals.onReference = "error")
With this option, the code fails straight away if the pool contains a db connection, with the error:
Error: Detected a non-exportable reference (‘externalptr’) in one of the globals (‘db_pool’ of class ‘Pool’) used in the future expression
I will report back if I observe errors under load.
Remember that without dplyr::collect
at the end of query future
will fail, because it loses connection with pool
future::plan(multisession)
db_res <- future::future({
connection <- pool::dbPool(
maxSize = Inf,
minSize = 0,
validationInterval = 15,
RPostgres::Postgres(),
host = host,
user = user,
password = password,
dbname = dbname,
port = port
)
# does not work:
get_data_from_db(connection)
# works:
get_data_from_db(connection) %>%
dplyr::collect()
value(db_res)
}
@Tazovsky If you define the pool in the future, you also need to close it from within.
By the way, to avoid leaking connections when you turn off the API, you need to close the poolon.exit
.
I would suggest calling on.exit(pool::poolClose(db_pool))
immediately after creating the pool is best practice, regardless of where you create the pool.
Yes, you are absolutely right. Thank for reminding it ;)
Correct me if I'm wrong, but ... one commonly-used mechanism in shiny apps is that the block would be reacting to inputs and/or other events, so the connection might (and probably will) be needed again. Opening and closing a connection every time can become a bottleneck (on both ends, though mostly for the client).
I answered a question on stackoverflow some time ago (https://stackoverflow.com/a/57873387/3358272) with a "checkout/return" mechanism. It's imperfect (still debugging leaked connections), but it keeps the pool alive within each future block, as many future instances as are created (by the parallelization backend). Using that example, I'd recommend something like:
if (!exists("mypool")) assign("mypool", do.call(pool::dbPool, cred), envir = .GlobalEnv)
con <- pool::poolCheckout(mypool)
# do database stuff here
on.exit(pool::poolReturn(con), add=TRUE)
You could do this manually in every future
-able block you use, or write a wrapper that does it for you. Assigning the mypool
to the global environment is intentional, hoping to give the best chance of preserving what namespace control shiny
and/or future
will impose implicitly on the executed code.
The first time that block runs, assuming cred
was already defined in the shiny global or server environment (and passed to the sub-process via future
's implicit recognition of global variable references), then mypool
does not exist so is created. From there on out, each call to the chunk will see the mypool
and checkout an object. If it times out, then pool
should internally handle reconnection. (The next step in this code would be some basic error-checking to ensure other errors don't occur, such as unknown host, authentication error, etc.)
The cred
object should likely be created in the parent (shiny app global), and look something like
cred <- list(
driver = "PostgreSQL ANSI(x64)",
uid = "myname", pwd = Sys.getenv("DBPASS"),
host = "serverip", port = 5432
)
or, better yet, use the config::
package so that you can differentiate between deployed/production and test/dev environments. (I'm making a leap on that last recommendation.) One such config.yml
file could be:
development:
db:
driver: "PostgreSQL ANSI(x64)",
server: "127.0.0.1"
port: 25432
database: devdb
user: postgres
password: mysecret
rsconnect:
db:
driver: "PostgreSQL ANSI(x64)",
server: "11.22.33.44"
port: 5432
database: proddb
user: !expr Sys.getenv("dbuser")
password: !expr Sys.getenv("dbpass")
Capitalizing on RStudio's recommendations (in this case, for an RStudio Connect deployment, but frankly it doesn't matter as long as your production and development environments set a R_CONFIG_ACTIVE
env var; ref https://github.com/rstudio/config). Once that is in a config.yml
file in the shiny dir or one of its parent dirs, you can add this to your shiny app (global or server environment):
cred <- config::get("db")
It seems to be working! I have successfully used an external pool (with
RMariaDB
backend) withminSize = 0
andidleTimeout = 0
in afuture
.Just in case, I use the following option to make sure I fail fast if a db connection is passed to the future (which happens when
idleTimeout
> 0).# Fail early if a non-exportable object (db con) is passed to future options(future.globals.onReference = "error")
With this option, the code fails straight away if the pool contains a db connection, with the error:
Error: Detected a non-exportable reference (‘externalptr’) in one of the globals (‘db_pool’ of class ‘Pool’) used in the future expression
I will report back if I observe errors under load.
Is there a way to use idleTimeout > 0 with futures/promises? Essentially this setting makes it so that it opens/closes a connection for every query. It would be great if there was an R6 method like emptyPool to get around this.
Possible usage:
pool <- pool::dbPool(...)
f <- future::future({
# definitely needed to not use database connections from other process.
pool$emptyPool()
# insert code running in forked process (or however defined by plan()) here
# might be unecessary? is it sharing the pool object from outer scope or was the object copied?
pool$emptyPool()
})
Maybe there could be a PR that handles this? Obviously the ideal thing would be for the pool to detect it was passed (I believe it gets copied, otherwise there'd be so many race conditions) to a future object, and then flush its pool of database connections. But I'm sure I could hack out something that closes all the database connections and builds minSize database connections.
For anyone looking at this in the future, note there are two things going on here:
Thanks for the great summary @zmbc!
In the documentation for the
promises
package, it states:This means they are recommending to write future code blocks like:
I'd like to retain the
pool
package to manage database connections in my Shiny app, if possible.If I set a minimum pool size of zero and an idle timeout of zero, does this ensure that a new database connection is instantiated for each future code block? (Thereby satisfying the requirements of the
promises
package.)For example: