Best practices using pool with promises/future

davidchall commented 5 years ago

In the documentation for the promises package, it states:

Future code blocks cannot use resources such as database connections and network sockets that were created in the parent process. This is true regardless of what future implementation you use! Even if it seems to work with a simple test, you are asking for crashes or worse by sharing these kinds of resources across processes.

Instead, make sure you create, use, and destroy such resources entirely within the scope of the future code block.

This means they are recommending to write future code blocks like:

future({
  conn <- DBI::dbConnect(RMySQL::MySQL(), ...)
  result <- tbl(conn, "flights") %>%
    group_by(dest) %>%
    summarise(delay = mean(dep_time)) %>%
    collect()
  DBI::dbDisconnect(conn)
  result
})

I'd like to retain the pool package to manage database connections in my Shiny app, if possible.

If I set a minimum pool size of zero and an idle timeout of zero, does this ensure that a new database connection is instantiated for each future code block? (Thereby satisfying the requirements of the promises package.)

For example:

pool <- pool::dbPool(
  RMySQL::MySQL(),
  minSize = 0,
  maxSize = 5,
  idleTimeout = 0,
  ...
)
future({
  tbl(pool, "flights") %>%
    group_by(dest) %>%
    summarise(delay = mean(dep_time)) %>%
    collect()
})
pool::poolClose(pool)

r2evans commented 5 years ago

@davidchall, have you had luck playing with and/or using the above code chunk? I have a project where parallelization (with future/promises) is a must-have, and I'd really like to be more parallel-friendly with something like your code above. (I have not tested it, I'm asking about lessons-learned since then so that I don't have to learn them, too :-)

antoine-sachet commented 5 years ago

It seems to be working! I have successfully used an external pool (with RMariaDB backend) with minSize = 0 and idleTimeout = 0 in a future.

Just in case, I use the following option to make sure I fail fast if a db connection is passed to the future (which happens when idleTimeout > 0).

# Fail early if a non-exportable object (db con) is passed to future
options(future.globals.onReference = "error")

With this option, the code fails straight away if the pool contains a db connection, with the error:

Error: Detected a non-exportable reference (‘externalptr’) in one of the globals (‘db_pool’ of class ‘Pool’) used in the future expression

I will report back if I observe errors under load.

Tazovsky commented 5 years ago

Remember that without dplyr::collect at the end of query future will fail, because it loses connection with pool

future::plan(multisession)

db_res <- future::future({
  connection <- pool::dbPool(
    maxSize = Inf,
    minSize = 0,
    validationInterval = 15,
    RPostgres::Postgres(),
    host = host,
    user = user,
    password = password,
    dbname = dbname,
    port = port
  )

# does not work:
  get_data_from_db(connection)

# works:
  get_data_from_db(connection) %>% 
    dplyr::collect()

value(db_res)
}

antoine-sachet commented 5 years ago

@Tazovsky If you define the pool in the future, you also need to close it from within.

By the way, to avoid leaking connections when you turn off the API, you need to close the poolon.exit.

I would suggest calling on.exit(pool::poolClose(db_pool)) immediately after creating the pool is best practice, regardless of where you create the pool.

Tazovsky commented 5 years ago

Yes, you are absolutely right. Thank for reminding it ;)

r2evans commented 5 years ago

Correct me if I'm wrong, but ... one commonly-used mechanism in shiny apps is that the block would be reacting to inputs and/or other events, so the connection might (and probably will) be needed again. Opening and closing a connection every time can become a bottleneck (on both ends, though mostly for the client).

I answered a question on stackoverflow some time ago (https://stackoverflow.com/a/57873387/3358272) with a "checkout/return" mechanism. It's imperfect (still debugging leaked connections), but it keeps the pool alive within each future block, as many future instances as are created (by the parallelization backend). Using that example, I'd recommend something like:

if (!exists("mypool")) assign("mypool", do.call(pool::dbPool, cred), envir = .GlobalEnv)
con <- pool::poolCheckout(mypool)
# do database stuff here
on.exit(pool::poolReturn(con), add=TRUE)

You could do this manually in every future-able block you use, or write a wrapper that does it for you. Assigning the mypool to the global environment is intentional, hoping to give the best chance of preserving what namespace control shiny and/or future will impose implicitly on the executed code.

The first time that block runs, assuming cred was already defined in the shiny global or server environment (and passed to the sub-process via future's implicit recognition of global variable references), then mypool does not exist so is created. From there on out, each call to the chunk will see the mypool and checkout an object. If it times out, then pool should internally handle reconnection. (The next step in this code would be some basic error-checking to ensure other errors don't occur, such as unknown host, authentication error, etc.)

The cred object should likely be created in the parent (shiny app global), and look something like

cred <- list(
  driver = "PostgreSQL ANSI(x64)",
  uid = "myname", pwd = Sys.getenv("DBPASS"),
  host = "serverip", port = 5432
)

or, better yet, use the config:: package so that you can differentiate between deployed/production and test/dev environments. (I'm making a leap on that last recommendation.) One such config.yml file could be:

development:
  db:
    driver: "PostgreSQL ANSI(x64)",
    server: "127.0.0.1"
    port: 25432
    database: devdb
    user: postgres
    password: mysecret

rsconnect:
  db:
    driver: "PostgreSQL ANSI(x64)",
    server: "11.22.33.44"
    port: 5432
    database: proddb
    user: !expr Sys.getenv("dbuser")
    password: !expr Sys.getenv("dbpass")

Capitalizing on RStudio's recommendations (in this case, for an RStudio Connect deployment, but frankly it doesn't matter as long as your production and development environments set a R_CONFIG_ACTIVE env var; ref https://github.com/rstudio/config). Once that is in a config.yml file in the shiny dir or one of its parent dirs, you can add this to your shiny app (global or server environment):

cred <- config::get("db")

lmeninato commented 4 years ago

It seems to be working! I have successfully used an external pool (with RMariaDB backend) with minSize = 0 and idleTimeout = 0 in a future.

Just in case, I use the following option to make sure I fail fast if a db connection is passed to the future (which happens when idleTimeout > 0).
# Fail early if a non-exportable object (db con) is passed to future
options(future.globals.onReference = "error")
With this option, the code fails straight away if the pool contains a db connection, with the error:
Error: Detected a non-exportable reference (‘externalptr’) in one of the globals (‘db_pool’ of class ‘Pool’) used in the future expression
I will report back if I observe errors under load.

Is there a way to use idleTimeout > 0 with futures/promises? Essentially this setting makes it so that it opens/closes a connection for every query. It would be great if there was an R6 method like emptyPool to get around this.

Possible usage:

pool <- pool::dbPool(...)

f <- future::future({
  # definitely needed to not use database connections from other process.
  pool$emptyPool()

  # insert code running in forked process (or however defined by plan()) here

  # might be unecessary? is it sharing the pool object from outer scope or was the object copied?
  pool$emptyPool() 
})

Maybe there could be a PR that handles this? Obviously the ideal thing would be for the pool to detect it was passed (I believe it gets copied, otherwise there'd be so many race conditions) to a future object, and then flush its pool of database connections. But I'm sure I could hack out something that closes all the database connections and builds minSize database connections.

zmbc commented 3 years ago

For anyone looking at this in the future, note there are two things going on here:

People asking for a method to use {pool} as a way to essentially cache credentials while still creating a database connection for each future. In this case, the pool is not actually pooling anything and performance is the same as not using a pool at all. This can be done using minSize = 0 and idleTimeout = 0 (I recommend also taking this advice to handle errors).
People asking for a method to actually pool database connections between futures (and hence multiple R processes). This is not possible due to limitations in the database packages themselves. See my answer here for Postgres: https://stackoverflow.com/a/66664783/5981641

hadley commented 1 year ago

Thanks for the great summary @zmbc!

rstudio / pool

Best practices using pool with promises/future #83