rstudio / pool

Object Pooling in R
http://rstudio.github.io/pool/
Other
252 stars 32 forks source link

Automatic validation after error #186

Closed krlmlr closed 2 months ago

krlmlr commented 2 months ago

In some setups, the connection to the database server will be closed by the server or the client library, e.g., after a period of time. Such a connection still gives TRUE for dbIsValid(), but running a query will fail.

Can we extend dbSendQuery() and dbSendStatement() to do a "smoke test" with SELECT 1 after failure, to see if the connection is still alive, and perhaps automatically attempt a reconnect?

CC @kjellpk.

krlmlr commented 2 months ago

This seems to be fine when I simulate a connection drop with duckdb. Kjell, are you seeing a different behavior when the connection to your real database drops?

drv <- duckdb::duckdb()

pool <- pool::dbPool(drv)

# All is fine
con <- pool::poolCheckout(pool)
con@conn_ref
#> <pointer: 0x600002eca180>

DBI::dbGetQuery(con, "SELECT 1")
#>   1
#> 1 1

# Connection drops
con <- unserialize(serialize(con, NULL))

con@conn_ref
#> <pointer: 0x0>

pool::poolReturn(con)

# Connection seems to be fine when checked out from pool again
con <- pool::poolCheckout(pool)
con@conn_ref
#> <pointer: 0x600002ec9fc0>

DBI::dbGetQuery(con, "SELECT 2")
#>   2
#> 1 2
pool::poolReturn(con)

Created on 2024-08-14 with reprex v2.1.0

kjellpk commented 2 months ago

In my experience it only happens when a server drops the connection. Here is a reprex:

pool <- pool::dbPool(drv = RMariaDB::MariaDB(),
                     host = "relational.fel.cvut.cz",
                     port = 3306,
                     user = "guest",
                     password = "ctu-relational")

con <- pool::poolCheckout(pool)

DBI::dbExecute(con, "SET SESSION wait_timeout=2;")
# [1] 0
pool::poolReturn(con)

Sys.sleep(3)

con <- pool::poolCheckout(pool)
DBI::dbIsValid(con)
# [1] TRUE

DBI::dbGetQuery(con, "SELECT 1")
# Error: Server has gone away [2006]
krlmlr commented 2 months ago

Thanks, confirmed.

jcheng5 commented 2 months ago

What happens if you pass a validationInterval=0 argument to dbPool? And if that works, validationInterval=1?

krlmlr commented 2 months ago

This is a public-facing server, it's easy to try various options. That said, both validationInterval = 0 and = 1 seem to work.

I still see a race condition: checkout, connection drop, issuing query. It seems to be easy for the pool package to just retry in the case of query failures, perhaps optionally, and perhaps only if the error text (🙀🙀🙀) matches a pattern.

0

pool <- pool::dbPool(
  drv = RMariaDB::MariaDB(),
  validationInterval = 0,
  host = "relational.fel.cvut.cz",
  port = 3306,
  user = "guest",
  password = "ctu-relational"
)

con <- pool::poolCheckout(pool)

DBI::dbExecute(con, "SET SESSION wait_timeout=2;")
#> [1] 0
pool::poolReturn(con)

Sys.sleep(3)

con <- pool::poolCheckout(pool)
DBI::dbIsValid(con)
#> [1] TRUE

DBI::dbGetQuery(con, "SELECT 1")
#>   1
#> 1 1

Created on 2024-08-21 with reprex v2.1.0

1

pool <- pool::dbPool(
  drv = RMariaDB::MariaDB(),
  validationInterval = 1,
  host = "relational.fel.cvut.cz",
  port = 3306,
  user = "guest",
  password = "ctu-relational"
)

con <- pool::poolCheckout(pool)

DBI::dbExecute(con, "SET SESSION wait_timeout=2;")
#> [1] 0
pool::poolReturn(con)

Sys.sleep(3)

con <- pool::poolCheckout(pool)
DBI::dbIsValid(con)
#> [1] TRUE

DBI::dbGetQuery(con, "SELECT 1")
#>   1
#> 1 1

Created on 2024-08-21 with reprex v2.1.0

default

pool <- pool::dbPool(
  drv = RMariaDB::MariaDB(),
  validationInterval = 60,
  host = "relational.fel.cvut.cz",
  port = 3306,
  user = "guest",
  password = "ctu-relational"
)

con <- pool::poolCheckout(pool)

DBI::dbExecute(con, "SET SESSION wait_timeout=2;")
#> [1] 0
pool::poolReturn(con)

Sys.sleep(3)

con <- pool::poolCheckout(pool)
DBI::dbIsValid(con)
#> [1] TRUE

DBI::dbGetQuery(con, "SELECT 1")
#> Error: Server has gone away [2006]

Created on 2024-08-21 with reprex v2.1.0

krlmlr commented 2 months ago

Example for race condition with validationInterval = 1 :

pool <- pool::dbPool(
  drv = RMariaDB::MariaDB(),
  validationInterval = 1,
  host = "relational.fel.cvut.cz",
  port = 3306,
  user = "guest",
  password = "ctu-relational"
)

con <- pool::poolCheckout(pool)

DBI::dbExecute(con, "SET SESSION wait_timeout=2;")
#> [1] 0
pool::poolReturn(con)

Sys.sleep(1)

con <- pool::poolCheckout(pool)

Sys.sleep(2)

DBI::dbIsValid(con)
#> [1] TRUE

DBI::dbGetQuery(con, "SELECT 1")
#> Error: Server has gone away [2006]

Created on 2024-08-21 with reprex v2.1.0

jcheng5 commented 2 months ago

My understanding of the way we intend people to use pool with databases is to treat the pool object itself as the DBI connection for most cases, in which case the checkout and return are performed automatically just before and after the query. The only(?) reason to do an explicit poolCheckout is if you need transactions, in which case you absolutely do NOT want to respond to a closed connection by opening a new one and retrying.

(It has been quite a while since I’ve thought about this package so please excuse me if I’ve forgotten some important use case!)

krlmlr commented 2 months ago

A classic case of "you're holding it the wrong way", it seems? 🙃

Do you think it's worth stressing this point in ?poolCheckout ?

pool <- pool::dbPool(
  drv = RMariaDB::MariaDB(),
  validationInterval = 1,
  host = "relational.fel.cvut.cz",
  port = 3306,
  user = "guest",
  password = "ctu-relational"
)

DBI::dbExecute(pool, "SET SESSION wait_timeout=2;")
#> [1] 0

Sys.sleep(3)

DBI::dbIsValid(pool)
#> [1] TRUE

DBI::dbGetQuery(pool, "SELECT 1")
#>   1
#> 1 1

Created on 2024-08-21 with reprex v2.1.0

jcheng5 commented 2 months ago

We totally could have a documentation problem. IIRC when Barbara and I originally wrote this package the intention was to have everyone do poolCheckout/poolReturn, it was Hadley’s idea to have the pool itself act like a connection. I don’t remember what the docs emphasized (and I’m about to board a plane so I won’t look right now).

Is this all happening because there’s a real database out there with a session timeout that’s shorter than 60 seconds?

krlmlr commented 2 months ago

No, the timeout could be arbitrarily long. I did assume that pool works this way, but never got the chance to try it in production. The first attempt wasn't successful, we picked shorter timeouts to be able to reprex.

Happy to help review a documentation PR.

krlmlr commented 2 months ago

After reviewing the docs, I think we might want to slightly reframe the use cases.

In Shiny (shiny?) apps with database access, querying the data from a database will block not only the UI of the app currently carrying out the database operation, but also the UIs of all other apps managed by the same process. (We learned that the hard way.) It's rare that we need to serve multiple connections (because all database accesses are short-lived anyway, with implicit checkout + read + return in close succession), but more important to keep the connections stable in the presence of server settings we can't control.

Async DB access is something that ADBC/adbi might be able to solve, but a few more things will need to fall into place before this happens. (Also, should it be async DB, or async DB + computation + ... in a separate R process handled by mirai?) Happy to discuss.

jcheng5 commented 2 months ago

all other apps managed by the same process

I think you mean "session" instead of app? IIRC the relationship between R process and Shiny app is still only 1:1, i.e., you can't run more than one Shiny app simultaneously--but you can have more than one user simultaneously connected to a single app/process.

It's rare that we need to serve multiple connections (because all database accesses are short-lived anyway, with implicit checkout + read + return in close succession), but more important to keep the connections stable in the presence of server settings we can't control.

This is all true but I'm not sure what change in framing you're suggesting? If you use dbPool the way we suggest in the docs (with the exception of not waving you away if you come upon ?poolCheckout) then you get exactly the right behavior, right? With the exception of, I never expected any database server to have a connection idle timeout of less than a minute.

(Also, should it be async DB, or async DB + computation + ... in a separate R process handled by mirai?)

Unfortunately I think the answer is "it depends"... 😄

jcheng5 commented 2 months ago

Added a note to ?poolCheckout: https://github.com/rstudio/pool/pull/189

krlmlr commented 2 months ago

Thanks, the doc extension looks good. I have added my vision to https://github.com/rstudio/pool/pull/190 . These cover what I would add to make sure the reconnection part is covered too. Other than that, yes, the behavior is what I'd be looking for. Thank you for your support!

Yes, session => app 🤦

It depends, yes.

shikokuchuo commented 2 months ago

@krlmlr as we discussed last week, if the DB connection is known to block then a solution would be to use one dedicated mirai process to host the DBI / pool connection for the app (can even be ADBC) - something along the lines of daemons(1, dispatcher = FALSE, .compute = "special_db_uid"). Then make all DBI queries via mirai using the "special_db_uid" compute profile. This is cheap and should work well if you indeed rarely need to serve multiple connections at once.

This would operate independently of other async computations you might (or might not) want to run for the Shiny app using the default or another profile.

Just in relation to the case you mentioned, not talking about broader implications (yet 😄).

krlmlr commented 2 months ago

As Joe said, it depends.

If we move just the database into one process, but have other time-consuming computations and move them to another process, a lot of time will be spent on communication. That's my primary concern and the reason I'm not pushing too hard for "only" database async at this time. I'd like to understand this space better before moving.

shikokuchuo commented 2 months ago

If we move just the database into one process, but have other time-consuming computations and move them to another process, a lot of time will be spent on communication.

I think this also depends on what we're trying to solve. (i) for keeping the Shiny app responsive, both will be async and not block, and (ii) if we're not moving around huge amounts of data, the pure communications overhead is likely to be minimal (and 1/100 of what you might have in mind if your frame of reference is pre-mirai).

That's my primary concern and the reason I'm not pushing too hard for "only" database async at this time. I'd like to understand this space better before moving.

I see it as an opportunity to solve both cases rather than either/or, as I agree that it probably 'depends'!

If I follow my original idea, pool may actually be the natural place for it to live as it's already managing a pool of connections. There could be a user option to have some of these in a background (or remote) process, and return 'mirai' from a pool connection if subsequent requests are wrapped with an async() keyword (for example). Alternatively, we could have a special type of 'async pool' and only 'mirai' are returned from these.

Just floating some thoughts.