r2dbc / r2dbc-pool

Connection Pooling for Reactive Relational Database Connectivity
https://r2dbc.io
Apache License 2.0

Connection pool hangs after DB disconnect #107

Closed fgstewart closed 3 years ago

fgstewart commented 3 years ago

Bug Report

Versions

Current Behavior

With the following steps and the attached minimal test app I can reproduce a DB connection pool hang: killing and restarting the DB results in the app never reconnecting.

Stack trace

Table schema

Input Code

```sql
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

DROP TABLE IF EXISTS test_table;

CREATE TABLE test_table (
    id   uuid DEFAULT uuid_generate_v4() NOT NULL,
    name varchar(50) NOT NULL,
    CONSTRAINT "PK_1" PRIMARY KEY (id)
);
```

Steps to reproduce

Link to minimal test app reproducing issue: https://github.com/yossarian81/r2dbc-issue

This app loads a minimal schema and some data, and then uses a Spring scheduled task to periodically read from the DB.
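For context, a minimal sketch of the kind of scheduled read the test app performs (hypothetical code based on the description and the schema above, not the actual app source; assumes Spring Boot 2.4+ with @EnableScheduling active):

```java
import org.springframework.r2dbc.core.DatabaseClient;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical sketch of the periodic read described above; the class and query are illustrative.
@Component
public class PeriodicReader {

    private final DatabaseClient databaseClient;

    public PeriodicReader(DatabaseClient databaseClient) {
        this.databaseClient = databaseClient;
    }

    // Runs once per second; assumes @EnableScheduling is configured elsewhere.
    @Scheduled(fixedRate = 1000)
    public void readTestTable() {
        databaseClient.sql("SELECT id, name FROM test_table")
                .map((row, metadata) -> row.get("name", String.class))
                .all()
                .doOnNext(name -> System.out.println("read: " + name))
                .doOnError(error -> System.err.println("read failed: " + error))
                .subscribe();
    }
}
```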

Steps:

The periodic data-read task never recovers; it hangs when trying to read, and with TRACE logging enabled the connection pool logs only:

"Obtaining new connection from the driver"

I have tried many connection pool settings and see the same behaviour.

Input Code

Expected behavior/code

The expectation is that the periodic task is able to read from the DB again once the DB has been restarted.

Possible Solution

Additional context

leewp commented 3 years ago

Is this the same issue as #68?

leewp commented 3 years ago

I found a way to avoid this. Use a config like this:

spring:
  r2dbc:
    url: r2dbcs:mysql://127.0.0.1:3306/xxx
    username: xx
    password: xx
    pool:
      max-idle-time: 2s

I noticed that the PoolBuilder<Connection, PoolConfig<Connection>> builder is initialized with the following eviction predicate:

BiPredicate<Connection, PooledRefMetadata> evictionPredicate = (connection, metadata) -> {
            long maxIdleTimeMills = maxIdleTime.toMillis();
            long maxLifeTimeMillis = maxLifeTime.toMillis();
            boolean isIdleTimeExceeded = maxIdleTimeMills != 0L && metadata.idleTime() >= maxIdleTimeMills;
            boolean isLifeTimeExceeded = maxLifeTimeMillis != 0L && metadata.lifeTime() >= maxLifeTimeMillis;
            return isIdleTimeExceeded || isLifeTimeExceeded;
        };

The disconnected connection falls back to idle, but the default max-idle-time is 30m. I tested with 2s and it works fine.
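For reference, a rough programmatic equivalent of the max-idle-time workaround above (a sketch assuming r2dbc-pool's ConnectionPoolConfiguration builder; the connection URL is illustrative):

```java
import java.time.Duration;

import io.r2dbc.pool.ConnectionPool;
import io.r2dbc.pool.ConnectionPoolConfiguration;
import io.r2dbc.spi.ConnectionFactories;
import io.r2dbc.spi.ConnectionFactory;

public class PoolConfigSketch {

    // Sketch: evict idle connections after 2 seconds, mirroring spring.r2dbc.pool.max-idle-time: 2s.
    public static ConnectionPool createPool() {
        ConnectionFactory connectionFactory =
                ConnectionFactories.get("r2dbc:postgresql://localhost:5432/test"); // illustrative URL

        ConnectionPoolConfiguration configuration = ConnectionPoolConfiguration.builder(connectionFactory)
                .maxIdleTime(Duration.ofSeconds(2)) // broken idle connections are discarded quickly
                .build();

        return new ConnectionPool(configuration);
    }
}
```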

leewp commented 3 years ago

I suggest modifying the evictionPredicate so that a connection that is no longer valid is evicted.

mp911de commented 3 years ago

Connection validation requires database I/O (Publisher<Boolean> Connection.validate(ValidationDepth depth)).

When connection validation fails, we invalidate the pooled object and propagate the error. Validation is guarded against timeouts by maxAcquireTime. acquireRetry defaults to 1, which means that upon a validation error we retry acquiring a new connection once. If that also fails, the acquisition error is propagated.
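For illustration, these settings can be configured on the pool; a sketch assuming the r2dbc-pool 0.8 builder API (the specific values are assumptions, not recommendations):

```java
import java.time.Duration;

import io.r2dbc.pool.ConnectionPoolConfiguration;
import io.r2dbc.spi.ConnectionFactory;
import io.r2dbc.spi.ValidationDepth;

public class ValidationConfigSketch {

    // Sketch: the validation-related knobs referenced above.
    static ConnectionPoolConfiguration configure(ConnectionFactory connectionFactory) {
        return ConnectionPoolConfiguration.builder(connectionFactory)
                .validationDepth(ValidationDepth.REMOTE) // validation performs a database round trip
                .maxAcquireTime(Duration.ofSeconds(5))   // guards acquisition (including validation) against hangs
                .acquireRetry(1)                         // retry acquiring a new connection once after a failure
                .build();
    }
}
```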

fgstewart commented 3 years ago

Connection validation requires database I/O (Publisher<Boolean> Connection.validate(ValidationDepth depth)).

When connection validation fails, we invalidate the pooled object and propagate the error. Validation is guarded against timeouts by maxAcquireTime. acquireRetry defaults to 1, which means that upon a validation error we retry acquiring a new connection once. If that also fails, the acquisition error is propagated.

If I set an acquire time of 2 seconds, the behavior changes and I get this error on the timer queries: "Did not observe any item or terminal signal within 2000ms".

But the connections don't recover. I also tried adding a "SELECT 1" validation query, with the same result.

Using the max idle time of 2 seconds as suggested above does fix this sample app, but if I change the recurring query timer to 1 second it breaks again, i.e. the query happens inside that idle window before the connection has been invalidated for idleness.

Thanks.

mp911de commented 3 years ago

"Did not observe any item or terminal signal within 2000ms" is a timeout signal. We likely need to dig deeper into why that happens. What could happen here is that all connections in the pool are broken and therefore the next connection to obtain is invalid, too.
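One way to check that hypothesis (a hypothetical sketch; assumes direct access to the ConnectionPool and the SPI validate method):

```java
import io.r2dbc.pool.ConnectionPool;
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ValidationDepth;
import reactor.core.publisher.Mono;

public class AcquireAndValidateSketch {

    // Sketch: acquire a connection and validate it against the server.
    // If every acquired connection reports false, the pool is handing out broken connections.
    static Mono<Boolean> acquireAndValidate(ConnectionPool pool) {
        return Mono.usingWhen(
                pool.create(),
                connection -> Mono.from(connection.validate(ValidationDepth.REMOTE))
                        .doOnNext(valid -> System.out.println("connection valid: " + valid)),
                Connection::close);
    }
}
```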

johnny commented 3 years ago

We've seen the exact same issue during a database restart.

First all connections drop:

WARN 9713 --- [reactor-tcp-epoll-1] i.r.p.client.ReactorNettyClient : Error: SEVERITY_LOCALIZED=FATAL, SEVERITY_NON_LOCALIZED=FATAL, CODE=57P01, MESSAGE=terminating connection due to administrator command, FILE=postgres.c, LINE=2902, ROUTINE=ProcessInterrupts
WARN 9713 --- [reactor-tcp-epoll-1] i.r.p.client.ReactorNettyClient : Error: SEVERITY_LOCALIZED=ERROR, CODE=08P01, MESSAGE=server conn crashed?

And then no new connections can be obtained (using Postgres):

2021-02-12 10:32:51.790 DEBUG 9713 --- [scheduler-1] io.r2dbc.pool.ConnectionPool             : Obtaining new connection from the driver
2021-02-12 10:32:52.790 DEBUG 9713 --- [scheduler-1] io.r2dbc.pool.ConnectionPool             : Obtaining new connection from the driver
2021-02-12 10:32:53.790 DEBUG 9713 --- [scheduler-1] io.r2dbc.pool.ConnectionPool             : Obtaining new connection from the driver
2021-02-12 10:32:54.790 DEBUG 9713 --- [scheduler-1] io.r2dbc.pool.ConnectionPool             : Obtaining new connection from the driver
2021-02-12 10:32:55.790 DEBUG 9713 --- [scheduler-1] io.r2dbc.pool.ConnectionPool             : Obtaining new connection from the driver
2021-02-12 10:32:56.790 DEBUG 9713 --- [scheduler-1] io.r2dbc.pool.ConnectionPool             : Obtaining new connection from the driver
2021-02-12 10:32:57.790 DEBUG 9713 --- [scheduler-1] io.r2dbc.pool.ConnectionPool             : Obtaining new connection from the driver
2021-02-12 10:32:58.790 DEBUG 9713 --- [scheduler-1] io.r2dbc.pool.ConnectionPool             : Obtaining new connection from the driver

Edit: Only one of three services was affected by this, the one which polls Postgres every second. The other services were running fine after the Postgres server restart. We're using PgBouncer in front of Postgres, which might explain this.

johnny commented 3 years ago

This issue is risking one of our releases. Is there something we can do to help?

mp911de commented 3 years ago

We need to investigate why this happens, and we haven't yet had sufficient time to do so. Feel free to take a look into the flow and into why some guards (such as timeouts) don't seem to work.

fgstewart commented 3 years ago

This may be of use: when we upgraded all our Spring/R2DBC libs, the behavior changed. With the versions below, what I see is that after the DB comes back up there are a couple more failures (which I think might remove bad connections from the pool) and then it reconnects.

I've not isolated this to one lib or the other, but upgrading the R2DBC libs alone doesn't seem to fix it.

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.4.2</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
...
        <dependency>
            <groupId>io.r2dbc</groupId>
            <artifactId>r2dbc-postgresql</artifactId>
            <version>0.8.6.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>io.r2dbc</groupId>
            <artifactId>r2dbc-pool</artifactId>
            <version>0.8.5.RELEASE</version>
        </dependency>

johnny commented 3 years ago

Similar to @yossarian81, upgrading to the above versions fixed the issue. The visible errors in io.r2dbc stayed the same (see below), and we suspect that the issue is caused somewhere in the Reactor libs, presumably reactor-pool. The call connectionPool.acquire() never succeeded after the bad state was reached.

service-messaging-647b99846-hszbn service-messaging Caused by: io.r2dbc.spi.R2dbcNonTransientResourceException: Connection validation failed
service-messaging-647b99846-hszbn service-messaging     at io.r2dbc.pool.Validation.lambda$validate$2(Validation.java:45) ~[r2dbc-pool-0.8.5.RELEASE.jar:0.8.5.RELEASE]
service-messaging-647b99846-hszbn service-messaging     at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onNext(FluxHandleFuseable.java:163) ~[reactor-core-3.3.12.RELEASE.jar:3.3.12.RELEASE]
service-messaging-647b99846-hszbn service-messaging     ... 90 common frames omitted

mp911de commented 3 years ago

Can you check where the logging gets issued from? That would help us investigate whether failures get propagated and logged, or whether the failure is just logged without affecting the application.
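If it helps with narrowing this down, a hypothetical sketch of instrumenting the acquisition path to see whether the validation failure reaches the subscriber or is only logged (the log category name is an assumption):

```java
import io.r2dbc.pool.ConnectionPool;
import io.r2dbc.spi.Connection;
import reactor.core.publisher.Mono;

public class AcquisitionTracingSketch {

    // Sketch: log all reactive signals on the acquisition Mono so it is visible
    // whether "Connection validation failed" is propagated as onError or swallowed.
    static Mono<Connection> acquireWithTracing(ConnectionPool pool) {
        return pool.create()
                .log("pool.acquire") // logs onSubscribe/onNext/onError for this operator chain
                .doOnError(error -> System.err.println("acquisition failed: " + error));
    }
}
```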

mp911de commented 3 years ago

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open the issue.