r2dbc / r2dbc-pool

Connection Pooling for Reactive Relational Database Connectivity
https://r2dbc.io
Apache License 2.0

Connection not released when request is cancelled #144

Closed rogtejada closed 2 years ago

rogtejada commented 2 years ago

Bug Report

Versions

Current Behavior

This issue was first opened on r2dbc-postgresql; some discussion can be seen there: https://github.com/pgjdbc/r2dbc-postgresql/issues/459

It was then opened here: https://github.com/r2dbc/r2dbc-pool/issues/140

The result was a fix in the pool, but it does not solve the problem.

We have created a new project without Spring dependencies and connections still get stuck.

You can find the source code here: https://github.com/jfrossetto/r2dbc-stress-cancel

Stuck connections as seen in pg_stat_activity:

| pid | datname | usename | application_name | backend_start | query_start | query |
| --- | --- | --- | --- | --- | --- | --- |
| 3065 | postgres | postgres | sample | 2021-11-01 10:02:41.308 | 2021-11-01 10:02:45.159 | SELECT 1 |
| 3064 | postgres | postgres | sample | 2021-11-01 10:02:41.307 | 2021-11-01 10:02:46.358 | SELECT 1 |
| 3066 | postgres | postgres | sample | 2021-11-01 10:02:45.640 | 2021-11-01 10:02:50.739 | SELECT 1 |
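
For context, a reproducer of this shape triggers the behaviour: run many short queries and cancel them with a timeout, then watch pg_stat_activity and the pool metrics. This is a minimal sketch only — the connection URL, credentials, timings and operator choices are assumptions; the exact code lives in the repository linked above.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.ConnectionFactories;
import io.r2dbc.spi.ConnectionFactory;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class StressCancel { // hypothetical class name

    public static void main(String[] args) {
        // assumed pooled Postgres URL; r2dbc-pool and r2dbc-postgresql on the classpath
        ConnectionFactory factory = ConnectionFactories.get(
                "r2dbc:pool:postgresql://postgres:postgres@localhost:5432/postgres");

        Flux.range(0, 100)
                .flatMap(i -> Mono.usingWhen(
                                factory.create(),
                                connection -> Mono.from(connection.createStatement("SELECT pg_sleep(1)").execute())
                                        .flatMap(result -> Mono.from(result.getRowsUpdated())),
                                Connection::close)
                        // cancel the in-flight query; the connection should still be returned to the pool
                        .timeout(Duration.ofMillis(100))
                        .onErrorResume(TimeoutException.class, e -> Mono.empty()), 20)
                .blockLast();
    }
}
```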

LOGFILE: https://github.com/jfrossetto/r2dbc-stress-cancel/blob/master/log_stresscancel.txt

There was another issue opened with the same problem here: https://github.com/r2dbc/r2dbc-pool/issues/143


rogtejada commented 2 years ago

@mp911de

Following https://github.com/r2dbc/r2dbc-pool/issues/143, I have cloned the source code posted there: https://github.com/stefannegele/connection-leak-demo

The behavior is the same: the connection is not released after cancellation. I only needed to bump the number of executions from 5 to 10 to get the same result.

I have also tested against the new 0.8.8 version and the problem still occurs.

It seems simpler to reproduce with the code in the repository above, though.

rogtejada commented 2 years ago

@mp911de

I was looking over the implementation of the connection pool: https://github.com/r2dbc/r2dbc-pool/blob/main/src/main/java/io/r2dbc/pool/ConnectionPool.java#:~:text=PoolBuilder%3CConnection%2C%20PoolConfig,Runtime.getRuntime().availableProcessors())%3B

The PoolBuilder has a method to register a releaseHandler. Would it be possible to require a release method in the SPI Connection interface? That way PostgresqlConnection in r2dbc-postgresql would have to provide a release mechanism, and it could be registered on pool creation.

I have tried using commitTransaction as the release handler and the connection does seem to get released, but a commit will not always be the right way to release a connection. I would like to hear your opinion on that.

DenisMd commented 2 years ago

I think we have the same issue in production, on version 0.8.7.RELEASE.

Connection leak (Grafana screenshot)

Our service sometimes gets closed by gRPC when clients refuse their connections. I'll provide more information about the problem, but the situation is not good :(

Does 0.9.0.M2 solve the problem?

mp911de commented 2 years ago

The PoolBuilder have a method to register a releaseHandler, is it possible to force implementation of a release method in SPI Connection interface? This way PostgresqlConnection in r2dbc-postgresql will need to have a release mechanism and it could be registered on the pool creation.

You can register a customizer through ConnectionPoolConfiguration.Builder.customizer(…) if you want to. Note that the releaseHandler only gets called when the connection is released by calling Connection.close().
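
A sketch of what that registration could look like, using rogtejada's commitTransaction experiment as the release action (the connection URL and pool size are made up for illustration, and whether a commit is the right release action is exactly the open question):

```java
ConnectionFactory connectionFactory = ConnectionFactories.get(
        "r2dbc:postgresql://postgres:postgres@localhost:5432/postgres"); // assumed URL

ConnectionPoolConfiguration configuration = ConnectionPoolConfiguration.builder(connectionFactory)
        .maxSize(5) // arbitrary size for the example
        // reactor-pool releaseHandler: runs whenever a connection is handed back via Connection.close()
        .customizer(poolBuilder -> poolBuilder.releaseHandler(connection ->
                Mono.from(connection.commitTransaction())))
        .build();

ConnectionPool pool = new ConnectionPool(configuration);
```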

I'm still wondering about the cause. If you can reliably reproduce the issue, then I would like to ask you to either register a doOnDiscard(…) callback that closes the connection (close + subscribe) or move `doFinally` around. I think the timeout/onErrorMap/retry operators might interfere with cancellation and that the cleanup guards only catch some cancellations, not all of them.
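
A rough sketch of that diagnostic (a minimal example; the plain ConnectionFactory, the SELECT 1 query and the timeout used to force a cancellation are assumptions, not the reproducer's actual code):

```java
Mono.usingWhen(
        connectionFactory.create(),
        connection -> Mono.from(connection.createStatement("SELECT 1").execute())
                .flatMap(result -> Mono.from(result.getRowsUpdated())),
        Connection::close)
    .timeout(Duration.ofMillis(100))
    // close + subscribe any Connection that Reactor discards instead of delivering,
    // e.g. because the downstream cancelled, so otherwise-leaked connections get released
    .doOnDiscard(Connection.class, connection -> Mono.from(connection.close()).subscribe())
    .subscribe();
```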

mp911de commented 2 years ago

@DenisMd 0.9 updates to the new R2DBC spec, it's not a rewrite of the pool or the like.

DenisMd commented 2 years ago

I was able to reproduce the leaks too. Simple code for a WebFlux request:

```kotlin
log.info("Before 1 sec")
databaseClient.sql("select pg_sleep(1)").await()
log.info("After 1 sec")
```

Normal log for a single standalone request:

```
2021-11-08 14:42:24.487 INFO [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] a.p.s.core.service.SdiService$Companion : Before 1 sec
2021-11-08 14:42:24.487 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] io.r2dbc.pool.ConnectionPool : Obtaining new connection from the driver
2021-11-08 14:42:24.487 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] io.r2dbc.postgresql.QUERY : Executing query: select pg_sleep(1)
2021-11-08 14:42:25.590 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [actor-tcp-nio-1] io.r2dbc.pool.PooledConnection : Releasing connection
2021-11-08 14:42:25.590 INFO [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [actor-tcp-nio-1] a.p.s.core.service.SdiService$Companion : After 1 sec
```

Then I launched a JMeter load scenario (100 requests in parallel and a 3.5 second response timeout).

After completion, the Prometheus metric read:

r2dbc_pool_acquired_connections{name="connectionFactory",} 5.0

and it never went down.

build.config

```kotlin
plugins {
    id("application")
    id("org.springframework.boot") version "2.5.1"
    id("io.spring.dependency-management") version "1.0.11.RELEASE"
    kotlin("jvm") version "1.5.10"
    kotlin("plugin.spring") version "1.5.10"
    id("org.jlleitschuh.gradle.ktlint") version "10.1.0"
    id("org.jetbrains.kotlin.kapt") version "1.5.10"
}

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("org.springframework.boot:spring-boot-starter-webflux")

    // Database r2dbc
    implementation("org.springframework.boot:spring-boot-starter-data-r2dbc")
    runtimeOnly("io.r2dbc:r2dbc-postgresql")
}
```

DenisMd commented 2 years ago

Interesting case: if I execute the SQL request inside a transaction, there is no leak.

```kotlin
log.info("Before 1 sec")
transactionalOperator.executeAndAwait {
    databaseClient.sql("select pg_sleep(1)").await()
}
log.info("After 1 sec")
```

mp911de commented 2 years ago

Thanks for your reproducers. I'm finally able to reproduce the issue although it requires multiple attempts. Trying to figure out what's happening and where connections get lost.

mp911de commented 2 years ago

Refining the allocation and cancellation guards indeed helped to reduce (eliminate?) the test runs in which connections were lost in the connection pool. While tracing code paths that might skip connection release, I found that Flux.usingWhen(…) seems to see a connection that is emitted but never released. I need to analyze this further.
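
For readers following along, this is roughly the shape of the usingWhen usage in question (a simplified illustration, not the actual r2dbc-pool source; `pool` stands in for an io.r2dbc.pool.ConnectionPool):

```java
Flux<Result> results = Flux.usingWhen(
        pool.create(),                                                   // emit a pooled connection
        connection -> connection.createStatement("SELECT 1").execute(),  // run the statement on it
        Connection::close);                                              // cleanup, expected on complete, error and cancel
```

If the subscriber cancels after the connection has been emitted but before the statement completes, that cleanup step is the only thing standing between the connection and a leak, so a cancellation that slips past it leaves the connection acquired.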

mp911de commented 2 years ago

I reported the issue back to the Reactor team to get their input.

rogtejada commented 2 years ago

@mp911de We could reproduce the issue even with the recent changes on r2dbc pool, but we agree that the problem may somehow lie in the Reactor implementation. We will investigate it on our side; any new discovery will be posted here.

jfrossetto commented 2 years ago

@mp911de in our tests we added the following code around doFinally:

```java
return Operators.discardOnCancel(conn, () -> {
    logger.debug("release discardOnCancel");
    ref.release().subscribe();
    return false;
})
        .doFinally(s -> {
            logger.debug("doFinally conn {}", s.toString());
            if (!connection.isClosed()) {
                logger.debug("close onfinally");
                connection.close().subscribe();
            }
        });
})
```

and with this all connections were released... but we don't know if this is the best solution.

mp911de commented 2 years ago

We could reproduce the issue even with the recent changes on r2dbc pool

Please note that we cannot treat the problem as solved unless the leak in usingWhen is fixed. That lies outside of R2DBC Pool.

rogtejada commented 2 years ago

@mp911de

The doFinally workaround above seems like a good idea: the connections stop getting stuck. But I'm not sure whether it closes the connection prematurely; with this change, 3 unit tests fail:

- shouldReusePooledConnectionAfterTimeout() -> no exception signal is emitted
- shouldReportMetrics() -> it expects acquiredSize to be one, but it is zero since the connection is already closed
- shouldCreateMultipleConnections() -> it expects two connections, but one of them is already closed

Even though the behaviour seems correct, the unit tests show that the connection is somehow being closed when it should still be open. @mp911de, do you have any concerns about that? I would like to hear your opinion.

This is not a final solution, but it could serve as a workaround while the Reactor problem is not yet solved.


figroc commented 2 years ago

We could reproduce the issue even with the recent changes on r2dbc pool

Please note that we cannot treat the problem as solved unless the leak in usingWhen is fixed. That lies outside of R2DBC Pool.

Is this usingWhen used in r2dbc-proxy? I can't find any usingWhen in r2dbc-pool.

Ernir commented 2 years ago

Commenting for increased visibility - looks like the reactor-core team has a potential solution, see https://github.com/reactor/reactor-core/pull/2859 .

If you can reproduce the issue locally, confirmation that the fix works is desired. See https://github.com/reactor/reactor-core/issues/2836#issuecomment-999797297 .

rogtejada commented 2 years ago

@Ernir @mp911de

The change done in Reactor Core seems to fix this issue. However, when running a lot of simultaneous requests (50k+), an error shows up:

Error starting response. Replying error status io.r2dbc.postgresql.client.ReactorNettyClient$RequestQueueException: Cannot exchange messages because the request queue limit is exceeded

I guess this is just a limitation of a single instance processing a lot of requests while trying to simulate the locking behaviour; in a real application these requests would be spread across multiple instances and the queue limit would be enough, am I right? Are there any concerns we should address about this queue limitation?
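
One way a stress test could stay under that limit is to bound the number of in-flight queries on the client side; a hedged sketch (the concurrency value and the `pool` variable are assumptions for illustration, not guidance from the maintainers):

```java
// Cap concurrent exchanges so a single test instance does not overrun the driver's request queue.
Flux.range(0, 50_000)
        .flatMap(i -> Mono.usingWhen(
                        pool.create(),
                        connection -> Mono.from(connection.createStatement("SELECT 1").execute())
                                .flatMap(result -> Mono.from(result.getRowsUpdated())),
                        Connection::close),
                256) // at most 256 queries in flight; an assumed value, tune to pool and queue sizes
        .blockLast();
```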

figroc commented 2 years ago

@rogtejada Have you published a snapshot? We would like to test it in our project.

rogtejada commented 2 years ago

@figroc No. An issue was opened on project-reactor (https://github.com/reactor/reactor-core/issues/2836) and a solution was proposed in this PR (https://github.com/reactor/reactor-core/pull/2859). I just built that branch locally and used the artifact to run my tests.

mp911de commented 2 years ago

I can confirm that upgrading to Reactor Core 3.4.14 snapshots fixes the issue.