Closed · rogtejada closed this 2 years ago
@mp911de
Searching in https://github.com/r2dbc/r2dbc-pool/issues/143, I cloned the source code posted there: https://github.com/stefannegele/connection-leak-demo
The behavior is the same: the connection is not released after cancellation. I only needed to bump the number of executions from 5 to 10 to get the same result.
I have also tested against the new 0.8.8 version and the problem still occurs.
It seems simpler to reproduce with the code in the repository above.
@mp911de
I was looking over the implementation of the connection pool: https://github.com/r2dbc/r2dbc-pool/blob/main/src/main/java/io/r2dbc/pool/ConnectionPool.java#:~:text=PoolBuilder%3CConnection%2C%20PoolConfig,Runtime.getRuntime().availableProcessors())%3B
`PoolBuilder` has a method to register a `releaseHandler`. Would it be possible to force implementation of a release method in the SPI `Connection` interface? That way `PostgresqlConnection` in r2dbc-postgresql would need a release mechanism, which could be registered at pool creation.
I tried using `commitTransaction` as the release handler, and the connection does seem to get released, but a commit will not always be the best way to release the connection. I would like to hear your opinion on the matter.
I think we have the same issue in production, on version 0.8.7.RELEASE.
(screenshot: connection leak as seen in Grafana)
Our service sometimes gets closed by gRPC when clients refuse their connections. I'll provide more information about the problem, but the situation is not good :(
Does 0.9.0.M2 solve the problem?
> The PoolBuilder has a method to register a releaseHandler, is it possible to force implementation of a release method in the SPI Connection interface? This way PostgresqlConnection in r2dbc-postgresql would need a release mechanism and it could be registered on pool creation.

You can register a customizer through `ConnectionPoolConfiguration.Builder.customizer(…)` if you want to. Note that `releaseHandler` gets called only when the connection is released by calling `Connection.close()`.

I'm still wondering about the cause. If you can reliably reproduce the issue, I would like to ask you to either register a `doOnDiscard(…)` callback that closes the connection (close + subscribe) or move `doFinally` around. I think that `timeout`/`onErrorMap`/`retry` operators might interfere with the cancellation, so the cleanup guards catch only some cancellations but not all of them.
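The cleanup-guard concern can be illustrated with a tiny self-contained model (not the real r2dbc-pool code, and no Reactor involved): if release only happens in a completion-time callback, a cancellation that arrives after acquisition leaks the connection; a discard/cancel guard that also releases closes that window.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the leak: a pool counts acquired connections, and the
// caller is supposed to release in a finally-style callback. If the subscriber
// cancels after acquisition but before that callback runs, the connection
// stays "acquired" forever unless a cancellation guard also releases it.
public class CancellationLeakModel {

    static class Pool {
        final AtomicInteger acquired = new AtomicInteger();
        void acquire() { acquired.incrementAndGet(); }
        void release() { acquired.decrementAndGet(); }
    }

    // Path 1: release happens only on normal completion -> cancel leaks.
    static void runWithoutGuard(Pool pool, boolean cancelled) {
        pool.acquire();
        if (cancelled) {
            return; // cancellation: the completion-time release never fires
        }
        pool.release();
    }

    // Path 2: a discard/cancel guard releases the connection as well.
    static void runWithGuard(Pool pool, boolean cancelled) {
        pool.acquire();
        if (cancelled) {
            pool.release(); // guard: release on cancellation too
            return;
        }
        pool.release();
    }

    public static void main(String[] args) {
        Pool leaky = new Pool();
        runWithoutGuard(leaky, true);
        System.out.println("without guard, acquired=" + leaky.acquired.get()); // 1 -> leaked

        Pool guarded = new Pool();
        runWithGuard(guarded, true);
        System.out.println("with guard, acquired=" + guarded.acquired.get()); // 0
    }
}
```

The hard part in the real pool, as noted above, is that operators such as `timeout` or `retry` sit between the cancellation and the guard, so not every cancellation reaches the release path.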
@DenisMd 0.9 updates to the new R2DBC spec, it's not a rewrite of the pool or the like.
I was able to reproduce the leaks too. Simple code for a WebFlux request:

```kotlin
log.info("Before 1 sec")
databaseClient.sql("select pg_sleep(1)").await()
log.info("After 1 sec")
```

Normal log for one standalone request:

```
2021-11-08 14:42:24.487  INFO [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] a.p.s.core.service.SdiService$Companion : Before 1 sec
2021-11-08 14:42:24.487 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] io.r2dbc.pool.ConnectionPool            : Obtaining new connection from the driver
2021-11-08 14:42:24.487 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [ctor-http-nio-6] io.r2dbc.postgresql.QUERY               : Executing query: select pg_sleep(1)
2021-11-08 14:42:25.590 DEBUG [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [actor-tcp-nio-1] io.r2dbc.pool.PooledConnection          : Releasing connection
2021-11-08 14:42:25.590  INFO [,2f3d52ee53e5e389,2f3d52ee53e5e389] 13589 --- [actor-tcp-nio-1] a.p.s.core.service.SdiService$Companion : After 1 sec
```
Then I launched a JMeter load scenario (100 requests in parallel, 3.5 s response timeout).
After completion, the Prometheus metric was:

```
r2dbc_pool_acquired_connections{name="connectionFactory",} 5.0
```

and it never went down.
build.gradle.kts:

```kotlin
plugins {
    id("application")
    id("org.springframework.boot") version "2.5.1"
    id("io.spring.dependency-management") version "1.0.11.RELEASE"
    kotlin("jvm") version "1.5.10"
    kotlin("plugin.spring") version "1.5.10"
    id("org.jlleitschuh.gradle.ktlint") version "10.1.0"
    id("org.jetbrains.kotlin.kapt") version "1.5.10"
}

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-actuator")
    implementation("org.springframework.boot:spring-boot-starter-webflux")
    // Database r2dbc
    implementation("org.springframework.boot:spring-boot-starter-data-r2dbc")
    runtimeOnly("io.r2dbc:r2dbc-postgresql")
}
```
Interesting case: if I execute the SQL request inside a transaction, there are no leaks.

```kotlin
log.info("Before 1 sec")
transactionalOperator.executeAndAwait {
    databaseClient.sql("select pg_sleep(1)").await()
}
log.info("After 1 sec")
```
Thanks for your reproducers. I'm finally able to reproduce the issue although it requires multiple attempts. Trying to figure out what's happening and where connections get lost.
Refining the allocation and cancellation guards indeed reduced (fixed?) the number of test runs in which connections were lost in the connection pool. While tracing code paths that might skip connection release, I found that `Flux.usingWhen(…)` seems to see a connection that is emitted but not released. I need to analyze this further.
I reported the issue back to the Reactor team to get their input.
@mp911de We could reproduce the issue even with the recent changes to r2dbc-pool, but we agree that the problem may lie in the Reactor implementation. We will investigate on our side; any new discovery will be posted here.
@mp911de in our tests we added the following code in `doFinally`:

```java
return Operators.discardOnCancel(conn, () -> {
        logger.debug("release discardOnCancel");
        ref.release().subscribe();
        return false;
    })
    .doFinally(s -> {
        logger.debug("doFinally conn {}", s.toString());
        if (!connection.isClosed()) {
            logger.debug("close onfinally");
            connection.close().subscribe();
        }
    });
```

and with this all connections were released... but we don't know if this is the best solution.
> We could reproduce the issue even with the recent changes on r2dbc-pool

Please note that we cannot treat the problem as solved unless the leak in `usingWhen` is fixed. That lies outside of R2DBC Pool.
@mp911de
This approach seems like a good idea; the connections stop getting stuck, but I'm not sure whether it closes the connection prematurely. With this change, 3 unit tests fail:

- `shouldReusePooledConnectionAfterTimeout()` -> no exception signal is emitted
- `shouldReportMetrics()` -> it expects `acquiredSize` to be one, but it is zero since the connection is already closed
- `shouldCreateMultipleConnections()` -> it expects two connections, but one of them is already closed

Even though the behavior seems correct, the unit tests show that the connection is somehow being closed while it should still be open. @mp911de do you have any concerns about that? I would like to hear your opinion on the matter.
This is not a final solution, but it could be a workaround while the Reactor problem is unsolved.
> Please note that we cannot treat the problem as solved unless the leak in `usingWhen` is fixed. That lies outside of R2DBC Pool.
Is this `usingWhen` used in r2dbc-proxy? I don't find any `usingWhen` in r2dbc-pool.
Commenting for increased visibility: it looks like the reactor-core team has a potential solution, see https://github.com/reactor/reactor-core/pull/2859 .
If you can reproduce the issue locally, confirmation that the fix works would be appreciated. See https://github.com/reactor/reactor-core/issues/2836#issuecomment-999797297 .
@Ernir @mp911de
The change in Reactor Core seems to fix this issue. However, when running a lot of simultaneous requests (50k+), an error shows up:

```
Error starting response. Replying error status io.r2dbc.postgresql.client.ReactorNettyClient$RequestQueueException: Cannot exchange messages because the request queue limit is exceeded
```

I guess this is just the limitation of a single instance processing a lot of requests while trying to simulate the lock behavior; in a real application these requests would be spread across multiple instances and the queue limit would be enough. Am I right? Are there any concerns we should address about this queue limitation?
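The `RequestQueueException` above comes from the driver's bounded request queue. A client-side analogue (hypothetical, not an r2dbc API) is to cap in-flight work before it reaches the driver, e.g. with a semaphore, or in Reactor by bounding `flatMap` concurrency. A self-contained sketch of the capping idea:

```java
import java.util.concurrent.Semaphore;

// Model of a bounded request queue: only a fixed number of "queries" may be
// in flight at once; requests beyond the cap are rejected (or must wait),
// which is the behavior the driver's queue limit enforces.
public class BoundedRequests {
    public static void main(String[] args) {
        Semaphore inFlight = new Semaphore(4); // cap: 4 concurrent queries
        int rejected = 0;
        for (int i = 0; i < 10; i++) {
            if (inFlight.tryAcquire()) {
                // the query would run here; the permit is released on completion
            } else {
                rejected++; // over the cap: shed the request or queue it upstream
            }
        }
        System.out.println("rejected=" + rejected); // 6 of 10 exceed the cap
    }
}
```

In a real reactive pipeline the same effect is usually achieved by limiting concurrency upstream rather than letting requests pile up at the driver.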
@rogtejada Have you published a snapshot? We would like to test it in our project.
@figroc No. An issue was opened on Project Reactor (https://github.com/reactor/reactor-core/issues/2836) where a solution was proposed in this PR: https://github.com/reactor/reactor-core/pull/2859. I just built that branch locally and used the resulting artifact to run my tests.
I can confirm that upgrading to Reactor Core 3.4.14 snapshots fixes the issue.
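For anyone who wants to verify the fix the same way, a minimal build.gradle.kts fragment (hypothetical; version and repository as assumed from the thread) that forces the patched Reactor Core:

```kotlin
repositories {
    mavenCentral()
    maven("https://repo.spring.io/snapshot") // only needed for -SNAPSHOT builds
}

dependencies {
    // Force the Reactor Core release carrying the usingWhen cancellation fix
    implementation("io.projectreactor:reactor-core:3.4.14")
}
```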
Bug Report
Current Behavior
This issue was first opened on r2dbc-postgresql; some discussion can be seen there: https://github.com/pgjdbc/r2dbc-postgresql/issues/459
Then it was opened here: https://github.com/r2dbc/r2dbc-pool/issues/140
The result was a fix in the pool, but it does not fix the problem.
We have created a new project without Spring dependencies and the connection still gets stuck.
You can find the source code here: https://github.com/jfrossetto/r2dbc-stress-cancel
LOGFILE: https://github.com/jfrossetto/r2dbc-stress-cancel/blob/master/log_stresscancel.txt
There was another issue opened with the same problem here: https://github.com/r2dbc/r2dbc-pool/issues/143