/cc @tsegismont
I will look into this @gsmet
This problem is usually solved by setting an appropriate idle timeout for the connection pool.
It is ConnectOptions.setIdleTimeout(), I believe.
In the test that covers this use case, we have set an idle timeout of 1 second, as you can see in the application.properties file.
The idle timeout value is not relevant. The issue is a race condition between the idle timeout and the moment the user uses the connection. In other words, sometimes the user gets a connection that is still visible in the PostgreSQL pg_stat_activity table, but some milliseconds later, when the user uses this connection, it is not available anymore, and then you get the reported exception.
Take a look at the test, especially at the moment when this issue takes place. For example, in the provided scenario, I could reproduce it when the user borrows a connection just 3 milliseconds before the idle expiration time. So, in my opinion, the idle timeout value is not that relevant: it could be 1 second, 5 minutes, or 1 hour, but if the user requests a connection 3 ms before this idle deadline, you can lose the connection.
Note: this issue happens on the PostgreSQL reactive datasource.
1. A pooled connection is idle; its idle timeout will expire at t + 2. Current time = t.
2. The user borrows this connection, which is still listed in the pg_stat_activity table. Current time = t + 1.
3. The user runs a query at t + 3 and gets "Fail to read any response from the server, the underlying connection might get lost unexpectedly." because the given connection (step 2) is not available anymore.

idleTimeout is an option that applies to the TCP client connection, not to the pool:
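For illustration, here is a minimal sketch of where that TCP-level option lives, assuming the Vert.x PostgreSQL client API; the connection details and pool size are placeholders:

```java
import io.vertx.core.Vertx;
import io.vertx.pgclient.PgConnectOptions;
import io.vertx.pgclient.PgPool;
import io.vertx.sqlclient.PoolOptions;

import java.util.concurrent.TimeUnit;

public class TcpIdleTimeout {
  public static PgPool create(Vertx vertx) {
    PgConnectOptions connectOptions = new PgConnectOptions()
        .setHost("localhost")      // placeholder connection details
        .setDatabase("mydb")
        .setUser("user")
        .setPassword("secret");

    // TCP-level idle timeout, inherited from NetClientOptions: the socket is
    // closed after 1 second of inactivity, even if the pool has just handed
    // the connection to a caller; hence the race described above.
    connectOptions.setIdleTimeout(1);
    connectOptions.setIdleTimeoutUnit(TimeUnit.SECONDS);

    return PgPool.pool(vertx, connectOptions, new PoolOptions().setMaxSize(5));
  }
}
```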
Idle timeout at the pool level is not supported yet, but it is planned for Vert.x 4.1.0: https://github.com/eclipse-vertx/vertx-sql-client/issues/41
I think we can close this (not a bug).
I decided to go the extra mile, just to try to make a stronger point about this issue:
First of all, I double-checked the Quarkus doc and it says "before it is removed from the pool". So it looks like this property (quarkus.datasource.reactive.idle-timeout) applies to the pool.
But please consider this extra test I created (https://github.com/quarkus-qe/beefy-scenarios/pull/71/files#diff-f9fa26a38bdba2a7d0e4ba2bb3706ab3db9faa3c590f91c84e492dbfad8dcf21R103), an integration test in the Quarkus QE repository. This test validates the behavior of releasing idle connections through a pool, and it works as expected, so the idle timeout itself does not look like the problem.
Also, please check this scenario I just created for this conversation, reproducing the issue I described in my previous comment (this test) but using Vert.x 4.0.0 (without Quarkus) and also without a DB pool (just plain connections). And you know what? I get exactly the same issue (scenario). The reason is described in the previous comment.
IMHO this is a bug that you can also reproduce in your local environment, and it is happening right now in production in several places, on several Vert.x/Quarkus versions, so please check the tests above and let me know your thoughts.
First of all, I double-checked the Quarkus doc and it says "before it is removed from the pool". So it looks like this property (quarkus.datasource.reactive.idle-timeout) applies to the pool.
Indeed, the doc is confusing. I will change it to "before it is closed". Once it is closed, it is removed from the pool, whether or not the user had borrowed it from the pool when the timeout expired.
@tsegismont how is it not a bug to get "Fail to read any response from the server, the underlying connection might get lost unexpectedly." errors from a pool? Or did I miss something?
It's not a bug because the feature does not exist yet, see https://github.com/quarkusio/quarkus/issues/14608#issuecomment-769151568
The idle timeout we have now applies at the TCP connection level and works as expected.
It is true, though, that the documentation is incorrect, and I just pushed a change to update it.
It's not a bug because the feature does not exist yet (at the pool level).
This issue is about getting a connection that is already closed (and it also fails on Vert.x 4 without a pool). This issue, "Vertx PgPool lost connection unexpectedly", can't be the normal behavior of your DB connections.
Could you provide an example of how to use this property, quarkus.datasource.reactive.idle-timeout, with a PostgreSQL PgPool? Or is this combination not supported? And if it is supported, could you make this test work?
Do you know if this behavior is the same in Db2 or MySQL?
This property has been added in #11498 in order to solve #11149.
Until eclipse-vertx/vertx-sql-client#41 is solved, we have nothing better than the TCP connection level properties, as discussed in the original issue.
In other words, there is currently no way to fix this test.
But this is not a bug, because the limitation was discussed when the properties were added.
And yes, that applies to Db2 and MySQL too.
I am a Vert.x end user running it in production environments. The issue mentioned here has occasionally appeared in production. I am not going to evaluate the internal architecture of Vert.x, but as an end user I consider it a bug in the toolkit, because it forces me to develop an external layer to validate whether the connection given by the Vert.x pool is valid, with the consequent loss of performance. The other option I have is not to set the idle timeout, but that causes the database connections to saturate over time.
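To illustrate the kind of external validation layer I mean, here is a rough sketch (withValidatedConnection is just a hypothetical helper name, not a Vert.x API) that probes a borrowed connection with a cheap query before handing it over:

```java
import io.vertx.core.Future;
import io.vertx.pgclient.PgPool;
import io.vertx.sqlclient.SqlConnection;

public class ValidatingBorrow {

  // Hypothetical helper: borrow a connection, run a cheap probe query, and
  // only pass the connection on if the probe succeeds; otherwise close it.
  public static Future<SqlConnection> withValidatedConnection(PgPool pool) {
    return pool.getConnection().compose(conn ->
        conn.query("SELECT 1").execute()
            .map(rows -> conn)                // probe succeeded, connection is live
            .onFailure(err -> conn.close())); // probe failed, drop the dead connection
  }
}
```

Note that this only narrows the race window (the connection can still die between the probe and the real query), and it costs an extra round trip per borrow, which is exactly the performance loss I mentioned.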
@abesada sorry for the inconvenience.
The other option I have is not to set the idle timeout, but that causes the database connections to saturate over time.
This sounds like a different problem. Can you open a separate issue and explain what you see? Thank you
If I'm reading this correctly, users can get a connection from the Vert.x PgPool that is already closed.
@pjgg says it's a bug, because end users should get valid connections only
@tsegismont says it's a known limitation because some parts are missing on Vert.x side
"the limitation was discussed when the properties were added" is mentioned in one of the comments above.
Where should users get the information that a closed connection could be received from PgPool?
I lean towards the bug case; users should really get only valid connections.
TBH we shouldn't force people to plan for this case; Quarkus/Vert.x is about good developer experience.
The other option I have is not to set the idle timeout, but that causes the database connections to saturate over time.
This is not a bug; it is a consequence of not setting an idle timeout. Imagine a database instance in the cloud that has a limit of 100 concurrent connections. Now you create microservices around that database. If each microservice is allowed up to a maximum of 25 connections, in theory you can only build four microservices without saturating it. But if you define an idle timeout, then you can run more microservices on the same resources (this is my case) without incurring more costs. The attached graph is a real example of what I am talking about; it is divided into two parts.
@rsvoboda we said the reconnect-attempts, reconnect-interval and idle-timeout options were TCP-level options here and here.
At this point we are forcing users to plan for the situation where the connection is closed. Users/customers need to add their own code to recover from it.
Only if they activate this property, which is a low-level TCP property, disabled by default.
It's not clear to me how the end user finds out that he/she needs to handle this situation.
If users catch a failure, they have to retry (example code with Mutiny below).
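Something along these lines, as a minimal sketch with the Mutiny variant of the client (query text and retry budget are placeholders):

```java
import io.smallrye.mutiny.Uni;
import io.vertx.mutiny.pgclient.PgPool;
import io.vertx.mutiny.sqlclient.Row;
import io.vertx.mutiny.sqlclient.RowSet;

public class RetryOnStaleConnection {

  // If the pool hands out a connection that the idle timeout already closed,
  // the first attempt fails; the retry borrows a fresh connection.
  public static Uni<RowSet<Row>> fetch(PgPool pool) {
    return pool.query("SELECT id, name FROM my_table") // placeholder query
        .execute()
        .onFailure().retry().atMost(2);                // placeholder retry budget
  }
}
```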
I proposed an update to the documentation in #14737. If that is not enough, we can discuss how to expand the documentation, or perhaps deprecate this property until a pool-level idle timeout is supported upstream.
@abesada thanks for the details. Indeed, without idle-timeout support at the pool level, you can only use the TCP-level option, with which, as you said, the issue occasionally appears in production.
With your previous comment I wasn't sure whether you had hit a bug that prevented you from using the pool altogether after some time. It seems that is not the case.
@tsegismont thank you for the pointers. So the limitations are documented as comments in the GH issue; we are missing the details in the Quarkus guide.
On the strategy I'd rather have input from @gsmet or @gavinking: A) expand the documentation and keep the implementation as is, or B) deprecate the property until a pool-level idle timeout is supported upstream.
Any idea about the plans for https://github.com/eclipse-vertx/vertx-sql-client/issues/41? On the roadmap / not in the near future / idle ...
If the decision is about enhancing the docs, https://quarkus.io/guides/reactive-sql-clients should contain the retry example and details linked from https://github.com/quarkusio/quarkus/issues/14608#issuecomment-770995090
Well, if I'm understanding all this correctly, it seems I had the same misunderstanding, and added a setting to HR that calls setIdleTimeout(), assuming that this was a "pool" timeout.
I guess I would say it's sorta pretty much critical that vert.x implements this feature in both 3 and 4 branches.
@rsvoboda with Vert.x 4.1 will come a new pool implementation. The refactoring PR is being reviewed. eclipse-vertx/vertx-sql-client#41 is planned for 4.1, but on hold until the refactoring PR is merged.
In the meantime, I will update #14737 to add the retry example and details about the TCP-level idle-timeout property.
@gavinking I will check with the team how we can prioritize an implementation of pool idle timeout in the 3.9 branch too.
Excellent, thanks.
@tsegismont I have added a proposed pool idle implementation that passes all the tests:
Reopening the issue, as it got auto-closed by merging the JavaDoc update PR.
We recently upgraded to Vert.x 4.1; could someone check if this issue can be closed, please?
This issue is fixed by Quarkus upstream bringing Vert.x 4.1 and can be closed! CC @Sanne
Awesome, many thanks @pjgg .
@gsmet I suppose we can't retroactively tag #14608 as having fixed this issue? Could you please check if the milestones/flags need adjustments?
I am facing this issue with Vert.x version 4.3.1. Is this expected, or should the fix in Vert.x 4.1 handle it?
@Gayathri2102 you may get the exception because the pool idle timeout is not set or is too high.
See https://quarkus.io/guides/reactive-sql-clients.html#pooled-connection-idle-timeout
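Since Vert.x 4.1 the idle timeout can be set on the pool itself, which is what the Quarkus property in that guide maps to. A minimal sketch with placeholder values:

```java
import io.vertx.core.Vertx;
import io.vertx.pgclient.PgConnectOptions;
import io.vertx.pgclient.PgPool;
import io.vertx.sqlclient.PoolOptions;

import java.util.concurrent.TimeUnit;

public class PoolIdleTimeout {
  public static PgPool create(Vertx vertx, PgConnectOptions connectOptions) {
    PoolOptions poolOptions = new PoolOptions()
        .setMaxSize(5)                 // placeholder pool size
        // Pool-level eviction (Vert.x 4.1+): idle connections are removed
        // from the pool instead of being handed out after they were closed.
        .setIdleTimeout(60)
        .setIdleTimeoutUnit(TimeUnit.SECONDS);
    return PgPool.pool(vertx, connectOptions, poolOptions);
  }
}
```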
I want to leave a comment here, in case it helps somebody who looks at this in the future. Our team hit the same issue with a Grafana k6 stress test (1000 virtual users) and an HAProxy maxconn (200) lower than the configured client pool max size (800).
When a user gets a connection from pgPool some milliseconds before the idle expiration time, then "sometimes", when you go to use the connection, you get the following error from the server: "Fail to read any response from the server, the underlying connection might get lost unexpectedly."
So it looks like the connection is available when you request it, but then disappears when you use it.
Expected behavior
The connection should be validated (maybe with a query) before an available connection is returned.
Actual behavior
An end user could get a connection that is not available anymore.
To Reproduce
You could run the following test
Jira Ref