quarkusio / quarkus


Vertx PgPool lost connection unexpectedly #14608

Closed pjgg closed 3 years ago

pjgg commented 3 years ago

When a user gets a connection from pgPool a few milliseconds before the idle expiration time, then "sometimes", when you go to use the connection, you get the following error from the server:

Fail to read any response from the server, the underlying connection might get lost unexpectedly.

So it looks like the connection is available when you request it, but then disappears when you use it.

Expected behavior: The connection should be validated (maybe by a query) before an available connection is returned.

Actual behavior: An end user could get a connection that is not available anymore.

To Reproduce

You could run the following test

Jira Ref

gsmet commented 3 years ago

/cc @tsegismont

tsegismont commented 3 years ago

I will look into this @gsmet

gavinking commented 3 years ago

This problem is usually solved by setting an appropriate idle timeout for the connection pool.

gavinking commented 3 years ago

> This problem is usually solved by setting an appropriate idle timeout for the connection pool.

It is ConnectOptions.setIdleTimeout(), I believe.
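
For reference, a minimal sketch of that setter on the PostgreSQL client options (connection details and the value are placeholders; the setter is the one inherited from the TCP client options):

```java
import io.vertx.pgclient.PgConnectOptions;

PgConnectOptions connectOptions = new PgConnectOptions()
    .setHost("localhost")
    .setPort(5432)
    .setDatabase("mydb")
    .setUser("user")
    .setPassword("secret");
// Inherited from the TCP client options; the default unit is seconds.
connectOptions.setIdleTimeout(60);
```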

pjgg commented 3 years ago

In the test that covers this use case, we have set an idle timeout of 1 sec, as you can see in the application.properties file.

The idle timeout value is not relevant. The issue is a race condition between the idle timeout and the moment the user uses the connection. In other words, sometimes the user gets a connection that appears in the PostgreSQL pg_stat_activity table, but a few milliseconds later, when the user uses this connection, it is not available anymore, and then you get the reported exception.

Take a look at the test, especially at when this issue takes place. For example, in the provided scenario, I could reproduce it if the user borrows a connection just 3 milliseconds ahead of the idle expiration time. So, in my opinion, the idle value itself is not really relevant; it could be 1 second, 5 minutes, or 1 hour, but if the user requests a connection 3 ms ahead of this idle time, then you can lose the connection.

Note: this issue happens on the PostgreSQL reactive data source.

Timeline

  1. The user borrows a connection with an idle expiration of t + 2. Current time = t.
  2. pgPool returns an available connection that is in the pg_stat_activity table. Current time = t + 1.
  3. The user, after doing some work that takes some time, makes a query with the given connection (step 2). Current time = t + 3.
  4. PgPool throws the exception Fail to read any response from the server, the underlying connection might get lost unexpectedly because the given connection (step 2) is not available anymore (see the sketch below).
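
A minimal sketch of that race, using the plain Vert.x 4 PostgreSQL client (this is not the linked test; connection details and timings are placeholders, and the 1-second idle timeout here is the TCP-level option discussed below):

```java
import io.vertx.core.Vertx;
import io.vertx.pgclient.PgConnectOptions;
import io.vertx.pgclient.PgPool;
import io.vertx.sqlclient.PoolOptions;

public class IdleRaceSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();

        PgConnectOptions connectOptions = new PgConnectOptions()
            .setHost("localhost").setPort(5432)
            .setDatabase("testdb").setUser("test").setPassword("test");
        // TCP-level idle timeout (default unit: seconds).
        connectOptions.setIdleTimeout(1);

        PgPool pool = PgPool.pool(vertx, connectOptions, new PoolOptions().setMaxSize(5));

        pool.getConnection().onSuccess(conn ->
            // Hold the borrowed connection past the idle timeout before using it.
            vertx.setTimer(1_500, id -> conn.query("SELECT 1").execute()
                .onSuccess(rows -> System.out.println("query ok"))
                // When the race hits, this handler receives the reported error:
                // "Fail to read any response from the server, the underlying
                // connection might get lost unexpectedly."
                .onFailure(err -> System.out.println("query failed: " + err.getMessage()))));
    }
}
```
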
tsegismont commented 3 years ago

idleTimeout is an option that applies to the TCP client connection, not to the pool:

https://github.com/quarkusio/quarkus/blob/b49c2eac24a0383f5c815aef960a809a8aa0f57e/extensions/reactive-datasource/runtime/src/main/java/io/quarkus/reactive/datasource/runtime/DataSourceReactiveRuntimeConfig.java#L112-L116

Idle timeout at the pool level is not supported yet but planned for Vert.x 4.1.0 https://github.com/eclipse-vertx/vertx-sql-client/issues/41

I think we can close this (not a bug).

pjgg commented 3 years ago

I decided to go the extra mile, just to try to make a stronger point about this issue:

First of all, I double-checked the Quarkus doc and it says "before it is removed from the pool". So it looks like this property (quarkus.datasource.reactive.idle-timeout) applies to the pool.

But please consider this extra test I created: (https://github.com/quarkus-qe/beefy-scenarios/pull/71/files#diff-f9fa26a38bdba2a7d0e4ba2bb3706ab3db9faa3c590f91c84e492dbfad8dcf21R103). It is an integration test in the Quarkus QE repository. This test validates the behavior of releasing idle connections when using a pool, and it works as expected, so the idle timeout by itself does not look like the issue.

Also, please check this scenario I just created for this conversation, reproducing the issue I described in my previous comment (this test), but using Vert.x 4.0.0 (without Quarkus) and also without a DB pool (just using connections), and do you know what? ... I get exactly the same issue. (scenario) The reason is described in the previous comment.

IMHO this is a bug that you can also reproduce in your local environment, and it is happening right now in production in several places, with Vert.x/Quarkus in several versions, so please check the tests above and let me know your thoughts.

tsegismont commented 3 years ago

> First of all, I double-checked the Quarkus doc and it says "before it is removed from the pool". So it looks like this property (quarkus.datasource.reactive.idle-timeout) applies to the pool.

Indeed the doc is confusing. I will change it to "before it is closed". After it's closed, it's removed from the pool, whether or not the user has borrowed it from the pool at the time the timeout expired.

gsmet commented 3 years ago

@tsegismont how is it not a bug to have Fail to read any response from the server, the underlying connection might get lost unexpectedly. errors from a pool? I might have missed something?

tsegismont commented 3 years ago

It's not a bug because the feature does not exist yet, see https://github.com/quarkusio/quarkus/issues/14608#issuecomment-769151568

The idle timeout we have now applies at the TCP connection level and works as expected.

It is true, though, that the documentation is incorrect, and I just pushed a change to update it.

pjgg commented 3 years ago

> It's not a bug because the feature does not exist yet (at the pool level)

This issue is about getting a connection that is already closed (it also fails on Vert.x 4 without a pool). This issue, Vertx PgPool lost connection unexpectedly, can't be the normal behavior of your DB connections.

Could you provide an example of how to use this property quarkus.datasource.reactive.idle-timeout with the PostgreSQL PgPool? Or is this combination not supported? And if it is supported, could you make this test work?

Do you know if this behavior is the same in Db2 or MySQL?

tsegismont commented 3 years ago

This property has been added in #11498 in order to solve #11149.

Until eclipse-vertx/vertx-sql-client#41 is solved, we have nothing better than the TCP connection level properties, as discussed in the original issue.

In other words, there is currently no way to fix this test.

But this is not a bug, because the limitation has been discussed when the properties were added.

And yes that applies to DB2 and MySQL too.

abesada commented 3 years ago

I am a Vert.x end user, using it in production environments. The issue mentioned has occasionally appeared in production. I am not going to evaluate the internal architecture of Vert.x, but as an end user I consider it a bug in the toolkit: it forces me to develop an external layer to validate whether the connection given by the Vert.x pool is valid, with the consequent loss of performance. The other option I have is not to set the idle timeout, but that causes the database connections to become saturated over time.
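
For illustration, a minimal sketch (not my actual code; class and method names are made up) of the kind of external validation layer described above: it borrows a connection, checks it with a cheap query before handing it out, and retries with a fresh connection if the check fails. The extra round trip per borrow is exactly the performance cost mentioned.

```java
import io.vertx.core.Future;
import io.vertx.pgclient.PgPool;
import io.vertx.sqlclient.SqlConnection;

public class ValidatingConnectionSupplier {

    private final PgPool pool;

    public ValidatingConnectionSupplier(PgPool pool) {
        this.pool = pool;
    }

    // Borrow a connection and validate it with "SELECT 1" before returning it.
    // If validation fails (e.g. the connection was closed by the idle timeout),
    // close it and try again, up to attemptsLeft times.
    public Future<SqlConnection> get(int attemptsLeft) {
        return pool.getConnection().compose(conn ->
            conn.query("SELECT 1").execute()
                .map(ignored -> conn)
                .recover(err -> {
                    conn.close();
                    return attemptsLeft > 1 ? get(attemptsLeft - 1) : Future.failedFuture(err);
                }));
    }
}
```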

tsegismont commented 3 years ago

@abesada sorry for the inconvenience.

> The other option I have is not to set the idle timeout, but that causes the database connections to become saturated over time.

This sounds like a different problem. Can you open a separate issue and explain what you see? Thank you

rsvoboda commented 3 years ago

If I'm reading this correctly, users can get a connection from the Vert.x PgPool which is already closed.

@pjgg says it's a bug, because end users should only get valid connections.

@tsegismont says it's a known limitation because some parts are missing on the Vert.x side.

"the limitation has been discussed when the properties were added" is mentioned in one of the above comments. Where should users get the information that a closed connection can be received from PgPool?

I lean towards the bug case; users should really get only valid connections.

TBH we shouldn't force people to plan for this case; Quarkus/Vert.x is about good developer experience.

abesada commented 3 years ago

> @abesada sorry for the inconvenience.
>
> > The other option I have is not to set the idle timeout, but that causes the database connections to become saturated over time.
>
> This sounds like a different problem. Can you open a separate issue and explain what you see? Thank you

This is not a bug; it is a consequence of not setting an idle timeout. Imagine a database instance in the cloud that has a limit of 100 concurrent connections. Now you create microservices around that database. If each microservice is allowed up to a maximum of 25 connections, in theory you can only deploy four microservices to ensure you don't saturate it, but if you define an idle timeout then you can run more microservices on the same resources (this is my case) without incurring more costs. The attached graph is a real example of what I am talking about. It is divided into two parts:

tsegismont commented 3 years ago

@rsvoboda we said the reconnect-attempts, reconnect-interval and idle-timeout options were TCP-level options here and here.

> At this point we are forcing users to plan for the situation when the connection is closed. Users/Customers need to add their own code to recover from this situation.

Only if they activate this property, which is a low-level TCP property, disabled by default.

> It's not clear to me how the end user finds out that he/she needs to handle this situation.

If users catch a failure, they have to retry (example code with Mutiny)
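
Something along these lines, using the Mutiny variant of the client (a sketch only, since the linked example is not inlined here; the table and query are made up):

```java
import io.smallrye.mutiny.Uni;
import io.vertx.mutiny.pgclient.PgPool;
import io.vertx.mutiny.sqlclient.Row;
import io.vertx.mutiny.sqlclient.RowSet;

public class FruitQueries {

    private final PgPool client;

    public FruitQueries(PgPool client) {
        this.client = client;
    }

    public Uni<RowSet<Row>> listFruits() {
        return client.query("SELECT id, name FROM fruits").execute()
            // If the borrowed connection was closed underneath us, retry once;
            // the retried attempt borrows a fresh connection from the pool.
            .onFailure().retry().atMost(1);
    }
}
```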

I proposed an update to the documentation in #14737. If that is not enough, we can discuss how to expand the documentation, or perhaps deprecate this property until a pool-level idle timeout is supported upstream.

tsegismont commented 3 years ago

@abesada thanks for the details. Indeed, without idle timeout support at the pool level, you can only use the TCP-level option and, as you said, the issue occasionally appears in production.

From your previous comment I wasn't sure if you had hit a bug that prevented you from using the pool altogether after some time. It seems that is not the case.

rsvoboda commented 3 years ago

@tsegismont thank you for the pointers. So the limitations are documented as comments in the GH issues, but we are missing the details in the Quarkus guide.

On the strategy I'd rather have input from @gsmet or @gavinking: A) expand the documentation and keep the implementation as is, or B) deprecate the property until a pool-level idle timeout is supported upstream.

Any idea about the plans for https://github.com/eclipse-vertx/vertx-sql-client/issues/41? On the roadmap / not in the near future / idle ...

If the decision is about enhancing the docs, https://quarkus.io/guides/reactive-sql-clients should contain the retry example and details linked from https://github.com/quarkusio/quarkus/issues/14608#issuecomment-770995090

gavinking commented 3 years ago

Well if I'm understanding all this correctly, it seems I had the same misunderstanding, and added a setting to HR that calls setIdleTimeout(), assuming that this was a "pool" timeout.

I guess I would say it's sorta pretty much critical that vert.x implements this feature in both 3 and 4 branches.

pjgg commented 3 years ago

I have created a test PR on vertx-sql-client as a scenario "showroom".

tsegismont commented 3 years ago

@rsvoboda Vert.x 4.1 will come with a new pool implementation. The refactoring PR is being reviewed. eclipse-vertx/vertx-sql-client#41 is planned for 4.1 but on hold until the refactoring PR is merged.

In the meantime, I will update #14737 to add the retry example and details about the TCP-level idle-timeout property.

@gavinking I will check with the team how we can prioritize an implementation of the pool idle timeout in the 3.9 branch too.

gavinking commented 3 years ago

Excellent, thanks.

pjgg commented 3 years ago

@tsegismont I have added a proposed pool idle timeout implementation that passes all the tests:

https://github.com/eclipse-vertx/vertx-sql-client/pull/881

rsvoboda commented 3 years ago

Reopening the issue as it got auto-closed by merging the JavaDoc update PR.

Sanne commented 3 years ago

We recently upgraded to Vert.x 4.1; could someone check whether this issue can be closed, please?

pjgg commented 3 years ago

This issue is fixed in Quarkus upstream, which brings Vert.x 4.1, and can be closed! CC @Sanne

Sanne commented 3 years ago

Awesome, many thanks @pjgg .

Sanne commented 3 years ago

@gsmet I suppose we can't retroactively tag #14608 as having fixed this issue? Could you please check if the milestones/flags need adjustments?

Gayathri2102 commented 1 year ago

I am facing this issue with Vert.x version 4.3.1. Is this expected, or should the fix in Vert.x 4.1 handle this?

tsegismont commented 1 year ago

@Gayathri2102 you may get the exception because the pool idle timeout is not set or is too high.

See https://quarkus.io/guides/reactive-sql-clients.html#pooled-connection-idle-timeout
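
In Quarkus that maps to the quarkus.datasource.reactive.idle-timeout property described in the linked guide. With the plain Vert.x client, a rough sketch of the pool-level option (values are arbitrary):

```java
import io.vertx.sqlclient.PoolOptions;
import java.util.concurrent.TimeUnit;

PoolOptions poolOptions = new PoolOptions()
    .setMaxSize(20)
    // Connections sitting idle in the pool longer than this are closed
    // and evicted instead of being handed out again.
    .setIdleTimeout(60)
    .setIdleTimeoutUnit(TimeUnit.SECONDS);
```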

lasteris commented 9 months ago

I want to leave a comment here; maybe it will help somebody in the future who looks here. Our team hit the same issue with a Grafana k6 stress test (1000 virtual users) + HAProxy maxconn (200) lower than the configured client pool max size (800).