Closed xin-tsla closed 2 years ago
Hi @vmihailenco @elliotcourant, when you have time, could you take a look at the PR?
Sometimes, case <-ctx.Done():
receives later than case <-fnDone:
case <-fnDone: // fn has finished, skip cancel
case <-ctx.Done():
So even though ctx.Done() was already closed, the branch
case fnDone <- struct{}{}: // signal fn finish, skip cancel goroutine
wins the race.
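To make the race concrete, here is a minimal runnable sketch of the pattern; runWithCancel, fn, and sendCancel are simplified placeholders for illustration, not the library's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithCancel is a simplified stand-in for the cancel-goroutine pattern.
func runWithCancel(ctx context.Context, fn func() error, sendCancel func()) error {
	fnDone := make(chan struct{})
	go func() {
		select {
		case <-fnDone: // fn has finished, skip cancel
		case <-ctx.Done():
			sendCancel() // ask the server to cancel the running query
		}
	}()

	err := fn() // may fail with a conn read/write "i/o timeout" on its own

	select {
	case fnDone <- struct{}{}: // signal fn finish, skip cancel goroutine
	case <-ctx.Done():
	}
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	err := runWithCancel(ctx,
		func() error {
			// The "query" fails at roughly the same moment the context
			// expires, e.g. because the conn's own deadline fired.
			time.Sleep(50 * time.Millisecond)
			return fmt.Errorf("read tcp [::1]:63295->[::1]:5432: i/o timeout")
		},
		func() { fmt.Println("cancel request sent") },
	)
	// Even though ctx is done by now, the fnDone send can win the race in
	// the goroutine's select, so "cancel request sent" may never print.
	fmt.Println("query error:", err)
}
```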
Also, I found that sometimes some queries return
read tcp [::1]:63295->[::1]:5432: i/o timeout
which happens before ctx.Done() fires, so the cancel request never gets a chance to be sent to the db.
I think the root cause is that the read/write timeout on each conn
can sometimes fire before ctx.Done() returns a struct{}{}.
In the normal case, we expect ctx.Done() to fire before the connection's timeout, so under the current logic a cancel request is sent.
But in some cases the connection's timeout happens first, before ctx.Done(); under the current logic the connection is then removed without a cancel request being sent.
In this PR, we add code to send a cancel request when a connection times out due to the read/write timeout settings.
I will add a test to cover this case.
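As a rough sketch of the intent (not the actual diff), the conn-level timeout can be recognised from the error and used to decide that a cancel request should also be sent before the conn is dropped; shouldSendCancel here is a hypothetical helper name:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

// shouldSendCancel shows how a conn read/write timeout can be recognised:
// such errors satisfy net.Error with Timeout() == true, e.g.
// "read tcp ...: i/o timeout".
func shouldSendCancel(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	// os.ErrDeadlineExceeded is what a conn read returns after its deadline expires.
	fmt.Println(shouldSendCancel(os.ErrDeadlineExceeded)) // true: also send a cancel request
	fmt.Println(shouldSendCancel(errors.New("EOF")))      // false: nothing to cancel here
}
```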
I can take a look this week, sorry for the slow response. Even so, you might be hard pressed to get this merged, as this library has pretty much stopped. It is mostly stable and still being used in production, but I don't know if Vladimir will merge anything else into it.
@elliotcourant Thanks for taking a look!
Code-wise this looks fine to me; the mock seems to stray a bit from the patterns I've seen elsewhere in this repo, though (but no need to change it).
I'm going to look through the code itself locally more tonight.
Digging into it more:
cancelRequest seems to work just fine. I would maybe note in the log message, though, that this cancelRequest function can fail, and that is not necessarily indicative of the problem but probably just a symptom of it. Like if the Postgres db crashes, we might end up in a "bad conn" state due to something like a timeout, which would mean that the cancelRequest function itself will also fail. Maybe the thing to do would be to also log the error that is being passed to releaseConn as the initial cause? Otherwise I could see people being misled by the log output of cancelRequest.
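Roughly something along these lines, as a sketch only (these helper names are placeholders, not the repo's actual functions):

```go
package main

import (
	"errors"
	"log"
)

// Hypothetical stand-ins, just to illustrate the logging idea.
func cancelRequest() error    { return errors.New("EOF") } // can itself fail on a bad conn
func releaseConn(cause error) { log.Printf("releasing conn: %v", cause) }

func main() {
	// The error that originally caused the conn to be released.
	originalErr := errors.New("read tcp [::1]:63295->[::1]:5432: i/o timeout")

	// Log the cancelRequest failure together with the initial cause, so a
	// failed cancel reads as a symptom rather than the root problem.
	if cancelErr := cancelRequest(); cancelErr != nil {
		log.Printf("cancelRequest failed: %v (conn released because: %v)", cancelErr, originalErr)
	}
	releaseConn(originalErr)
}
```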
Hopefully that makes sense. Please let me know if it does not.
@elliotcourant, thanks for taking a look! I agree that the cancel log may be confusing if the db crashes. The issue we observed was that the db was running but the long-running queries did not get cancelled. In go-pg, when a query is executed, two deadlines are set up for it: one from the context, and the other from the conn object in the connection pool.
The following time-series diagram shows that if ctx.Done()
(green box) closes earlier than the connection's deadline (red box), the current library sends a cancel request and removes the connection from the connection pool.
The following time-series diagram shows that if the connection's deadline (red box) fires earlier than ctx.Done()
(green box), the library removes the connection from the connection pool but does not send a cancel request to the db.
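For concreteness, here is a small runnable sketch of that second scenario, using a plain net.Pipe in place of the real Postgres conn (assuming the conn deadline comes from the pool's ReadTimeout/WriteTimeout options):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Deadline 1: the context deadline, which the cancel goroutine watches.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Deadline 2: the connection's own read deadline (in go-pg this is
	// derived from the ReadTimeout/WriteTimeout options).
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	_ = client.SetReadDeadline(time.Now().Add(100 * time.Millisecond))

	// Here deadline 2 fires first: the read fails with an i/o timeout while
	// the context is still alive, so no ctx-driven cancel request is sent.
	buf := make([]byte, 1)
	_, err := client.Read(buf)
	fmt.Println("read error:", err, "| ctx already done?", ctx.Err() != nil)
}
```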
@elliotcourant , thanks for reviewing!
For the isBadConn method, I am thinking of sending a cancel only for 57014. Currently isBadConn returns true for both 25P02 and 57014.
25P02 means the transaction is aborted, right? Which means the database is not spinning on anything anyway, so there is nothing to cancel and the conn should just be thrown out?
@elliotcourant, yes, 25P02 is an aborted transaction, so I think the database will not be spinning on the query anyway. There is no need to send a cancel for 25P02. As with the previous behavior, a connection with an aborted transaction will be thrown out.
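A sketch of the distinction (pgError and shouldCancel are illustrative names, not the PR's actual types):

```go
package main

import "fmt"

// pgError is a stand-in for the driver's error type; real code would read
// the SQLSTATE code from the Postgres error.
type pgError struct{ code string }

// shouldCancel reflects the behaviour discussed above: send a cancel request
// only for 57014 (query_canceled); for 25P02 (in_failed_sql_transaction) the
// transaction is already aborted, so the conn is simply thrown out.
func shouldCancel(err pgError) bool {
	return err.code == "57014"
}

func main() {
	fmt.Println(shouldCancel(pgError{code: "57014"})) // true: send cancel, then drop the conn
	fmt.Println(shouldCancel(pgError{code: "25P02"})) // false: just drop the conn
}
```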
@elliotcourant, I just changed the code a little bit to send a cancel only for 57014; I think that should fix the test failure?
@elliotcourant, wondering how I could trigger a CI test pipeline run?
Just triggered one, this looks good to me.
@elliotcourant , thanks for reviewing!
This looks good to me, I want to talk to @vmihailenco about merging though.
@elliotcourant , thanks so much!
Thanks @vmihailenco @elliotcourant @CyborgMaster !
Hi @vmihailenco @elliotcourant, wondering whether we will release a new tagged version for the change?
I will tag a release as soon as I return from a business trip this weekend, sorry for the delay!
Thanks so much! @elliotcourant
@xin-tsla Release has been tagged!
@elliotcourant , thanks so much! Much appreciate it!
Hi All,
Update: The issue this PR tries to solve is that we observed some long-running queries that did not get cancelled.
~Changes in the PR:~
~1. Fix a bug for sending a cancel request to terminate a long-running query.~
~2. Update the var name RetryStatementTimeout to RetryCancelledQuery.~
~Reference: https://www.postgresql.org/docs/current/errcodes-appendix.html~
~In the table, 57014 is query_canceled.~
~During a local experiment, a query that is cancelled by sending a cancel request also returns 57014.~