Closed xin-tsla closed 2 years ago
Hi @vmihailenco @elliotcourant, when you have time, could you take a look at the PR?
Sometimes, case <-ctx.Done():
receives later than case <-fnDone:
case <-fnDone: // fn has finished, skip cancel
case <-ctx.Done():
So even though ctx.Done() was already closed, the branch
case fnDone <- struct{}{}: // signal fn finish, skip cancel goroutine
wins the race.
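To make the race concrete, here is a minimal runnable sketch of the pattern; runWithCancel, fn, and sendCancel are simplified placeholders for illustration, not the library's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runWithCancel is a simplified stand-in for the cancel-goroutine pattern.
func runWithCancel(ctx context.Context, fn func() error, sendCancel func()) error {
	fnDone := make(chan struct{})
	go func() {
		select {
		case <-fnDone: // fn has finished, skip cancel
		case <-ctx.Done():
			sendCancel() // ask the server to cancel the running query
		}
	}()

	err := fn() // may fail with a conn read/write "i/o timeout" on its own

	select {
	case fnDone <- struct{}{}: // signal fn finish, skip cancel goroutine
	case <-ctx.Done():
	}
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	err := runWithCancel(ctx,
		func() error {
			// The "query" fails at roughly the same moment the context
			// expires, e.g. because the conn's own deadline fired.
			time.Sleep(50 * time.Millisecond)
			return fmt.Errorf("read tcp [::1]:63295->[::1]:5432: i/o timeout")
		},
		func() { fmt.Println("cancel request sent") },
	)
	// Even though ctx is done by now, the fnDone send can win the race in
	// the goroutine's select, so "cancel request sent" may never print.
	fmt.Println("query error:", err)
}
```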
Also, I found that sometimes some queries return
read tcp [::1]:63295->[::1]:5432: i/o timeout
which happens before ctx.Done() fires, so the cancel request never gets a chance to be sent to the db.
I think the root cause is that the read/write timeout on each conn
can sometimes fire before ctx.Done() returns a struct{}{}.
In the normal case, we expect ctx.Done() to fire before the connection's timeout, so under the current logic a cancel request is sent.
But in some cases the connection's timeout happens first, before ctx.Done(); under the current logic the connection is then removed without a cancel request being sent.
In this PR, we add code to send a cancel request when a connection times out due to the read/write timeout settings.
I will add a test to cover this case.
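As a rough sketch of the intent (not the actual diff), the conn-level timeout can be recognised from the error and used to decide that a cancel request should also be sent before the conn is dropped; shouldSendCancel here is a hypothetical helper name:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

// shouldSendCancel shows how a conn read/write timeout can be recognised:
// such errors satisfy net.Error with Timeout() == true, e.g.
// "read tcp ...: i/o timeout".
func shouldSendCancel(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	// os.ErrDeadlineExceeded is what a conn read returns after its deadline expires.
	fmt.Println(shouldSendCancel(os.ErrDeadlineExceeded)) // true: also send a cancel request
	fmt.Println(shouldSendCancel(errors.New("EOF")))      // false: nothing to cancel here
}
```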
I can take a look this week, sorry for the slow response. Even so, you might be hard pressed to get this merged, as this library has pretty much stopped. It is mostly stable and still being used in production, but I don't know if Vladimir will merge anything else into it.
@elliotcourant Thanks for taking a look!
Code-wise this looks fine to me; the mock seems to stray a bit from the patterns I've seen elsewhere in this repo, though (but no need to change it).
I'm going to look through the code itself locally more tonight.
Digging into it more:
cancelRequest seems to work just fine. I would maybe note in the log message, though, that this cancelRequest function can fail, and that is not necessarily indicative of the problem but probably just a symptom of it. Like if the Postgres db crashes, we might end up in a "bad conn" state due to something like a timeout, which would mean that the cancelRequest function itself will also fail. Maybe the thing to do would be to also log the error that is being passed to releaseConn as the initial cause? Otherwise I could see people being misled by the log output of cancelRequest.
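Roughly something along these lines, as a sketch only (these helper names are placeholders, not the repo's actual functions):

```go
package main

import (
	"errors"
	"log"
)

// Hypothetical stand-ins, just to illustrate the logging idea.
func cancelRequest() error    { return errors.New("EOF") } // can itself fail on a bad conn
func releaseConn(cause error) { log.Printf("releasing conn: %v", cause) }

func main() {
	// The error that originally caused the conn to be released.
	originalErr := errors.New("read tcp [::1]:63295->[::1]:5432: i/o timeout")

	// Log the cancelRequest failure together with the initial cause, so a
	// failed cancel reads as a symptom rather than the root problem.
	if cancelErr := cancelRequest(); cancelErr != nil {
		log.Printf("cancelRequest failed: %v (conn released because: %v)", cancelErr, originalErr)
	}
	releaseConn(originalErr)
}
```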
Hopefully that makes sense. Please let me know if it does not.
@elliotcourant, thanks for taking a look! I agree that the cancel log may be confusing if the db crashes. The issue we observed was that the db was running but the long-running queries did not get cancelled. In go-pg, when a query is executed, two deadlines are set up for it: one from the context, and the other from the conn object in the connection pool.
The following time-series diagram shows that if ctx.Done()
(green box) closes earlier than the connection's deadline (red box), the current library sends a cancel request and removes the connection from the connection pool.
The following time-series diagram shows that if the connection's deadline (red box) fires earlier than ctx.Done()
(green box), the library removes the connection from the connection pool but does not send a cancel request to the db.
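For concreteness, here is a small runnable sketch of that second scenario, using a plain net.Pipe in place of the real Postgres conn (assuming the conn deadline comes from the pool's ReadTimeout/WriteTimeout options):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Deadline 1: the context deadline, which the cancel goroutine watches.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Deadline 2: the connection's own read deadline (in go-pg this is
	// derived from the ReadTimeout/WriteTimeout options).
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	_ = client.SetReadDeadline(time.Now().Add(100 * time.Millisecond))

	// Here deadline 2 fires first: the read fails with an i/o timeout while
	// the context is still alive, so no ctx-driven cancel request is sent.
	buf := make([]byte, 1)
	_, err := client.Read(buf)
	fmt.Println("read error:", err, "| ctx already done?", ctx.Err() != nil)
}
```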
@elliotcourant , thanks for reviewing!
For the isBadConn method, I am thinking of sending a cancel only for 57014. Currently isBadConn returns true for both 25P02 and 57014.
25P02 means the transaction is aborted, right? Which means the database is not spinning on anything anyway, so there is nothing to cancel and the conn should just be thrown out?
@elliotcourant, yes, 25P02 is an aborted transaction, so I think the database will not be spinning on the query anyway. There is no need to send a cancel for 25P02. As with the previous behavior, a connection with an aborted transaction will be thrown out.
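A sketch of the distinction (pgError and shouldCancel are illustrative names, not the PR's actual types):

```go
package main

import "fmt"

// pgError is a stand-in for the driver's error type; real code would read
// the SQLSTATE code from the Postgres error.
type pgError struct{ code string }

// shouldCancel reflects the behaviour discussed above: send a cancel request
// only for 57014 (query_canceled); for 25P02 (in_failed_sql_transaction) the
// transaction is already aborted, so the conn is simply thrown out.
func shouldCancel(err pgError) bool {
	return err.code == "57014"
}

func main() {
	fmt.Println(shouldCancel(pgError{code: "57014"})) // true: send cancel, then drop the conn
	fmt.Println(shouldCancel(pgError{code: "25P02"})) // false: just drop the conn
}
```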
@elliotcourant, I just changed the code a little bit to send a cancel only for 57014; I think that should fix the test failure?
@elliotcourant, wondering how I could trigger a CI test pipeline run?
Just triggered one, this looks good to me.
@elliotcourant , thanks for reviewing!
This looks good to me, I want to talk to @vmihailenco about merging though.
@elliotcourant , thanks so much!
Thanks @vmihailenco @elliotcourant @CyborgMaster !
Hi @vmihailenco @elliotcourant, wondering whether we will release a new tagged version for the change?
I will tag a release as soon as I return from a business trip this weekend, sorry for the delay!
Thanks so much! @elliotcourant
@xin-tsla Release has been tagged!
@elliotcourant , thanks so much! Much appreciate it!
Hi All,
Update: The issue this PR tries to solve is that we observed some long-running queries that did not get cancelled.
~Changes in the PR:~
~1. Fix a bug for sending a cancel request to terminate a long-running query.~
~2. Update the var name RetryStatementTimeout to RetryCancelledQuery.~
~Reference: https://www.postgresql.org/docs/current/errcodes-appendix.html~
~In the table, 57014 is query_canceled.~
~During a local experiment, a query that is cancelled by sending a cancel request also returns 57014.~