jchampio opened this issue 3 years ago
Hi @jchampio; thanks for your analysis and repro steps! Could you also work on a PR to fix it? Note that I will be away on vacation next week, so I will only be able to review when I'm back, sometime in November.
Hi @DimCitus, this week has been crazy and I haven't gotten a chance to look at it yet. I think the main question I have is what the strategy should be when the monitor notices that a previous version is installed -- do we issue backwards-compatible queries, or do we bail out and wait for the extension upgrade?
I'm not sure how we would go about backward compatibility; it seems to me that we would end up with an `AutoFailoverNode *pgAutoFailoverNode` that is only partially filled in, and at the moment the code doesn't know how to handle that. It looks quite complex to make this happen, for a dubious benefit.
I think bailing out and waiting for the extension upgrade is the better path forward.
If all 3 of our nodes are in a bad state, how can we identify which one should become the primary? In this case the Postgres service is not running on any of the 3 nodes.
(This shares similarities with #810, but I can't tell whether they have the same root cause.)
We're seeing intermittent crashes when upgrading the monitor from 1.4.2 to 1.6.2:
Repro
It looks like there's a race condition in one of the background workers, so reproduction is tricky, but I've been able to reliably reproduce this stack trace with the following GUC settings on the monitor:
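(Illustrative sketch: the GUC names below are the monitor's health-check settings, but the values are placeholders, lowered aggressively so that the health checker reacts right away.)

```
# deliberately aggressive values so the health checker reacts quickly
pgautofailover.health_check_period = 1000
pgautofailover.health_check_timeout = 1000
pgautofailover.health_check_retry_delay = 500
pgautofailover.health_check_max_retries = 1
pgautofailover.node_considered_unhealthy_timeout = 2000
```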
With the above settings for the monitor, do the following:
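(An illustrative sketch of those steps; paths, ports, and the package upgrade step are placeholders.)

```sh
# 1. With pg_auto_failover 1.4.2 installed, create a monitor, a primary, and a
#    secondary, and wait for them to reach the primary/secondary states.
pg_autoctl create monitor --pgdata ./monitor --pgport 5000 --auth trust --run &
pg_autoctl create postgres --pgdata ./node1 --pgport 5001 --auth trust \
    --monitor 'postgres://autoctl_node@localhost:5000/pg_auto_failover' --run &
pg_autoctl create postgres --pgdata ./node2 --pgport 5002 --auth trust \
    --monitor 'postgres://autoctl_node@localhost:5000/pg_auto_failover' --run &

# 2. Stop the secondary so that its health state is about to change.
pg_autoctl stop --pgdata ./node2

# 3. Upgrade the pg_auto_failover packages to 1.6.2 and restart the monitor,
#    before ALTER EXTENSION pgautofailover UPDATE has been run.
pg_autoctl stop --pgdata ./monitor
# ... install the 1.6.2 packages here ...
pg_autoctl run --pgdata ./monitor &
```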
As soon as the monitor notices the bad secondary, I get the assertion failure in the server logs.
RCA(?)
Best I can tell, `SetNodeHealthState()` is to blame. In this case, the SPI query performs a `RETURNING node.*` that returns a row from the 1.4 version of pg_auto_failover, and that row has only 19 columns. Since `(healthState != previousHealthState)` (because we stopped one of the secondaries during the upgrade), we call `TupleToAutoFailoverNode()`, which expects a 1.6 tuple (with 21 columns). That code proceeds to walk off the end of the row and crash.

If the monitor is able to update to 1.6 before a node's health state changes, we win the race and do not see the crash, which is why having those GUC values pulled down helps with reproduction.
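For reference, a minimal sketch of the "bail out and wait for the extension upgrade" approach might look like the following. The wrapper name, the expected-column constant, and the exact signature of `TupleToAutoFailoverNode()` are assumptions used for illustration; this is not the actual monitor code.

```c
#include "postgres.h"

#include "access/htup.h"
#include "access/tupdesc.h"

#include "node_metadata.h"		/* AutoFailoverNode, TupleToAutoFailoverNode */

/* columns in pgautofailover.node as of extension 1.6 (per the analysis above) */
#define NODE_NATTS_EXPECTED 21

/*
 * Hypothetical wrapper: refuse to decode a node tuple that comes from an
 * older extension version instead of walking off the end of it.
 */
static AutoFailoverNode *
TupleToAutoFailoverNodeChecked(TupleDesc tupleDescriptor, HeapTuple heapTuple)
{
	if (tupleDescriptor->natts < NODE_NATTS_EXPECTED)
	{
		ereport(ERROR,
				(errmsg("pgautofailover.node has %d columns, the monitor "
						"expects at least %d",
						tupleDescriptor->natts, NODE_NATTS_EXPECTED),
				 errhint("Run ALTER EXTENSION pgautofailover UPDATE; "
						 "to finish the monitor upgrade.")));
	}

	return TupleToAutoFailoverNode(tupleDescriptor, heapTuple);
}
```

Callers such as `SetNodeHealthState()` would then error out cleanly (and retry on the next health check) until the extension upgrade has been applied, rather than tripping the assertion.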