hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Monitor crashes immediately post-upgrade (1.4.2->1.6.2) #821

Open · jchampio opened this issue 2 years ago

jchampio commented 2 years ago

(This shares similarities with #810, but I can't tell whether they have the same root cause.)

We're seeing intermittent crashes when upgrading the monitor from 1.4.2 to 1.6.2:

TRAP: FailedAssertion("attnum <= tupleDesc->natts", File: "heaptuple.c", Line: 89)
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (ExceptionalCondition+0x80)[0x5624881373f5]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (getmissingattr+0x85)[0x562487c77c23]
/usr/local/pgsql/lib/pgautofailover.so(TupleToAutoFailoverNode+0x1c50)[0x7f5d17a2a346]
/usr/local/pgsql/lib/pgautofailover.so(SetNodeHealthState+0x1f5)[0x7f5d17a34308]
/usr/local/pgsql/lib/pgautofailover.so(+0xfeb0)[0x7f5d17a2feb0]
/usr/local/pgsql/lib/pgautofailover.so(+0x1034e)[0x7f5d17a3034e]
/usr/local/pgsql/lib/pgautofailover.so(HealthCheckWorkerMain+0x17d)[0x7f5d17a307ff]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (StartBackgroundWorker+0x245)[0x562487f70904]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d1748)[0x562487f7f748]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d18c4)[0x562487f7f8c4]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d20eb)[0x562487f800eb]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f5d1923d3c0]
/lib/x86_64-linux-gnu/libc.so.6(sigprocmask+0x4b)[0x7f5d18b0e48b]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (+0x3d2a94)[0x562487f80a94]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (PostmasterMain+0x10f7)[0x562487f81dc7]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (main+0x210)[0x562487ecad9f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5d18aef0b3]
postgres: pg_auto_failover monitor: pg_auto_failover monitor healthcheck worker pg_auto_failover (_start+0x2e)[0x562487c6c14e]

Repro

It looks like there's a race condition in one of the background workers, so reproduction is tricky, but I've been able to reliably reproduce this stack trace with the following GUC settings on the monitor:

pgautofailover.health_check_period = 1000
pgautofailover.health_check_timeout = 100
pgautofailover.health_check_max_retries = 1
pgautofailover.health_check_retry_delay = 1
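
These are custom GUCs defined by the monitor extension. For context, here is a minimal sketch of how an integer setting like pgautofailover.health_check_period is typically registered in a Postgres extension; the variable name, default, bounds, and GUC context below are illustrative, not the extension's actual definitions.

/*
 * Illustrative sketch only: registering a custom integer GUC from an
 * extension's _PG_init(). Default value, bounds, and context are assumptions.
 */
#include "postgres.h"
#include <limits.h>
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

static int HealthCheckPeriodMs = 20000;

void _PG_init(void);

void
_PG_init(void)
{
    DefineCustomIntVariable("pgautofailover.health_check_period",
                            "Period between two health checks, in milliseconds",
                            NULL,                 /* no long description */
                            &HealthCheckPeriodMs,
                            20000,                /* illustrative default */
                            1, INT_MAX,           /* illustrative bounds */
                            PGC_SIGHUP,           /* illustrative context */
                            GUC_UNIT_MS,
                            NULL, NULL, NULL);    /* no check/assign/show hooks */
}

The repro settings above simply override these GUCs in the monitor's postgresql.conf so that health checks run, and time out, much more aggressively.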

With the above settings for the monitor, do the following:

1. Upgrade the monitor from 1.4.2 to 1.6.2, so that the new pgautofailover.so is loaded but the extension itself has not yet been updated.
2. Stop one of the secondaries during the upgrade, so that its health state changes.

As soon as the monitor notices the bad secondary, I get the assertion failure in the server logs.

RCA(?)

Best I can tell, SetNodeHealthState() is to blame:

void
SetNodeHealthState(int64 nodeId,
                   char *nodeName,
                   char *nodeHost,
                   uint16 nodePort,
                   int previousHealthState,
                   int healthState)
{
    ...

    if (HaMonitorHasBeenLoaded())
    {
        initStringInfo(&query);
        appendStringInfo(&query,
                         "UPDATE " AUTO_FAILOVER_NODE_TABLE
                         "   SET health = %d, healthchecktime = now() "
                         " WHERE nodeid = %lld "
                         "   AND nodehost = %s AND nodeport = %d "
                         " RETURNING node.*",
                         healthState,
                         (long long) nodeId,
                         quote_literal_cstr(nodeHost),
                         nodePort);

        pgstat_report_activity(STATE_RUNNING, query.data);

        spiStatus = SPI_execute(query.data, false, 0);
        Assert(spiStatus == SPI_OK_UPDATE_RETURNING);

        /*
         * We should have 0 or 1 row impacted, because of pkey on nodeid. We
         * might have updated zero rows when a node is concurrently being
         * DELETEd, because of the default REPEATABLE READ isolation level.
         */
        if (SPI_processed == 1)
        {
            if (healthState != previousHealthState)
            {
                HeapTuple heapTuple = SPI_tuptable->vals[0];
                AutoFailoverNode *pgAutoFailoverNode =
                    TupleToAutoFailoverNode(SPI_tuptable->tupdesc, heapTuple);

In this case, the SPI query's RETURNING node.* returns a row from the 1.4 version of the pg_auto_failover schema, which has only 19 columns. Since healthState != previousHealthState (because we stopped one of the secondaries during the upgrade), we call TupleToAutoFailoverNode(), which expects a 1.6 tuple with 21 columns. That code walks off the end of the row and crashes.

If the monitor manages to update the extension to 1.6 before a node's health state changes, we win the race and there is no crash, which is why lowering those GUC values helps with reproduction.
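
One possible defensive fix, sketched here only as an illustration (this is not the project's actual patch): before decoding the RETURNING tuple, compare its attribute count with the number of columns this version of the .so expects, and skip the event handling while the extension schema is still on the old version. Natts_pgautofailover_node below stands for whatever constant the extension uses for that expected column count.

        /*
         * Sketch of a guard replacing the SPI_processed == 1 branch shown
         * above; Natts_pgautofailover_node is assumed to be the column count
         * this .so was compiled against.
         */
        if (SPI_processed == 1)
        {
            HeapTuple heapTuple = SPI_tuptable->vals[0];
            TupleDesc tupleDesc = SPI_tuptable->tupdesc;

            if (tupleDesc->natts < Natts_pgautofailover_node)
            {
                /*
                 * The RETURNING row comes from an older extension schema
                 * (19 columns instead of 21): decoding it would walk off the
                 * end of the tuple, so skip until the extension is upgraded.
                 */
                ereport(LOG,
                        (errmsg("pgautofailover extension schema is not "
                                "up-to-date yet, skipping health check event")));
            }
            else if (healthState != previousHealthState)
            {
                AutoFailoverNode *pgAutoFailoverNode =
                    TupleToAutoFailoverNode(tupleDesc, heapTuple);

                /* ... existing event and notification handling ... */
            }
        }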

DimCitus commented 2 years ago

Hi @jchampio; thanks for your analysis and repro steps! Could you also work on a PR to fix it? Note that I will be away on vacation next week, so I will only be able to review when I'm back, sometime in November.

jchampio commented 2 years ago

Hi @DimCitus, this week has been crazy and I haven't gotten a chance to look at it yet. I think the main question I have is what the strategy should be when the monitor notices that a previous version is installed -- do we issue backwards-compatible queries, or do we bail out and wait for the extension upgrade?

DimCitus commented 2 years ago

I'm not sure how we would handle backward compatibility; it seems to me that we would end up with a partially filled AutoFailoverNode *pgAutoFailoverNode, and at the moment the code doesn't account for that. It looks quite complex to make this happen, for a dubious benefit.

I think bail-out and wait for the extension upgrade is the better path forward.
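
A sketch of what that bail-out check could look like; the helper name and the PGAUTOFAILOVER_EXTENSION_VERSION string constant are assumptions for illustration, not the project's actual code. The idea is to compare the installed extension version in the catalog with the version the loaded .so was built for, and have the health check worker return early on a mismatch.

#include "postgres.h"
#include "executor/spi.h"

/*
 * Illustrative helper: returns true when the installed pgautofailover
 * extension matches the version this .so was compiled against. Assumes the
 * caller has already called SPI_connect() inside a transaction, as the
 * health check worker does before running its queries, and that
 * PGAUTOFAILOVER_EXTENSION_VERSION is a string literal such as "1.6".
 */
static bool
ExtensionVersionIsCurrent(void)
{
    bool isCurrent = false;

    int spiStatus = SPI_execute(
        "SELECT extversion = '" PGAUTOFAILOVER_EXTENSION_VERSION "' "
        "  FROM pg_catalog.pg_extension WHERE extname = 'pgautofailover'",
        true, 1);

    if (spiStatus == SPI_OK_SELECT && SPI_processed == 1)
    {
        bool isNull = false;
        Datum matches = SPI_getbinval(SPI_tuptable->vals[0],
                                      SPI_tuptable->tupdesc, 1, &isNull);

        isCurrent = !isNull && DatumGetBool(matches);
    }

    return isCurrent;
}

SetNodeHealthState() (or the worker's main loop) could then skip the UPDATE ... RETURNING handling whenever this returns false, instead of decoding tuples from an older schema.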

Rajeevshar commented 3 months ago

If all 3 of our nodes go into a bad state, how can we identify which one to promote as primary, given that the Postgres service is not running on any of the 3 nodes?