neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.57k stars 423 forks source link

Add tests for xid, multixid and oid wraparound #258

Open lubennikovaav opened 3 years ago

lubennikovaav commented 3 years ago

Branch nonrel_wal_test_xid_wraparound contains a test, that illustrates the problem with xid wraparound. It fails on main too (with different symptoms), but I suggest concentrating our efforts on nonrel_wal.

In a current shape, the test is too hacky to propose as PR. I'll try to straighten it up. If you have any ideas on how to make it better, I'm all ears.

See also a discussion in hackers about similar testing scenario: https://www.postgresql.org/message-id/flat/CAP4vRV5gEHFLB7NwOE6_dyHAeVfkvqF8Z_g5GaCQZNgBAE0Frw%40mail.gmail.com#e10861372aec22119b66756ecbac581c

hlinnaka commented 3 years ago

Branch nonrel_wal_test_xid_wraparound contains a test, that illustrates the problem with xid wraparound. It fails on main too (with different symptoms), but I suggest concentrating our efforts on nonrel_wal.

That's strange. On 'main', the page server doesn't know anything about XIDs or wraparound, so how come it's broken? I'd like to at least understand what the problem is, before we dismiss it.

In a current shape, the test is too hacky to propose as PR. I'll try to straighten it up. If you have any ideas on how to make it better, I'm all ears.

If we had a feature to import an arbitrary PostgreSQL data directory into the repository, this test could use it. I think we'll need that sooner or later anyway, so maybe implement it now.

knizhnik commented 3 years ago

I doesn't understand what this test is demonstrated. First of all there is no any xid wraparound. You need to execute 2^31 transactions to cause XID wraparound.

This test is failed in the following assertion:

        # ensure that oldest_xid advanced
>       assert int(oldest_xid) < int(oldest_xid1)
E       AssertionError: assert 727 < 727
E        +  where 727 = int('727')
E        +  and   727 = int('727')

But 50000 transactions performed by pgbench in this test seems to be not enough to advance oldestXid in checkpoint, taken in account that autovacuum_freeze_max_age=100000 (twice more). If we perform 500000 transactions (ten times more), then this check passed.

The next failure is:

        # ensure that we restored fields correctly
        assert int(xid1) <= int(xid2)
>       assert int(oldest_xid1) == int(oldest_xid2)
E       AssertionError: assert 403881 == 727
E        +  where 403881 = int('403881')
E        +  and   727 = int('727')

We update checkpoint.oldestXid on CLOG_TRUNCATE records, but it was not performed.

knizhnik commented 3 years ago

With commit c564272 this test is passed at nonrel_wal branch with number of pgbench transactions increased till 500000. But it takes about half an hour time to compete.

hlinnaka commented 3 years ago

With commit c564272 this test is passed at nonrel_wal branch with number of pgbench transactions increased till 500000. But it takes about half an hour time to compete.

Ah yeah, consuming enough transactions to trigger XID wraparound can take a while... What I've done in the past is to write a C function specifically for testing that advances ShmemVariableCache->nextXid directly. the ExtendCLOG(), ExtendCommitTs() and ExtendSUBTRANS() expect to be called for every XID, because they need to extend the slru files whenever you cross a page boundary, so one safe way to do that is to skip over XIDs that are "not interesting" to those Extend*() functions, and call GetNewTransactionId() as usual for others. Something like this:

#define BORING_XID_INTERVAL 1024
    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
    nextXid = ShmemVariableCache->nextXid;
    remainder = nextXid % BORING_XID_INTERVAL;
    if (remainder >= FirstNormalTransactionId &&
        remainder < BORING_XID_INTERVAL - 1 &&
        !TransactionIdFollowsOrEquals(nextXid, ShmemVariableCache->xidVacLimit))
    {
        /* bump nextXid to just before the next non-boring XID */ 
        nextXid += BORING_XID_INTERVAL - remainder - 1;
        LWLockRelease(XidGenLock);
    }
    else
    {
        LWLockRelease(XidGenLock);
        /* call GetNewTransactionId() or something to get a new XID the usual way */
    }

Andres Freund and @lubennikovaav had some discussion on pgsql-hackers on adding a standard way to test XID wraparound to upstream: https://www.postgresql.org/message-id/CAP4vRV5gEHFLB7NwOE6_dyHAeVfkvqF8Z_g5GaCQZNgBAE0Frw%40mail.gmail.com. That would be very handy. We had a function like that in GPDB as well (https://github.com/greenplum-db/gpdb/blob/master/src/test/regress/regress_gp.c#L1955), but it would be good to get it into upstream, Postgres could use some tests around XID wraparound too.

hlinnaka commented 3 years ago

https://github.com/zenithdb/zenith/pull/351 comments out the code to handle CLOG and multixid offsets wraparound. Just by looking at the code, those looked wrong to me.

stepashka commented 2 years ago

the test is not part of any CI ATM because it takes a lot of time to run this test

stepashka commented 2 years ago

the next step is to commit this :)