Closed: LizardWizzard closed this issue 8 months ago.
It is supposedly a race in the test, which should be fixed by #558.
Noticed it twice today since https://github.com/neondatabase/neon/pull/1872#issuecomment-1145381010.
We've revamped the SK <-> PS WAL streaming and made the PS connect to the SK more eagerly than before with callmemaybe, which might be very related.
Looks like the problem is that RecordTransactionAbort(bool isSubXact) does not perform XLogFlush:
* We do not flush XLOG to disk here, since the default assumption after a
* crash would be that we aborted, anyway. For the same reason, we don't
* need to worry about interlocking against checkpoint start.
So it is really possible that the last aborted transactions will never reach the pageserver. I do not think that it will cause any real problems, except rare test failures because of the pg_xact file mismatch. I see the following ways of addressing this problem:
1. Somehow force Postgres to flush WAL on a termination request. It is not only necessary to call XLogFlush, but also to wait until the walsender is able to propagate these changes.
It is not good to introduce changes in the Postgres core just to make our tests pass, so I do not like 1. But disabling or complicating this check is also not an exciting proposal. Any better suggestions?
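To make the failure mode concrete, here is a minimal sketch of the scenario at the SQL level, assuming a psycopg2 connection to the compute and an existing table `t` (both purely illustrative):

```python
# Sketch of the race: ROLLBACK writes an abort record to WAL, but because
# RecordTransactionAbort() does not call XLogFlush(), that record can still
# sit in WAL buffers when the compute is terminated, so the pageserver may
# never learn about the abort.
import psycopg2

conn = psycopg2.connect("dbname=postgres")  # connection string is illustrative
cur = conn.cursor()
cur.execute("INSERT INTO t VALUES (1)")     # assumes a table t exists
conn.rollback()                             # abort record written, but not flushed
conn.close()
# If the compute is stopped right here, the abort record (and the pg_xact
# state derived from it on the pageserver side) may never propagate.
```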
We can do 1, but issue the fsync manually from Python code; we already have neon_xlogflush in neontest.c for exactly this purpose. And yes, we'd need to add a wait for propagation. Or does this aborted xact emerge immediately before/during shutdown? Do you have an idea which xact this is?
If this turns out not to be enough, I'd probably just skip the pg_xact comparison.
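A rough sketch of what that could look like from the test side, assuming neon_xlogflush() from neontest.c is callable as a SQL function (exact signature may differ) and treating the pageserver LSN probe as a caller-supplied hypothetical helper:

```python
# Hedged sketch: flush WAL from the test and wait for it to reach the
# pageserver. neon_xlogflush() comes from neontest.c (signature may differ);
# get_pageserver_last_lsn is a hypothetical probe supplied by the caller.
import time
import psycopg2

def lsn_to_int(lsn: str) -> int:
    """Convert an 'X/Y' LSN string to an integer for comparison."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def flush_and_wait(connstr, get_pageserver_last_lsn, timeout=10.0):
    conn = psycopg2.connect(connstr)
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("SELECT neon_xlogflush()")            # force XLogFlush of pending WAL
    cur.execute("SELECT pg_current_wal_flush_lsn()")
    flush_lsn = lsn_to_int(cur.fetchone()[0])
    conn.close()

    # Wait for propagation: poll until the pageserver has ingested that LSN.
    deadline = time.time() + timeout
    while lsn_to_int(get_pageserver_last_lsn()) < flush_lsn:
        if time.time() > deadline:
            raise TimeoutError("pageserver did not catch up to the flushed LSN")
        time.sleep(0.1)
```

The real test suite may already have helpers for both steps; this only shows the shape of "flush, then wait for propagation".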
@knizhnik do you have an idea why it happens in test_multixact? We can add WAL flushing, but I don't understand offhand why it would help here.
We can do "Somehow force Postgres to flush WAL on termination request. But it is not only necessary to call XLogFlush, but also wait until walsender will be able to propagate this changes", but issue fsync manually from python code; we already have neon_xlogflush in neontest.c for exactly this purpose. And yes, we'd need to add wait for propagation. Or this aborted xact emerges immediately before/during shutdown? Do you have an idea what is an xact this is? If this turns out to be not enough, I'd probably just skip pg_xact comparison.
I'm going to see if I can figure this out
FYI: if you want to test this faster and hopefully uncover the flakiness sooner, you can comment out all but the relevant test in neon/vendor/postgresv14/src/test/regress/parallel_schedule, as follows:
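The schedule file is just a list of `test:` groups where `#` starts a comment, so the edit can look roughly like this (the group contents and the kept test name are illustrative):

```
# Comment out the groups you don't need; '#' starts a comment in schedule files.
# test: tablespace
# test: boolean char name varchar text int2 int4 int8 oid float4 float8 ...
test: multixact    # keep only the test you are chasing
```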
Recent attempts have failed to ease this; it failed again here: https://neon-github-public-dev.s3.amazonaws.com/reports/main/6124215906/index.html#suites/158be07438eb5188d40b466b6acfaeb3/eadf5e4006a9544f
This is still failing occasionally: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6659/7814559995/index.html#/testresult/23e61beb205b42f6
Looking at the current situation:
Hmm, the test is supposed to write a diff of the file to a .filediff file, which is supposed to be included as an attachment in the Allure report. I don't see it in the Allure report. Weird. When I simulate the failure locally by modifying the file, it does create a .filediff file, and the allure_attach_from_dir() function does attach it. I wonder what's going on there?
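For context, the attachment step being discussed amounts to "scan the test output directory and attach any *.filediff to the report". A minimal sketch of that idea using the allure API follows; allure_attach_from_dir() itself lives in the repo's test fixtures, so this is only an approximation of what it does:

```python
# Approximate sketch of attaching *.filediff files to the Allure report;
# the repo's allure_attach_from_dir() may differ in details.
from pathlib import Path
import allure

def attach_filediffs(test_output_dir: Path) -> None:
    for path in sorted(test_output_dir.glob("**/*.filediff")):
        allure.attach.file(
            str(path),
            name=path.name,
            attachment_type=allure.attachment_type.TEXT,
        )
```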
This was hopefully fixed by https://github.com/neondatabase/neon/pull/6666. If not, and this still reoccurs, please reopen.
The modified basebackup invocation that uses an explicit LSN is leading to a timeout waiting for that LSN.
This seems to be failing more often since #6666. It is the most frequently failing flaky test over the last 4 days.
The next proposed change is https://github.com/neondatabase/neon/pull/6712
Found in main
@lubennikovaav is it because of the recent changes, or is this a spurious error?