neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

check_restored_datadir_content failure in pg_regress #559

Closed: LizardWizzard closed this issue 8 months ago

LizardWizzard commented 3 years ago

Found in main

batch_pg_regress/test_pg_regress.py:58: in test_pg_regress
    check_restored_datadir_content(zenith_cli, test_output_dir, pg)
fixtures/zenith_fixtures.py:1011: in check_restored_datadir_content
    assert (mismatch, error) == ([], [])
E   AssertionError: assert (['pg_xact/0000'], []) == ([], [])
E     At index 0 diff: ['pg_xact/0000'] != []
E     Full diff:
E     - ([], [])
E     + (['pg_xact/0000'], [])
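
For context, the failing check is essentially a file-by-file comparison of the original and the restored data directory. A minimal sketch of such a check using Python's filecmp, with placeholder directory arguments rather than the real fixture code from fixtures/zenith_fixtures.py:

```python
# Minimal sketch of a datadir-content check in the spirit of
# check_restored_datadir_content; directory arguments are placeholders.
import filecmp
import os

def compare_datadirs(original_dir: str, restored_dir: str):
    # collect every file path, relative to the original data directory
    rel_paths = []
    for root, _dirs, files in os.walk(original_dir):
        for name in files:
            rel_paths.append(os.path.relpath(os.path.join(root, name), original_dir))
    # cmpfiles returns (match, mismatch, errors); shallow=False compares contents
    _match, mismatch, error = filecmp.cmpfiles(
        original_dir, restored_dir, rel_paths, shallow=False
    )
    # this is the shape of the assertion that fails with (['pg_xact/0000'], [])
    assert (mismatch, error) == ([], [])
```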

@lubennikovaav is it because of the recent changes, or is this a spurious error?

lubennikovaav commented 3 years ago

It is presumably a race in the test; it should be fixed by #558.

hlinnaka commented 2 years ago

This happened again: https://app.circleci.com/pipelines/github/neondatabase/neon/6807/workflows/f33d1c05-ceef-40d6-842f-ebc3237166a9/jobs/69445

bojanserafimov commented 2 years ago

happened again https://app.circleci.com/pipelines/github/neondatabase/neon/6851/workflows/7d3d9336-e0a5-4220-a70b-b2003daddeeb/jobs/70018/tests

SomeoneToIgnore commented 2 years ago

Noticed it twice since https://github.com/neondatabase/neon/pull/1872#issuecomment-1145381010 today.

We've revamped the SK <-> PS WAL streaming and made the PS connect to the SK more eagerly than it did before with callmemaybe; that might be very related.

yeputons commented 2 years ago

Another one: https://app.circleci.com/pipelines/github/neondatabase/neon/7064/workflows/4fac5531-8bbe-48f3-8fb1-a3a1f9d485f5/jobs/72593

arssher commented 1 year ago

And again https://github.com/neondatabase/neon/actions/runs/3675484577/jobs/6215126982

knizhnik commented 1 year ago

Looks like the problem is that RecordTransactionAbort(bool isSubXact) does not perform XLogFlush:

 * We do not flush XLOG to disk here, since the default assumption after a
 * crash would be that we aborted, anyway.  For the same reason, we don't
 * need to worry about interlocking against checkpoint start.

So it is really possible that the last aborted transactions never reach the pageserver. I do not think this causes any real problems, except rare test failures because of a pg_xact file mismatch. I see the following ways of addressing it:

  1. Somehow force Postgres to flush WAL on a termination request. But it is not enough to call XLogFlush; we also have to wait until the walsender has propagated these changes.
  2. Disable the test entirely: do not compare the restored and original directories.
  3. Do not compare pg_xact segments (see the sketch below).
  4. Smart comparison of pg_xact segments (for example, ignore aborted transactions, or just the last aborted transaction).

It is not good to introduce changes in the Postgres core just to make our tests pass, so I do not like 1. But disabling or weakening this check is not an exciting proposal either. Any better suggestions?
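
A sketch of what option 3 could look like, assuming a filecmp-style check like the one sketched earlier in this thread; `mismatch` is the list of differing relative paths:

```python
# Sketch of option 3: ignore pg_xact segments when comparing directories.
def without_pg_xact(mismatch: list[str]) -> list[str]:
    # pg_xact files may legitimately differ, because the abort record of
    # the last transactions is never flushed to WAL before shutdown
    return [path for path in mismatch if not path.startswith("pg_xact/")]

# the fixture assertion would then become:
# assert (without_pg_xact(mismatch), error) == ([], [])
```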

arssher commented 1 year ago

We can do 1, but issue the flush manually from python code; we already have neon_xlogflush in neontest.c for exactly this purpose. And yes, we'd need to add a wait for propagation. Or does this aborted xact emerge immediately before/during shutdown? Do you have an idea which xact this is?

If this turns out to be not enough, I'd probably just skip the pg_xact comparison.
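
Driving that from the test could look roughly like this. A sketch only: `endpoint.safe_psql` and `wait_for_last_record_lsn` are hypothetical stand-ins for the test-suite helpers, and the exact signature of neon_xlogflush may differ:

```python
def flush_and_wait(endpoint, wait_for_last_record_lsn):
    # LSN of the last WAL record written on the compute node
    lsn = endpoint.safe_psql("SELECT pg_current_wal_insert_lsn()")[0][0]
    # force any unflushed WAL (e.g. an abort record) out to disk;
    # neon_xlogflush is the test helper from neontest.c mentioned above
    endpoint.safe_psql(f"SELECT neon_xlogflush('{lsn}')")
    # wait until the pageserver reports it has ingested WAL up to that LSN
    wait_for_last_record_lsn(lsn)
```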

koivunej commented 1 year ago

once more: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4082/release/4806590168/index.html#suites/e90f9f55d45ab2a087333a860583a7c3/966501dacb4d63bf/

hlinnaka commented 1 year ago

another one: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4304/5062384433/index.html#suites/e90f9f55d45ab2a087333a860583a7c3/7f35d1745fa08a30/

koivunej commented 1 year ago

with pg15: https://neon-github-public-dev.s3.amazonaws.com/reports/main/5158365624/index.html#suites/e90f9f55d45ab2a087333a860583a7c3/40f71b5882308c46/ (looks the same)

arssher commented 1 year ago

And again: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4468/5244173359/index.html#suites/e90f9f55d45ab2a087333a860583a7c3/750933fb3b3439ce

@knizhnik do you have an idea why it happens in test_multixact? We can add flushing of WAL, but I don't understand off-hand why it would help here.

koivunej commented 1 year ago

test_multixact: https://neon-github-public-dev.s3.amazonaws.com/reports/main/5464338139/index.html#suites/e90f9f55d45ab2a087333a860583a7c3/dfef62c0a4ada04c/

koivunej commented 1 year ago

test_pg_regress: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4682/5531542238/index.html#suites/158be07438eb5188d40b466b6acfaeb3/fb64efef414b56e1

koivunej commented 1 year ago

test_pg_regress: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4892/5752740496/index.html#/testresult/6e70689c82f43e93

awestover commented 1 year ago

> We can do "Somehow force Postgres to flush WAL on a termination request. But it is not enough to call XLogFlush; we also have to wait until the walsender has propagated these changes", but issue the flush manually from python code; we already have neon_xlogflush in neontest.c for exactly this purpose. And yes, we'd need to add a wait for propagation. Or does this aborted xact emerge immediately before/during shutdown? Do you have an idea which xact this is? If this turns out to be not enough, I'd probably just skip the pg_xact comparison.

I'm going to see if I can figure this out

awestover commented 1 year ago

FYI: if you want to reproduce this faster and hopefully uncover the flakiness sooner, you can comment out all but the relevant test in neon/vendor/postgresv14/src/test/regress/parallel_schedule, as follows:

Suggested parallel_schedule:

```
# ----------
# src/test/regress/parallel_schedule
#
# By convention, we put no more than twenty tests in any one parallel group;
# this limits the number of connections needed to run the tests.
# ----------

# # run tablespace by itself, and first, because it forces a checkpoint;
# # we'd prefer not to have checkpoints later in the tests because that
# # interferes with crash-recovery testing.
# test: tablespace

# # ----------
# # The first group of parallel tests
# # ----------
# test: boolean char name varchar text int2 int4 int8 oid float4 float8 bit numeric txid uuid enum money rangetypes pg_lsn regproc

# # ----------
# # The second group of parallel tests
# # strings depends on char, varchar and text
# # numerology depends on int2, int4, int8, float4, float8
# # multirangetypes depends on rangetypes
# # multirangetypes shouldn't run concurrently with type_sanity
# # ----------
# test: strings numerology point lseg line box path polygon circle date time timetz timestamp timestamptz interval inet macaddr macaddr8 multirangetypes create_function_0

# # ----------
# # Another group of parallel tests
# # geometry depends on point, lseg, box, path, polygon and circle
# # horology depends on interval, timetz, timestamp, timestamptz
# # opr_sanity depends on create_function_0
# # ----------
# test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc

# # ----------
# # These four each depend on the previous one
# # ----------
# test: create_function_1
# test: create_type
# test: create_table
# test: create_function_2

# # ----------
# # Load huge amounts of data
# # We should split the data files into single files and then
# # execute two copy tests parallel, to check that copy itself
# # is concurrent safe.
# # ----------
# test: copy copyselect copydml insert insert_conflict

# # ----------
# # More groups of parallel tests
# # ----------
# test: create_misc create_operator create_procedure create_schema
# # These depend on create_misc and create_operator
# test: create_index create_index_spgist create_view index_including index_including_gist

# # ----------
# # Another group of parallel tests
# # ----------
# test: create_aggregate create_function_3 create_cast constraints triggers select inherit typed_table vacuum drop_if_exists updatable_views roleattributes create_am hash_func errors infinite_recurse

# # ----------
# # sanity_check does a vacuum, affecting the sort order of SELECT *
# # results. So it should not run parallel to other tests.
# # ----------
# test: sanity_check

# # ----------
# # Another group of parallel tests
# # Note: the ignore: line does not run random, just mark it as ignorable
# # ----------
# ignore: random
# test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update delete namespace
test: prepared_xacts

# # ----------
# # Another group of parallel tests
# # ----------
# test: brin gin gist spgist privileges init_privs security_label collate matview lock replica_identity rowsecurity object_address tablesample groupingsets drop_operator password identity generated join_hash

# # ----------
# # Additional BRIN tests
# # ----------
# test: brin_bloom brin_multi

# # ----------
# # Another group of parallel tests
# # ----------
# test: create_table_like alter_generic alter_operator misc async dbsize misc_functions sysviews tsrf tid tidscan tidrangescan collate.icu.utf8 incremental_sort

# # rules cannot run concurrently with any test that creates
# # a view or rule in the public schema
# # collate.*.utf8 tests cannot be run in parallel with each other
# test: rules psql psql_crosstab amutils stats_ext collate.linux.utf8

# # run by itself so it can run parallel workers
# test: select_parallel
# test: write_parallel
# test: vacuum_parallel

# # no relation related tests can be put in this group
# test: publication subscription

# # ----------
# # Another group of parallel tests
# # ----------
# test: select_views portals_p2 foreign_key cluster dependency guc bitmapops combocid tsearch tsdicts foreign_data window xmlmap functional_deps advisory_lock indirect_toast equivclass

# # ----------
# # Another group of parallel tests (JSON related)
# # ----------
# test: json jsonb json_encoding jsonpath jsonpath_encoding jsonb_jsonpath

# # ----------
# # Another group of parallel tests
# # NB: temp.sql does a reconnect which transiently uses 2 connections,
# # so keep this parallel group to at most 19 tests
# # ----------
# test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml

# # ----------
# # Another group of parallel tests
# # ----------
# test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize

# # event triggers cannot run concurrently with any test that runs DDL
# # oidjoins is read-only, though, and should run late for best coverage
# test: event_trigger oidjoins

# # this test also uses event triggers, so likewise run it by itself
# test: fast_default

# # run stats by itself because its delay may be insufficient under heavy load
# test: stats
```

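With the schedule trimmed down, a loop like the following can hammer the single test until it trips. A sketch: the test path is taken from the traceback at the top of this issue, and the iteration count is arbitrary:

```python
import subprocess
import sys

# re-run the flaky test until it reproduces; 50 runs is an arbitrary cap
for attempt in range(1, 51):
    result = subprocess.run(
        ["pytest", "batch_pg_regress/test_pg_regress.py::test_pg_regress"]
    )
    if result.returncode != 0:
        sys.exit(f"reproduced the failure on attempt {attempt}")
print("no failure in 50 runs")
```
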
koivunej commented 1 year ago

Recent attempts to ease this have not helped; it failed again: https://neon-github-public-dev.s3.amazonaws.com/reports/main/6124215906/index.html#suites/158be07438eb5188d40b466b6acfaeb3/eadf5e4006a9544f

jcsp commented 9 months ago

This is still failing occasionally https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6659/7814559995/index.html#/testresult/23e61beb205b42f6

jcsp commented 9 months ago

Looking at the current situation:

hlinnaka commented 9 months ago

> This is still failing occasionally https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6659/7814559995/index.html#/testresult/23e61beb205b42f6

Hmm, the test is supposed to write a diff of the file to a .filediff file, which should then be included as an attachment in the Allure report, but I don't see it there. Weird. When I simulate the failure locally by modifying the file, it does create a .filediff file, and the allure_attach_from_dir() function does attach it. I wonder what's going on there?

hlinnaka commented 9 months ago

This was hopefully fixed by https://github.com/neondatabase/neon/pull/6666. If it still recurs, please reopen.

jcsp commented 9 months ago

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6647/7843577515/index.html#suites/158be07438eb5188d40b466b6acfaeb3/526276cf23cb8ed5

The modified basebackup invocation that uses an explicit LSN is leading to a timeout waiting for that LSN.
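
If the timeout comes from the pageserver not yet having ingested WAL up to the requested LSN, one mitigation is to poll before issuing the basebackup. A minimal sketch, where `get_last_record_lsn` is a hypothetical stand-in for however the test reads the pageserver's last-record LSN:

```python
import time

def parse_lsn(s: str) -> int:
    """Convert a Postgres LSN such as '0/16B3748' into a comparable integer."""
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wait_for_lsn(get_last_record_lsn, target_lsn: str, timeout_s: float = 30.0):
    # poll until the pageserver catches up to target_lsn, or give up
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if parse_lsn(get_last_record_lsn()) >= parse_lsn(target_lsn):
            return
        time.sleep(0.1)
    raise TimeoutError(f"LSN did not reach {target_lsn} within {timeout_s}s")
```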

jcsp commented 9 months ago

This seems to be failing more frequently since #6666. It is the most frequently failing flaky test over the last 4 days.

The next proposed change is https://github.com/neondatabase/neon/pull/6712