Failures in `test_restart_endpoint_after_switch_wal`

jcsp commented 1 day ago

This test is failing ~1.5% of the time.

It was added in https://github.com/neondatabase/neon/pull/8943

Examples:

Case 1

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6560/11150102484/index.html#testresult/32593da5b1e125d1/retries

RuntimeError: Run ['/tmp/neon/bin/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-2'] failed:
  stdout:
    Starting existing endpoint ep-2...
    Starting postgres node at 'postgresql://cloud_admin@127.0.0.1:31674/postgres'
    SIGKILL & wait the started process
  stderr:
    command failed: timed out waiting to connect to compute_ctl HTTP

    Caused by:
        0: error sending request for url (http://127.0.0.1:31675/status)
        1: client error (Connect)
        2: tcp connect error: Connection refused (os error 111)
        3: Connection refused (os error 111)

Case 2

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9232/11153678585/index.html#/testresult/ace582de47d6f7a9

test_runner/regress/test_wal_acceptor.py:1076: in test_restart_endpoint_after_switch_wal
    endpoint.safe_psql("SELECT 'works'")
test_runner/fixtures/neon_fixtures.py:330: in safe_psql
    return self.safe_psql_many([query], **kwargs)[0]
test_runner/fixtures/neon_fixtures.py:340: in safe_psql_many
    with closing(self.connect(**kwargs)) as conn:
test_runner/fixtures/neon_fixtures.py:284: in connect
    conn: PgConnection = psycopg2.connect(**self.conn_options(**kwargs))
/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.9/lib/python3.9/site-packages/psycopg2/__init__.py:122: in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
E   psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 28828 failed: server closed the connection unexpectedly
E       This probably means the server terminated abnormally
E       before or while processing the request.

erikgrinaker commented 17 hours ago

The compute fails to restart due to a panic in the WAL proposer (from ep-2/compute.log):

Case 1

PG:2024-10-02 19:33:38.546 GMT [392338] PANIC:  [WP] collected propEpochStartLsn 0/2000000, but basebackup LSN 0/152CBC8

Case 2

PG:2024-10-03 01:01:17.098 GMT [332945] PANIC:  [WP] collected propEpochStartLsn 0/2002078, but basebackup LSN 0/15206D8
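
For triaging further runs, a small sketch like the following could pull the two LSNs out of a `compute.log`. The regex is inferred only from the two PANIC lines quoted above, and `find_lsn_mismatches` is a hypothetical helper name, not something in the test fixtures.

```python
import re

# Pattern inferred from the two PANIC lines quoted in this thread (an assumption,
# not taken from the WAL proposer source).
PANIC_RE = re.compile(
    r"PANIC:\s+\[WP\] collected propEpochStartLsn (?P<prop>[0-9A-Fa-f]+/[0-9A-Fa-f]+), "
    r"but basebackup LSN (?P<base>[0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def find_lsn_mismatches(log_text):
    """Return (propEpochStartLsn, basebackup LSN) pairs for every matching PANIC line."""
    return [(m["prop"], m["base"]) for m in PANIC_RE.finditer(log_text)]


sample = (
    "PG:2024-10-02 19:33:38.546 GMT [392338] PANIC:  [WP] collected "
    "propEpochStartLsn 0/2000000, but basebackup LSN 0/152CBC8\n"
    "PG:2024-10-03 01:01:17.098 GMT [332945] PANIC:  [WP] collected "
    "propEpochStartLsn 0/2002078, but basebackup LSN 0/15206D8\n"
)

for prop, base in find_lsn_mismatches(sample):
    print(f"propEpochStartLsn={prop} basebackup={base}")
```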

I'm not familiar with the background here, perhaps @arssher can shed some light on it.

arssher commented 14 hours ago

https://github.com/neondatabase/neon/pull/9099 (merged October 1, 17:54 UTC) was supposed to fix this. I haven't verified whether that PR is included in the runs above (with force pushes that's probably impossible), but let's see whether the failures repeat on fresh PRs.

erikgrinaker commented 14 hours ago

Random notes from poking around:

- The `propEpochStartLsn` is taken from the safekeeper's `flushLsn` (the last durable LSN).
- The basebackup LSN during recovery is taken from the control file, which we generate.
- The pageserver constructs and sends the basebackup in response to an RPC call. The caller provides the LSN.
- In case 1, the caller did provide an explicit 0/152CBC8 LSN for the base backup.
- The basebackup is adjusted for the last record, but the WAL switch may not have added a new record.
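
To make the gap between the two LSNs concrete, here is a minimal sketch that converts the LSNs from both panics to byte positions and WAL segment numbers. It assumes the Postgres default segment size of 16 MiB, which is not confirmed for these test runs; `parse_lsn` and `describe` are illustrative helpers, not existing fixtures.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: Postgres default (16 MiB)


def parse_lsn(text):
    """Convert an LSN in 'hi/lo' hex notation to a 64-bit byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def describe(prop, base):
    """Compare propEpochStartLsn against the basebackup LSN."""
    p, b = parse_lsn(prop), parse_lsn(base)
    return {
        "gap_bytes": p - b,
        "prop_segment": p // WAL_SEGMENT_SIZE,
        "base_segment": b // WAL_SEGMENT_SIZE,
        "prop_offset_in_segment": p % WAL_SEGMENT_SIZE,
    }


# LSN pairs copied from the two PANIC lines in this thread.
for name, prop, base in [
    ("case 1", "0/2000000", "0/152CBC8"),
    ("case 2", "0/2002078", "0/15206D8"),
]:
    print(name, describe(prop, base))
```

Under that assumption, both basebackup LSNs fall in segment 1 while both `propEpochStartLsn` values fall in segment 2, and in case 1 the proposer LSN sits exactly on the segment boundary. That pattern is consistent with the safekeeper position having moved past a WAL switch that the basebackup LSN does not reflect, though the numbers alone do not prove it.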

Heading into a meeting, will pick this back up on Monday.

erikgrinaker commented 14 hours ago

Thanks @arssher, #9099 makes sense. Let's look for any newer failures.

erikgrinaker commented 13 hours ago

I checked all the failures on main after October 1st. None of them were caused by this problem; they were errors during test cleanup and similar (e.g. `postmaster.pid` missing on stop). I'll take a closer look at them next week.

neondatabase / neon

Failures in `test_restart_endpoint_after_switch_wal` #9259
