neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Failures in `test_restart_endpoint_after_switch_wal` #9259

Open jcsp opened 1 day ago

jcsp commented 1 day ago

This test is failing ~1.5% of the time.

It was added in https://github.com/neondatabase/neon/pull/8943

Examples:

Case 1

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6560/11150102484/index.html#testresult/32593da5b1e125d1/retries

RuntimeError: Run ['/tmp/neon/bin/neon_local', 'endpoint', 'start', '--safekeepers', '1', 'ep-2'] failed:
  stdout:
    Starting existing endpoint ep-2...
    Starting postgres node at 'postgresql://cloud_admin@127.0.0.1:31674/postgres'
    SIGKILL & wait the started process
  stderr:
    command failed: timed out waiting to connect to compute_ctl HTTP

    Caused by:
        0: error sending request for url (http://127.0.0.1:31675/status)
        1: client error (Connect)
        2: tcp connect error: Connection refused (os error 111)
        3: Connection refused (os error 111)
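
The error above reads like a readiness poll giving up: the command waits for the `compute_ctl` HTTP endpoint to answer on `/status`, and the connection is still refused when the deadline passes. Purely as an illustration of that pattern (this is not `neon_local`'s actual code; `wait_for_status`, `http_ok`, their parameters, and the injectable `fetch` hook are hypothetical names), a poll-with-deadline loop might look roughly like this:

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=1.0) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all count as "not ready yet".
        return False


def wait_for_status(
    url: str,
    timeout_s: float = 10.0,
    interval_s: float = 0.1,
    fetch: Callable[[str], bool] = http_ok,  # hypothetical hook, handy for tests
) -> bool:
    """Poll `url` until `fetch` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch(url):
            return True
        if time.monotonic() >= deadline:
            # The caller would report something like "timed out waiting to connect".
            return False
        time.sleep(interval_s)
```

If the process behind the port dies during startup, every probe fails and the loop can only report a timeout, so the real cause has to come from the compute's own log.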

Case 2

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9232/11153678585/index.html#/testresult/ace582de47d6f7a9

test_runner/regress/test_wal_acceptor.py:1076: in test_restart_endpoint_after_switch_wal
    endpoint.safe_psql("SELECT 'works'")
test_runner/fixtures/neon_fixtures.py:330: in safe_psql
    return self.safe_psql_many([query], **kwargs)[0]
test_runner/fixtures/neon_fixtures.py:340: in safe_psql_many
    with closing(self.connect(**kwargs)) as conn:
test_runner/fixtures/neon_fixtures.py:284: in connect
    conn: PgConnection = psycopg2.connect(**self.conn_options(**kwargs))
/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.9/lib/python3.9/site-packages/psycopg2/__init__.py:122: in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
E   psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 28828 failed: server closed the connection unexpectedly
E       This probably means the server terminated abnormally
E       before or while processing the request.

erikgrinaker commented 17 hours ago

The compute fails to restart due to a panic in the WAL proposer (from ep-2/compute.log):

I'm not familiar with the background here, perhaps @arssher can shed some light on it.

arssher commented 14 hours ago

https://github.com/neondatabase/neon/pull/9099 (merged October 1 at 17:54 UTC) was supposed to fix this. I haven't verified whether that PR is included in the runs above (with force pushes it's probably impossible to tell), but let's see if the failures recur on fresh PRs.
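
For a rough sense of how many clean runs it takes before "no new failures" means much, here is a minimal sketch. It assumes runs are independent and that the ~1.5% failure rate reported above would still hold if the fix had not worked; both function names are made up for illustration.

```python
import math


def p_at_least_one_failure(p_fail: float, runs: int) -> float:
    """Probability of seeing one or more failures in `runs` independent runs."""
    return 1.0 - (1.0 - p_fail) ** runs


def clean_runs_needed(p_fail: float, confidence: float = 0.95) -> int:
    """Smallest number of runs for which at least one failure would be
    expected with probability `confidence`, if the bug were still present."""
    if not (0.0 < p_fail < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_fail and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    n = clean_runs_needed(0.015)
    print(n, round(p_at_least_one_failure(0.015, n), 4))
```

Under those assumptions, `clean_runs_needed(0.015)` comes out to 199, so a handful of green runs says little either way.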

erikgrinaker commented 14 hours ago

Random notes from poking around:

Heading into a meeting, will pick this back up on Monday.

erikgrinaker commented 14 hours ago

Thanks @arssher, #9099 makes sense. Let's look for any newer failures.

erikgrinaker commented 13 hours ago

I checked all the failures on main after October 1st. None of them were caused by this problem; they were errors during test cleanup and the like (e.g. postmaster.pid missing on stop). I'll have a closer look at them next week.