pg16 test failed to walredo

koivunej commented 1 year ago

https://neon-github-public-dev.s3.amazonaws.com/reports/main/6325797340/index.html#suites/f588e0a787c49e67b29490359c589fae/79e288f72486af4d

2023-09-27T13:25:42.352576Z ERROR page_service_conn_main{peer_addr=127.0.0.1:39670}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2663 blkno=2 req_lsn=0/150E4E0}:apply_wal_records{tenant_id=e75c2ab8650857b0feb5ce54663d937a pid=187248}: wal-redo-postgres: 2023-09-27 13:25:42.352 GMT [187248] LOG:  registered custom resource manager "neon" with ID 134

2023-09-27T13:25:42.352687Z ERROR page_service_conn_main{peer_addr=127.0.0.1:39680}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150E490}: error applying 8 WAL records 0/14FCB00..0/150E490 (2881 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/270CCF0
2023-09-27T13:25:42.353198Z ERROR page_service_conn_main{peer_addr=127.0.0.1:39670}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2663 blkno=2 req_lsn=0/150E4E0}: error applying 8 WAL records 0/14FDD48..0/150E4E0 (5237 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/2749E90
2023-09-27T13:25:42.353240Z  INFO page_service_conn_main{peer_addr=127.0.0.1:39680}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150E490}:kill_and_wait_impl{pid=187248}: wait successful exit_status=signal: 9 (SIGKILL)
2023-09-27T13:25:42.364534Z ERROR page_service_conn_main{peer_addr=127.0.0.1:39680}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150E490}: error applying 8 WAL records 0/14FCB00..0/150E490 (2881 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/270CCF0
2023-09-27T13:25:42.364608Z  INFO page_service_conn_main{peer_addr=127.0.0.1:39670}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2663 blkno=2 req_lsn=0/150E4E0}:kill_and_wait_impl{pid=187251}: wait successful exit_status=signal: 9 (SIGKILL)

The test was most likely executing pgbench when this happened:

2023-09-27 13:25:41.523 INFO [neon_fixtures.py:1772] Running command '/tmp/neon/pg_install/v16/bin/pgbench -i -I dtGvp -s1 host=localhost port=20762 user=cloud_admin dbname=postgres options='-cstatement_timeout=120s ''
2023-09-27 13:25:45.415 INFO [neon_fixtures.py:350] Executing query: SELECT pg_current_wal_flush_lsn()

Related compute logs:

2023-09-27 13:25:41.805 GMT [186844] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user@localhost:20757'
2023-09-27 13:25:43.895 GMT [186844] ERROR:  could not read block 2 in rel 1663/5/2662.0 from page server at lsn 0/0150E490 at character 116
2023-09-27 13:25:43.895 GMT [186844] DETAIL:  page server returned error: Failed to reconstruct a page image:
2023-09-27 13:25:43.895 GMT [186844] STATEMENT:  SELECT state, to_char(state_change, 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"') AS state_change
                             FROM pg_stat_activity
                             WHERE backend_type = 'client backend'
                                AND pid != pg_backend_pid()
                                AND usename != 'cloud_admin';

Nothing on the pgbench stderr about this.

So far checked:

no duplicates in staging, prod
no duplicates in test logs

Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1695827852887519

koivunej commented 1 year ago

Discussed in triage meeting:

we'll wait for .walredo in test runs (example) then produce a test case and hand over to compute team
if this happens in production, we or compute will have to figure out a test example from prod

jcsp commented 1 year ago

This looks interesting. The failure is in test_pageserver_chaos and there is a .walredo file in the test output.

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5580/6579368120/index.html#suites/094ca0c3798926b0b3676f7b1a8a5bdb/fd56578ed97e87a8

2023-10-19T19:44:52.353146Z ERROR page_service_conn_main{peer_addr=127.0.0.1:46222}:process_query{tenant_id=964bd1b5e088bdd30773977b048b7e87 timeline_id=52fc424b621442810e5f665bfb598ec5}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150FA48}:apply_wal_records{tenant_id=964bd1b5e088bdd30773977b048b7e87 pid=84158}: erroring walredo input saved filename="walredo-1697744692353-2721-0.walredo"\n', '2023-10-19T19:44:52.353169Z ERROR page_service_conn_main{peer_addr=127.0.0.1:46222}:process_query{tenant_id=964bd1b5e088bdd30773977b048b7e87 timeline_id=52fc424b621442810e5f665bfb598ec5}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150FA48}: error applying 4 WAL records 0/14FD348..0/150FA48 (2625 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/2672F98 n_attempts=0: apply_wal_records

problame commented 11 months ago

For posterity: root cause analysis happened in https://app.incident.io/neondb/incidents/40?tab=attachments Slack channel .

List of fixes (all linked as attachments to that incident)

5541
5554
5556
5557
5559
5560
5587

In this Grafana query we see the count drop to zero (barring two known-broken timelines that fail walredo for other reasons):

neondatabase / neon

pg16 test failed to walredo #5404

5541

5554

5556

5557

5559

5560

5587