neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.78k stars 430 forks source link

pg16 test failed to walredo #5404

Closed koivunej closed 11 months ago

koivunej commented 1 year ago

https://neon-github-public-dev.s3.amazonaws.com/reports/main/6325797340/index.html#suites/f588e0a787c49e67b29490359c589fae/79e288f72486af4d

2023-09-27T13:25:42.352576Z ERROR page_service_conn_main{peer_addr=127.0.0.1:39670}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2663 blkno=2 req_lsn=0/150E4E0}:apply_wal_records{tenant_id=e75c2ab8650857b0feb5ce54663d937a pid=187248}: wal-redo-postgres: 2023-09-27 13:25:42.352 GMT [187248] LOG:  registered custom resource manager "neon" with ID 134

2023-09-27T13:25:42.352687Z ERROR page_service_conn_main{peer_addr=127.0.0.1:39680}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150E490}: error applying 8 WAL records 0/14FCB00..0/150E490 (2881 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/270CCF0
2023-09-27T13:25:42.353198Z ERROR page_service_conn_main{peer_addr=127.0.0.1:39670}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2663 blkno=2 req_lsn=0/150E4E0}: error applying 8 WAL records 0/14FDD48..0/150E4E0 (5237 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/2749E90
2023-09-27T13:25:42.353240Z  INFO page_service_conn_main{peer_addr=127.0.0.1:39680}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150E490}:kill_and_wait_impl{pid=187248}: wait successful exit_status=signal: 9 (SIGKILL)
2023-09-27T13:25:42.364534Z ERROR page_service_conn_main{peer_addr=127.0.0.1:39680}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150E490}: error applying 8 WAL records 0/14FCB00..0/150E490 (2881 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/270CCF0
2023-09-27T13:25:42.364608Z  INFO page_service_conn_main{peer_addr=127.0.0.1:39670}:process_query{tenant_id=e75c2ab8650857b0feb5ce54663d937a timeline_id=d79343ca5041fdd610bf22f8ff451223}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2663 blkno=2 req_lsn=0/150E4E0}:kill_and_wait_impl{pid=187251}: wait successful exit_status=signal: 9 (SIGKILL)

The test was most likely executing pgbench when this happened:

2023-09-27 13:25:41.523 INFO [neon_fixtures.py:1772] Running command '/tmp/neon/pg_install/v16/bin/pgbench -i -I dtGvp -s1 host=localhost port=20762 user=cloud_admin dbname=postgres options='-cstatement_timeout=120s ''
2023-09-27 13:25:45.415 INFO [neon_fixtures.py:350] Executing query: SELECT pg_current_wal_flush_lsn()

Related compute logs:

2023-09-27 13:25:41.805 GMT [186844] LOG:  [NEON_SMGR] libpagestore: connected to 'postgresql://no_user@localhost:20757'
2023-09-27 13:25:43.895 GMT [186844] ERROR:  could not read block 2 in rel 1663/5/2662.0 from page server at lsn 0/0150E490 at character 116
2023-09-27 13:25:43.895 GMT [186844] DETAIL:  page server returned error: Failed to reconstruct a page image:
2023-09-27 13:25:43.895 GMT [186844] STATEMENT:  SELECT state, to_char(state_change, 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"') AS state_change
                             FROM pg_stat_activity
                             WHERE backend_type = 'client backend'
                                AND pid != pg_backend_pid()
                                AND usename != 'cloud_admin';

Nothing on the pgbench stderr about this.

So far checked:

Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1695827852887519

koivunej commented 1 year ago

Discussed in triage meeting:

jcsp commented 1 year ago

This looks interesting. The failure is in test_pageserver_chaos and there is a .walredo file in the test output.

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5580/6579368120/index.html#suites/094ca0c3798926b0b3676f7b1a8a5bdb/fd56578ed97e87a8

2023-10-19T19:44:52.353146Z ERROR page_service_conn_main{peer_addr=127.0.0.1:46222}:process_query{tenant_id=964bd1b5e088bdd30773977b048b7e87 timeline_id=52fc424b621442810e5f665bfb598ec5}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150FA48}:apply_wal_records{tenant_id=964bd1b5e088bdd30773977b048b7e87 pid=84158}: erroring walredo input saved filename="walredo-1697744692353-2721-0.walredo"\n', '2023-10-19T19:44:52.353169Z ERROR page_service_conn_main{peer_addr=127.0.0.1:46222}:process_query{tenant_id=964bd1b5e088bdd30773977b048b7e87 timeline_id=52fc424b621442810e5f665bfb598ec5}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150FA48}: error applying 4 WAL records 0/14FD348..0/150FA48 (2625 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/2672F98 n_attempts=0: apply_wal_records
problame commented 11 months ago

For posterity: root cause analysis happened in https://app.incident.io/neondb/incidents/40?tab=attachments Slack channel .

List of fixes (all linked as attachments to that incident)

In this Grafana query we see the count drop to zero (barring two known-broken timelines that fail walredo for other reasons):

image