Closed — koivunej closed this issue 11 months ago.
Discussed in triage meeting: `.walredo` files showing up in test runs (example); produce a test case and hand it over to the compute team.

This looks interesting. The failure is in `test_pageserver_chaos`, and there is a `.walredo` file in the test output:
```
2023-10-19T19:44:52.353146Z ERROR page_service_conn_main{peer_addr=127.0.0.1:46222}:process_query{tenant_id=964bd1b5e088bdd30773977b048b7e87 timeline_id=52fc424b621442810e5f665bfb598ec5}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150FA48}:apply_wal_records{tenant_id=964bd1b5e088bdd30773977b048b7e87 pid=84158}: erroring walredo input saved filename="walredo-1697744692353-2721-0.walredo"
2023-10-19T19:44:52.353169Z ERROR page_service_conn_main{peer_addr=127.0.0.1:46222}:process_query{tenant_id=964bd1b5e088bdd30773977b048b7e87 timeline_id=52fc424b621442810e5f665bfb598ec5}:handle_pagerequests:handle_get_page_at_lsn_request{rel=1663/5/2662 blkno=2 req_lsn=0/150FA48}: error applying 4 WAL records 0/14FD348..0/150FA48 (2625 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/2672F98 n_attempts=0: apply_wal_records
```
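The first log line shows the behavior that makes "produce a test case and hand over to compute team" feasible: when redo fails, the pageserver dumps the exact input it fed to walredo into a `walredo-….walredo` file. Below is a minimal sketch of that dump-on-error pattern; the filename fields (millisecond timestamp, an id, an attempt counter) are inferred from the single log line above, and none of this is the actual pageserver code:

```rust
use std::fs;
use std::io::Write;
use std::process;
use std::time::{SystemTime, UNIX_EPOCH};

/// Persist the input that made WAL redo fail, so it can be replayed later
/// as a standalone test case. Filename layout mimics the log line above;
/// the meaning of the middle field is an assumption.
fn save_erroring_walredo_input(input: &[u8], attempt: u32) -> std::io::Result<String> {
    let millis = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before epoch")
        .as_millis();
    let filename = format!("walredo-{millis}-{}-{attempt}.walredo", process::id());
    let mut f = fs::File::create(&filename)?;
    f.write_all(input)?;
    Ok(filename)
}

fn main() -> std::io::Result<()> {
    // Stand-in for the serialized base image + WAL records that made redo fail.
    let redo_input: &[u8] = b"base image bytes + 4 WAL records";
    let saved = save_erroring_walredo_input(redo_input, 0)?;
    eprintln!("erroring walredo input saved filename={saved:?}");
    Ok(())
}
```

Persisting the failing input at error time means the bug can be replayed deterministically later, without re-running the whole chaos test.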
For posterity: root cause analysis happened in the incident's Slack channel; see https://app.incident.io/neondb/incidents/40?tab=attachments.
List of fixes (all linked as attachments to that incident):
In this Grafana query we see the walredo failure count drop to zero (barring two known-broken timelines that fail walredo for other reasons).

Test report for the failing run: https://neon-github-public-dev.s3.amazonaws.com/reports/main/6325797340/index.html#suites/f588e0a787c49e67b29490359c589fae/79e288f72486af4d
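To turn a run like this into a test case, the first step is just locating the `.walredo` dumps in the test output. A hedged sketch of that step — the output directory layout is an assumption, so point it at wherever the test harness keeps pageserver workdirs:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect `*.walredo` dumps under a test-output directory
/// so they can be attached to an issue or turned into a repro.
fn find_walredo_dumps(dir: &Path, found: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            find_walredo_dumps(&path, found)?;
        } else if path.extension().and_then(|e| e.to_str()) == Some("walredo") {
            found.push(path);
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    let mut dumps = Vec::new();
    find_walredo_dumps(Path::new(&root), &mut dumps)?;
    for p in &dumps {
        println!("{} ({} bytes)", p.display(), fs::metadata(p)?.len());
    }
    Ok(())
}
```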
The test was most likely executing pgbench when this happened:
Related compute logs:
Nothing on the pgbench stderr about this.
So far checked:
Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1695827852887519