neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Bug: `walreceiver` did not restart after erroring out #8172

Open kelvich opened 2 days ago

kelvich commented 2 days ago

Got an interesting case with one of the production read-only endpoints. Walreceiver errored out and died (log entries are newest first):

```
2024-06-18 14:09:16.961  {"app":"NeonVM","endpoint_id":"ep-winter-rice-59233042","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 14:09:16.598 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=53100 [493] FATAL:  could not write to file \"pg_wal/xlogtemp.493\": No space left on device"}
2024-06-18 08:22:29.288 {"app":"NeonVM","endpoint_id":"ep-winter-rice-59233042","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 08:22:29.127 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=00000 [493] LOG:  skipping missing configuration file \"/var/db/postgres/compute/pgdata/compute_ctl_temp_override.conf\""}
2024-06-18 08:22:24.446 {"app":"NeonVM","pod":"compute-lingering-forest-a2yogi5o-6kzkf","_entry":"PG:2024-06-18 08:22:24.347 GMT ttid=a27b300c2ff46c602a1635ab92d236f3/03baf9167a86378faa6375d8273d0f6d sqlstate=00000 [493] LOG:  started streaming WAL from primary at 3/49000000 on timeline 1"}
```

but then it did not start again.
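
For checking other endpoints for the same symptom, here is a rough Python sketch that scans log lines in the format above and reports whether a walreceiver FATAL was never followed by a new "started streaming WAL" message. The function names and the regex are my own assumptions based only on the three entries shown here; the real log schema may have more variants, and treating any PID that logged "started streaming WAL" as a walreceiver is a heuristic.

```python
import json
import re
from datetime import datetime

# Assumed layout of the "_entry" field, inferred from the entries in this issue:
# "PG:<timestamp> GMT ttid=... sqlstate=... [<pid>] <LEVEL>:  <message>"
ENTRY_RE = re.compile(
    r"PG:(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) GMT .*? "
    r"\[(?P<pid>\d+)\] (?P<level>[A-Z]+):\s+(?P<msg>.*)",
    re.DOTALL,
)
STREAMING_PREFIX = "started streaming WAL"


def parse_line(line):
    """Parse '<collector timestamp> <json>' and return the Postgres-side event, or None."""
    payload = json.loads(line[line.index("{"):])
    match = ENTRY_RE.match(payload.get("_entry", ""))
    if match is None:
        return None
    return {
        "ts": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f"),
        "pid": int(match["pid"]),
        "level": match["level"],
        "msg": match["msg"],
    }


def has_unrecovered_walreceiver_fatal(lines):
    """True if the last walreceiver FATAL has no later 'started streaming WAL' entry."""
    events = [e for e in map(parse_line, lines) if e is not None]
    events.sort(key=lambda e: e["ts"])  # input order does not matter

    # Heuristic: a PID that logged "started streaming WAL" is a walreceiver.
    walreceiver_pids = {e["pid"] for e in events if e["msg"].startswith(STREAMING_PREFIX)}
    fatals = [e for e in events if e["level"] == "FATAL" and e["pid"] in walreceiver_pids]
    if not fatals:
        return False

    last_fatal = fatals[-1]["ts"]
    restarted = any(
        e["ts"] > last_fatal and e["msg"].startswith(STREAMING_PREFIX) for e in events
    )
    return not restarted
```

Passing the three log lines above as a list of strings should return `True`, since PID 493 logs the FATAL at 14:09:16 and no later streaming message appears.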

https://neondb.slack.com/archives/C04DGM6SMTM/p1719394592373479
https://console.neon.tech/admin/regions/aws-eu-central-1/computes/compute-lingering-forest-a2yogi5o

Heikki suggested trying to reproduce this manually by adding `elog(FATAL, "crashme")` in walsender.