neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

WalResidentTimeline guard timeout in `test_pgbench_intensive_init_workload` #8260

Open jcsp opened 2 months ago

jcsp commented 2 months ago

https://neon-github-public-dev.s3.amazonaws.com/reports/main/9788782185/index.html#suites/9681106e61a1222669b9d22ab136d07b/d1be7eff769cc3f6/

WARN {cid=11 ttid=73223ddb74413de6940dc60f77d4e716/f8859612f7b238e064ee175c86656e98}:WAL receiver: timeout while acquiring WalResidentTimeline guard, statuses StateSnapshot => UpdateControlFile

This test is intentionally aggressive, but it runs on a dedicated machine, so if we're hitting timeouts it might be a sign of a real issue.

jcsp commented 2 months ago

@petuhovskiy can you help us interpret the warning message? Does this imply that a download from S3 was taking a long time?

petuhovskiy commented 2 months ago

Thanks for the ping. I looked at the logs, and it seems the node is just overloaded (CPU/disk), so everything is very slow.

This message means that the manager was not fast enough to reply within 30 seconds. This usually means that the timeline shared state was blocked by disk operations while the manager was waiting for the mutex.

S3 evictions are not enabled yet.
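To illustrate the failure mode described above, here is a minimal Python sketch, not the safekeeper's actual code. It assumes a manager thread that must take a shared-state lock before it can reply to a guard request, and a slow disk operation that holds that lock. All names (`request_guard`, `manager_loop`, `simulate`) are hypothetical, and the timings are scaled down from the 30-second timeout so the example runs quickly.

```python
import queue
import threading
import time


def request_guard(requests, timeout_s):
    """Ask the manager for a guard; return True if it replied in time."""
    reply = queue.Queue(maxsize=1)
    requests.put(reply)
    try:
        reply.get(timeout=timeout_s)
        return True
    except queue.Empty:
        # Corresponds to the "timeout while acquiring ... guard" warning.
        return False


def manager_loop(shared_state_lock, requests):
    """Hypothetical manager: needs the shared-state lock before replying."""
    while True:
        reply = requests.get()
        if reply is None:  # shutdown signal
            return
        with shared_state_lock:
            reply.put("guard")


def simulate(disk_hold_s, timeout_s):
    """Run one guard request while a disk operation holds the lock."""
    lock = threading.Lock()
    requests = queue.Queue()
    lock_taken = threading.Event()

    def slow_disk_op():
        # Stand-in for a disk operation done under the shared-state lock.
        with lock:
            lock_taken.set()
            time.sleep(disk_hold_s)

    disk = threading.Thread(target=slow_disk_op)
    manager = threading.Thread(target=manager_loop, args=(lock, requests))
    disk.start()
    lock_taken.wait()  # make sure the disk op owns the lock first
    manager.start()
    ok = request_guard(requests, timeout_s)
    requests.put(None)
    disk.join()
    manager.join()
    return ok
```

Under these assumptions, `simulate(0.5, 0.1)` times out because the lock is held longer than the requester is willing to wait, while `simulate(0.01, 2.0)` succeeds. The sketch only shows why a slow disk can surface as a guard timeout even when nothing is being downloaded from S3.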

jcsp commented 2 months ago

Triage: it's interesting that this hits a 30s timeout, since the test doesn't look ultra-heavy (it isn't using multiple threads or anything like that). Let's compare the pgbench config against the hardware we're running on.