neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

`test_compute_pageserver_connection_stress` flakiness #6688

Open jcsp opened 4 months ago

jcsp commented 4 months ago

This is a test that injects page_service request failures.

Occasionally it fails during compute startup, with the compute unable to get a basebackup.

I thought https://github.com/neondatabase/neon/pull/6537 would fix this, but it appears it hasn't, so the question is: are the new retries not working, or is this test somehow failing differently?
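
For context, the retries referenced above amount to a bounded retry loop around the basebackup request during compute startup. A minimal sketch of that idea in Python, purely illustrative: the function name, attempt count, and backoff values are assumptions, and the real logic lives in compute startup code, not here.

```python
import time

def fetch_basebackup_with_retries(fetch_once, max_attempts=5, base_delay_s=0.5):
    """Bounded retry loop around a basebackup request.

    `fetch_once` is any callable that asks the pageserver for a basebackup
    and raises on connection failure. The attempt count and backoff values
    are illustrative, not what the real compute startup code uses.
    """
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return fetch_once()
        except (ConnectionError, OSError) as exc:
            last_exc = exc
            time.sleep(base_delay_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"basebackup failed after {max_attempts} attempts") from last_exc
```

With something like this in place, the test should only fail hard when the injected faults outlast the retry budget, or when the failure happens outside the retried path.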

save-buffer commented 4 months ago

Is there a chance you could link an Allure report from a run that failed?

jcsp commented 4 months ago

Here's one: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6658/7832195602/index.html#suites/17ce3111e92c0f109f76121e2725061b/210d0ad1c5e47345/retries

(I slacked a link to a wiki page that lets you directly fetch recent failures)

save-buffer commented 4 months ago

Seems that there are two issues:

  1. Sometimes it really does fail five times (see here and here)
  2. Sometimes there's a free() on an invalid pointer happening in Postgres (see here and here)

jcsp commented 4 months ago

This has failed 14 times in the past 48h -- @save-buffer, any progress on stabilizing it?

save-buffer commented 3 months ago

Should be fixed by #6976

jcsp commented 3 months ago

The test is still failing frequently over the 3 days since #6688 merged.

petuhovskiy commented 3 months ago

Another failure: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7079/8233321973/index.html#/testresult/8c4d8d2b95cdcb4a

In the compute logs I see:

2024-03-11T13:22:56.193741Z ERROR error while post_apply_config: handle_neon_extension_upgrade: connection closed
PG:2024-03-11 13:22:56.174 GMT [7830] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:30544'
PG:2024-03-11 13:22:56.304 GMT [7816] LOG:  server process (PID 7830) was terminated by signal 11: Segmentation fault
PG:2024-03-11 13:22:56.304 GMT [7816] DETAIL:  Failed process was running: ALTER EXTENSION neon UPDATE
PG:2024-03-11 13:22:56.304 GMT [7816] LOG:  terminating any other active server processes
PG:2024-03-11 13:22:56.305 GMT [7816] LOG:  shutting down because restart_after_crash is off

So it looks like this test is good at finding bugs, but our Postgres code is not yet solid enough to survive unusual compute<->pageserver connection breaks.
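
To make that failure mode easy to check for, here is a minimal, hypothetical reproduction sketch using psycopg2; the DSN, env var, and function name are assumptions, and the real test drives this through the Neon Python test fixtures instead. It runs the same `ALTER EXTENSION neon UPDATE` statement from the log and treats a dropped session as a crash while tolerating a clean SQL error.

```python
import os
import psycopg2

# Placeholder DSN for a local compute endpoint; adjust to your environment.
DSN = os.environ.get("COMPUTE_DSN", "postgresql://localhost:55432/postgres")

def neon_extension_upgrade_survives_connection_breaks():
    """Run the statement from the failing log and classify the outcome.

    A clean SQL error is tolerable while pageserver connections are being
    broken; the session dying means the backend crashed, which is the
    failure mode in the report above.
    """
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute("ALTER EXTENSION neon UPDATE")
        print("extension upgrade succeeded")
    except psycopg2.OperationalError as exc:
        # Connection lost mid-statement: after the segfault above, Postgres
        # terminates all backends, so the client sees its session die.
        raise AssertionError(f"backend died during extension upgrade: {exc}") from exc
    except psycopg2.Error as exc:
        # An ordinary SQL error is an acceptable outcome under fault injection.
        print(f"extension upgrade failed cleanly: {exc}")
    finally:
        conn.close()
```

Under fault injection, a clean error from the statement would be acceptable; the session dying, as in the log above where the backend segfaulted and Postgres shut down, is the bug.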

save-buffer commented 3 months ago

Here's another bug I discovered: https://github.com/neondatabase/neon/pull/7095. Fixing it may help stabilize the test.

save-buffer commented 3 months ago

OK, https://github.com/neondatabase/neon/pull/7095 is merged. Let's keep an eye on the test; if it stops failing as often, we can close the issue again.

jcsp commented 3 months ago

4 failures in the last 3 days, so there's still work to do here.

save-buffer commented 3 months ago

Opened #7281

kelvich commented 2 weeks ago

@save-buffer still flaky