neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.

https://neon.tech

Apache License 2.0

test_branching_with_pgbench failure #5854

Closed koivunej closed 5 months ago

koivunej commented 11 months ago

Failures on Postgres 16

test_branching_with_pgbench[flat-1-10]: debug

It is unclear what caused this. The problem has been seen with both the cascade and flat variants. Merging #5520 made it less frequent on merges.

Some failure logs show what looks like a "rogue primary compute", but we haven't been able to work out which compute that could be; we might even be missing some logging.

@petuhovskiy had an idea about collapsing all of the logs.

Originally posted by @koivunej in https://github.com/neondatabase/neon/issues/5851#issuecomment-1806281319
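
One possible reading of "collapsing all of the logs" is merging the per-component log files into a single timeline, so that a second primary compute would show up next to the events it interferes with. A minimal sketch under that assumption follows. It only understands the Postgres-style timestamp prefix seen in compute.log, and `parse_lines` and `merge_logs` are hypothetical helper names, not part of the test suite.

```python
import heapq
import re
from datetime import datetime
from typing import Dict, Iterable, Iterator, List, Tuple

# Postgres-style prefix, e.g. "2023-11-27 13:10:07.595 GMT [135907] PANIC: ..."
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) GMT ")


def parse_lines(source: str, lines: Iterable[str]) -> Iterator[Tuple[datetime, str, str]]:
    """Yield (timestamp, source, line) for every line of one log.

    Lines without a timestamp (continuations) inherit the previous timestamp.
    """
    last = datetime.min
    for line in lines:
        match = TS_RE.match(line)
        if match:
            last = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        yield last, source, line.rstrip("\n")


def merge_logs(logs: Dict[str, Iterable[str]]) -> List[str]:
    """Merge several logs into one list ordered by timestamp.

    Assumes each individual log is already in timestamp order.
    """
    streams = [parse_lines(name, lines) for name, lines in logs.items()]
    merged = heapq.merge(*streams, key=lambda item: item[0])
    return [f"{source}: {line}" for _, source, line in merged]
```

Passing a dict that maps a label such as the endpoint directory name to an open file handle is enough; pageserver and safekeeper logs would need their own timestamp patterns.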

koivunej commented 11 months ago

More instances: https://neon-github-public-dev.s3.amazonaws.com/reports/main/6942621581/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/9a0d540d56f9c375/ -- see also retries.

koivunej commented 11 months ago

More: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5902/6959521385/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/7709742b5531b61f

lubennikovaav commented 10 months ago

One more instance of this: https://neon-github-public-dev.s3.amazonaws.com/reports/main/7005229440/index.html#testresult/377a59f7390cf70c, with this error in repo/endpoints/ep-5/compute.log:

2023-11-27 13:10:07.595 GMT [135907] PANIC:  WAL acceptor 127.0.0.1:18369 with term 3 rejected our request, our term 2
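
The PANIC above reads like the stale-term rejection common to term-based consensus protocols: an acceptor that has already seen term 3 refuses a proposer still on term 2, which fits the "rogue primary compute" suspicion. The toy model below illustrates that rule only. It is not Neon's walproposer or safekeeper code, and the `Acceptor` class and `handle_request` method are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Acceptor:
    """Toy acceptor that remembers the highest term it has seen."""

    term: int = 0

    def handle_request(self, proposer_term: int) -> bool:
        """Accept a request only if the proposer's term is not stale."""
        if proposer_term < self.term:
            return False  # a proposer with a higher term was seen earlier
        self.term = proposer_term  # adopt the newer (or equal) term
        return True


acceptor = Acceptor()
assert acceptor.handle_request(2)      # the compute under test, term 2
assert acceptor.handle_request(3)      # some other proposer reaches term 3
assert not acceptor.handle_request(2)  # the original compute is now rejected
```

In this model the rejection by itself proves that some other proposer reached term 3 first, so the open question is which process that was.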

bayandin commented 10 months ago

3 more examples:

problame commented 10 months ago

Slack thread with investigation & some optimistic insights: https://neondb.slack.com/archives/C04KGFVUWUQ/p1701169779613809

koivunej commented 10 months ago

It seems merging #5520 may have changed the situation. Upon further inspection it should not have helped, but perhaps the new test case it added changed the test execution order, and that helped.

koivunej commented 10 months ago

Re: idea about us not collecting all of the logs: #5992 fixes one such case, but I fail to see how it could affect this particular test case.

koivunej commented 10 months ago

https://neon-github-public-dev.s3.amazonaws.com/reports/main/7050113183/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/c4ad61d05db29119 prompted PR #6004, but quoting @arssher:

Figure out what causes the initial 5m timeout.

jcsp commented 5 months ago

This doesn't appear to have failed in the recent past, based on the test database output.