More instances: https://neon-github-public-dev.s3.amazonaws.com/reports/main/6942621581/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/9a0d540d56f9c375/ -- see also retries.
One more instance of this:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/7005229440/index.html#testresult/377a59f7390cf70c
with this error in `repo/endpoints/ep-5/compute.log`:
`2023-11-27 13:10:07.595 GMT [135907] PANIC: WAL acceptor 127.0.0.1:18369 with term 3 rejected our request, our term 2`
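For context, this rejection is the standard Paxos-style term check: an acceptor that has already seen a higher term refuses a proposer with a stale one, and the walproposer treats the rejection as fatal. A minimal sketch of the rule in Python (the real safekeeper is Rust; the class and field names here are illustrative, not Neon's actual API):

```python
from dataclasses import dataclass


@dataclass
class ProposerGreeting:
    """Toy stand-in for a walproposer's handshake message."""
    term: int


class WalAcceptor:
    """Toy model of a safekeeper's term handling (illustrative names)."""

    def __init__(self) -> None:
        self.term = 0  # highest term this acceptor has acknowledged

    def handle_greeting(self, msg: ProposerGreeting) -> bool:
        # A proposer with a stale term is rejected; the compute logs this
        # as "WAL acceptor ... with term N rejected our request, our term M"
        # and PANICs, because it has been superseded by a newer primary.
        if msg.term < self.term:
            return False
        self.term = msg.term
        return True


acceptor = WalAcceptor()
assert acceptor.handle_greeting(ProposerGreeting(term=3))      # newer term wins
assert not acceptor.handle_greeting(ProposerGreeting(term=2))  # stale term is rejected
```

In this failure the compute was running with term 2 while the acceptor had already moved to term 3, i.e. something appears to have started a newer primary against the same safekeepers.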
3 more examples:
Slack thread with investigation & some optimistic insights: https://neondb.slack.com/archives/C04KGFVUWUQ/p1701169779613809
It seems merging #5520 may have changed the situation. On further inspection it should not have helped directly, but perhaps the new test case it added changed the test execution order.
Re: the idea that we are not collecting all of the logs: #5992 fixes one such case, but I fail to see how it could affect this particular test case.
https://neon-github-public-dev.s3.amazonaws.com/reports/main/7050113183/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/c4ad61d05db29119 caused the #6004 PR, but quoting @arssher:
> Figure out what causes the initial 5m timeout.
This doesn't appear to have failed in the recent past, based on the test database output.
Unclear what caused this. The problem has been noticed with both `cascade` and `flat`. Merging #5520 made it less frequent on merge runs. Some failure logs show what looks like a "rogue primary compute", but we haven't been able to work out which compute that could be; we might even be missing some logging (one way to check for a rogue proposer is sketched below). @petuhovskiy had an idea about collecting all of the logs.
Originally posted by @koivunej in https://github.com/neondatabase/neon/issues/5851#issuecomment-1806281319
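The "rogue primary compute" reading would mean two walproposers talking to the same safekeepers, with one bumping the term and getting the other rejected. A minimal, hypothetical helper to check the collected artifacts for that (the glob follows the `repo/endpoints/ep-5/compute.log` layout above; the regex is an assumption about the log wording, not the exact compute.log format):

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern: any compute.log line mentioning a term, such as the
# PANIC quoted above. This is a heuristic, not the exact log format.
TERM_RE = re.compile(r"\bterm (\d+)\b")


def terms_by_endpoint(repo_dir: str) -> dict[str, set[int]]:
    """Collect every term number each endpoint's compute.log mentions."""
    terms: dict[str, set[int]] = defaultdict(set)
    for log in Path(repo_dir).glob("endpoints/*/compute.log"):
        for line in log.read_text(errors="replace").splitlines():
            for m in TERM_RE.finditer(line):
                terms[log.parent.name].add(int(m.group(1)))
    return terms


if __name__ == "__main__":
    for endpoint, seen in sorted(terms_by_endpoint("repo").items()):
        print(endpoint, sorted(seen))
```

If two endpoints show term activity in the same failure window, that would support the rogue-primary theory; if only one endpoint ever appears, the missing-logs explanation becomes more likely.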