Open LizardWizzard opened 1 year ago
The suggested fix is to increase the number of retries in smgr so that it can hide a pageserver restart from the client.
DoD: a test that verifies that retries in smgr actually work and hide a pageserver restart. I.e., start a query, stop the pageserver, sleep for some time, start the pageserver again; the query shouldn't fail.
Related: #4205
But a flush failure used to be a recoverable error!
static NeonResponse *
page_server_request(void const *req)
{
    NeonResponse* resp;
    do {
        while (!page_server->send((NeonRequest *) req) || !page_server->flush());
        MyPState->ring_flush = MyPState->ring_unused;
        consume_prefetch_responses();
        resp = page_server->receive();
    } while (resp == NULL);
    return resp;
}
So if the client observed an SMGR error, it means the connection could not be reestablished within 10 attempts with a 1-second timeout.
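To make the retry budget concrete, here is a minimal sketch of a bounded retry loop against a simulated pageserver outage. It is not the actual smgr code: `FakePageserver`, `fake_send_and_flush` and `request_with_retries` are hypothetical names, and the clock is simulated rather than slept on. It only illustrates that an outage longer than attempts × delay surfaces as an error, while a larger budget hides the same restart.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pageserver: it is unreachable while the
 * simulated clock is inside [down_from, down_until). */
typedef struct {
    int now;        /* simulated clock, in seconds */
    int down_from;  /* outage start (inclusive) */
    int down_until; /* outage end (exclusive) */
} FakePageserver;

/* Hypothetical send+flush: succeeds only when the pageserver is up. */
static bool
fake_send_and_flush(const FakePageserver *ps)
{
    return ps->now < ps->down_from || ps->now >= ps->down_until;
}

/* Try up to max_attempts times; after each failure advance the simulated
 * clock by delay_s seconds (this stands in for sleeping before reconnecting).
 * Returns true if the request got through, false if the budget ran out. */
static bool
request_with_retries(FakePageserver *ps, int max_attempts, int delay_s)
{
    assert(max_attempts > 0 && delay_s >= 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (fake_send_and_flush(ps))
            return true;
        ps->now += delay_s;
    }
    return false;
}
```

In this model, a 30-second restart with 10 attempts and a 1-second delay returns false, which corresponds to the error the client sees. Raising the budget to 60 attempts lets the same call return true once the pageserver is back, which is the behaviour the DoD test should check against a real pageserver.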
Steps to reproduce
Have a running compute, restart pageserver
Expected result
No errors
Actual result
ERROR XX000 (internal_error) [NEON_SMGR] failed to flush page requests:
Environment
prod
Logs, links