neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Client observes SMGR error during pageserver restarts: "failed to flush page requests": #4497

Open LizardWizzard opened 1 year ago

LizardWizzard commented 1 year ago

Steps to reproduce

Have a running compute, restart pageserver

Expected result

No errors

Actual result

ERROR XX000 (internal_error) [NEON_SMGR] failed to flush page requests:

Environment

prod

Logs, links

LizardWizzard commented 1 year ago

The suggested fix is to increase the number of retries in smgr so that it can hide a pageserver restart from the client.

DoD is to have a test that verifies that the retries in smgr actually work and hide a pageserver restart. I.e., start a query, stop the pageserver, sleep for some time, start the pageserver back up again, and the query shouldn't fail.

koivunej commented 1 year ago

Related: #4205

koivunej commented 12 months ago

Now the error looks like: https://github.com/neondatabase/neon/blob/d7fa2dba2d3eaad9f7693f25ae07ed77f5ba9bf8/pgxn/neon/libpagestore.c#L370

knizhnik commented 12 months ago

But a flush failure used to be a recoverable error!

static NeonResponse *
page_server_request(void const *req)
{
    NeonResponse* resp;
    do {
        while (!page_server->send((NeonRequest *) req) || !page_server->flush());
        MyPState->ring_flush = MyPState->ring_unused;
        consume_prefetch_responses();
        resp = page_server->receive();
    } while (resp == NULL);
    return resp;
}

So if the client observed an SMGR error, it means the connection could not be reestablished within 10 attempts with a 1-second timeout.
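To make the retry budget concrete, here is a minimal sketch of a bounded reconnect loop. It is not the code from libpagestore.c: `FakePageserver`, `fake_connect`, `reconnect_with_retries`, and `outage_is_hidden` are hypothetical names invented for illustration, and the delay between attempts is only marked by a comment so the sketch runs instantly. The pageserver is modeled as staying down for a fixed number of connection attempts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pageserver: it rejects the first
 * `outage_attempts` connection attempts and accepts every later one. */
typedef struct
{
    int outage_attempts;    /* attempts that fail before recovery */
    int attempts_made;      /* attempts performed so far */
} FakePageserver;

/* One connection attempt; returns true once the outage is over. */
static bool
fake_connect(FakePageserver *ps)
{
    ps->attempts_made++;
    return ps->attempts_made > ps->outage_attempts;
}

/* Try to reconnect at most `max_attempts` times.
 * Returns true if some attempt succeeded, false if the budget ran out
 * (the case in which the client would see the error). */
static bool
reconnect_with_retries(FakePageserver *ps, int max_attempts)
{
    assert(max_attempts > 0);

    for (int i = 0; i < max_attempts; i++)
    {
        if (fake_connect(ps))
            return true;
        /* A real loop would wait here (e.g. 1 second) before retrying. */
    }
    return false;
}

/* Convenience wrapper: is an outage lasting `outage_attempts` attempts
 * hidden from the client when the budget is `max_attempts`? */
static bool
outage_is_hidden(int outage_attempts, int max_attempts)
{
    FakePageserver ps = {outage_attempts, 0};

    return reconnect_with_retries(&ps, max_attempts);
}
```

Under these assumptions, `outage_is_hidden(5, 10)` returns true, while `outage_is_hidden(30, 10)` returns false: an outage that outlasts the retry budget surfaces to the client, which is why raising the number of retries would hide longer restarts.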