neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

If pageserver is stopped and restarted, next query in compute node fails #1138

Closed hlinnaka closed 4 months ago

hlinnaka commented 2 years ago

If the pageserver is stopped, the connections from the compute node are broken. The compute node doesn't notice that immediately, but it causes the next request, usually the next query, to fail. That causes user-visible errors whenever the pageserver is restarted or the tenant is moved from one pageserver to another.

Let's have the smgr code in the compute node retry the GetPage (or other) request to the pageserver if the connection is broken. Retrying the connection once is probably enough.
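
Something along these lines, as a rough sketch (all function and variable names here are illustrative, not the actual smgr API):

    /*
     * Hedged sketch only, not the actual neon smgr code. All names below
     * (neon_request_with_retry, pageserver_connect, pageserver_send,
     * page_server_conn) are illustrative.
     */
    #include "postgres.h"

    typedef struct PageServerConn PageServerConn;            /* opaque, illustrative */

    extern PageServerConn *pageserver_connect(void);          /* hypothetical */
    extern bool pageserver_send(PageServerConn *conn,
                                const void *request, void *response);   /* hypothetical */
    extern void pageserver_disconnect(PageServerConn **conn);           /* hypothetical */

    static PageServerConn *page_server_conn = NULL;

    /*
     * Send a request to the pageserver; if the connection turns out to be
     * broken, drop it and retry exactly once on a fresh connection.
     */
    static void
    neon_request_with_retry(const void *request, void *response)
    {
        for (int attempt = 0; attempt < 2; attempt++)
        {
            if (page_server_conn == NULL)
                page_server_conn = pageserver_connect();

            if (pageserver_send(page_server_conn, request, response))
                return;                     /* request succeeded */

            /* connection was broken mid-request: drop it and retry once */
            pageserver_disconnect(&page_server_conn);
        }

        ereport(ERROR,
                (errmsg("[NEON_SMGR] could not re-establish connection to pageserver")));
    }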

We have a python test that restarts the pageserver while the compute node is running, and it currently works around this issue by closing the client connection to the compute node and opening a new one. See test_runner/batch_others/test_pageserver_restart.py:

    # Stopping the pageserver breaks the connection from the postgres backend to
    # the page server, and causes the next query on the connection to fail. Start a new
    # postgres connection too, to avoid that error. (Ideally, the compute node would
    # handle that and retry internally, without propagating the error to the user, but
    # currently it doesn't...)
    pg_conn = pg.connect()
    cur = pg_conn.cursor()

When this issue is completed, we can remove that from the test.

LizardWizzard commented 2 years ago

It might also be an improvement to explicitly send Terminate messages and handle them accordingly.
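
For illustration, a minimal sketch of what an explicit clean shutdown could look like on the compute side, assuming the connection is a plain libpq PGconn (the variable and function names are made up):

    #include <libpq-fe.h>

    /* illustrative global; the real connection handle lives in the smgr code */
    static PGconn *pageserver_conn = NULL;

    static void
    pageserver_disconnect_cleanly(void)
    {
        if (pageserver_conn != NULL)
        {
            /*
             * PQfinish() sends the frontend Terminate ('X') message before
             * closing the socket, so the peer can tell an orderly shutdown
             * apart from an abruptly dropped connection.
             */
            PQfinish(pageserver_conn);
            pageserver_conn = NULL;
        }
    }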

shanyp commented 1 year ago

@vadim2404 added the right labels for you to follow up

petuhovskiy commented 1 year ago

The issue still exists and happened recently with our benchmarks on staging (link to Slack thread).

We were running a select-only pgbench and it failed (GitHub job):

Command '['/tmp/neon/pg_install/v14/bin/pgbench', '-S', '-c4', '-T3600', '-P2', '--progress-timestamp', "user=bayandin dbname=neondb host=ep-tight-mode-092778.us-east-2.aws.neon.build options='-cstatement_timeout=0 '"]' returned non-zero exit status 2.

stderr is the following: (allure report)

progress: 1684906121.512 s, 2949.5 tps, lat 1.356 ms stddev 0.375
progress: 1684906123.512 s, 2933.5 tps, lat 1.363 ms stddev 0.449
progress: 1684906125.512 s, 2068.5 tps, lat 1.403 ms stddev 0.484
progress: 1684906127.514 s, 0.0 tps, lat 0.000 ms stddev 0.000
progress: 1684906129.514 s, 0.0 tps, lat 0.000 ms stddev 0.000
pgbench: error: client 0 script 0 aborted in command 1 query 0: ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "pageserver-99.us-east-2.aws.neon.build" (10.6.81.23), port 6400 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?
pgbench: error: client 2 script 0 aborted in command 1 query 0: ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "pageserver-99.us-east-2.aws.neon.build" (10.6.81.23), port 6400 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?
pgbench: error: client 3 script 0 aborted in command 1 query 0: ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "pageserver-99.us-east-2.aws.neon.build" (10.6.81.23), port 6400 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?
pgbench: error: client 1 script 0 aborted in command 1 query 0: ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "pageserver-99.us-east-2.aws.neon.build" (10.6.81.23), port 6400 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?
pgbench: fatal: Run was aborted; the above results are incomplete.

Project is orange-heart-876188, link to admin page.

In the pageserver logs we can see that the pageserver received a shutdown request at 2023-05-24T05:28:44.962, probably because of the redeploy. Here are the corresponding pageserver logs (REVERSE ORDER, NEWEST FIRST):

2023-05-24T05:28:45.694284Z  INFO layer flush task{tenant=f1628a5a92a5d5056bc4f2cc187e7f1b timeline=131f29c7be5b44716a7e9bfd5cfd2b56}:flush_frozen_layer{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56 layer=inmem-0000001A63CFDE21-0000001A6CBFECB1}: Rebuilt layer map. Did 1 insertions to process a batch of 1 updates.
2023-05-24T05:28:44.974673Z  INFO walreceiver_connection{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56 node_id=60}: Replication stream finished: connection closed
2023-05-24T05:28:44.974613Z  INFO wal_connection_manager{tenant=f1628a5a92a5d5056bc4f2cc187e7f1b timeline=131f29c7be5b44716a7e9bfd5cfd2b56}: Connection manager loop ended, shutting down
2023-05-24T05:28:44.974608Z  INFO wal_connection_manager{tenant=f1628a5a92a5d5056bc4f2cc187e7f1b timeline=131f29c7be5b44716a7e9bfd5cfd2b56}: Timeline dropped state updates sender, stopping wal connection manager loop
2023-05-24T05:28:44.974575Z  INFO wal_connection_manager{tenant=f1628a5a92a5d5056bc4f2cc187e7f1b timeline=131f29c7be5b44716a7e9bfd5cfd2b56}: timeline entered terminal state Stopping, stopping wal connection manager loop
2023-05-24T05:28:44.962642Z  INFO handle_pagerequests{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56}: shutdown request received in page handler
2023-05-24T05:28:44.962628Z  INFO handle_pagerequests{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56}: shutdown request received in page handler
2023-05-24T05:28:44.962615Z  INFO handle_pagerequests{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56}: shutdown request received in page handler
2023-05-24T05:28:44.962500Z  INFO handle_pagerequests{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56}: shutdown request received in page handler
2023-05-24T05:28:44.962491Z  INFO handle_pagerequests{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56}: shutdown request received in page handler
2023-05-24T05:28:44.962512Z  INFO handle_pagerequests{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56}: shutdown request received in page handler
2023-05-24T05:28:44.962526Z  INFO handle_pagerequests{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56}: shutdown request received in page handler
2023-05-24T05:28:44.962490Z  INFO handle_pagerequests{tenant_id=f1628a5a92a5d5056bc4f2cc187e7f1b timeline_id=131f29c7be5b44716a7e9bfd5cfd2b56}: shutdown request received in page handler

And we see that the compute->pageserver connections were dropped at the same time, and the pageserver rejected new connections. Compute logs (REVERSE ORDER, NEWEST FIRST):

2023-05-24 05:28:44.966 GMT [339] DETAIL:  connection to server at "pageserver-99.us-east-2.aws.neon.build" (10.6.81.23), port 6400 failed: Connection refused
2023-05-24 05:28:44.966 GMT [339] LOG:  [NEON_SMGR] could not establish connection to pageserver
2023-05-24 05:28:44.966 GMT [340] STATEMENT:  SELECT abalance FROM pgbench_accounts WHERE aid = 221871879;
        Is the server running on that host and accepting TCP/IP connections?
2023-05-24 05:28:44.966 GMT [340] DETAIL:  connection to server at "pageserver-99.us-east-2.aws.neon.build" (10.6.81.23), port 6400 failed: Connection refused
2023-05-24 05:28:44.966 GMT [340] LOG:  [NEON_SMGR] could not establish connection to pageserver
2023-05-24 05:28:44.963 GMT [338] LOG:  [NEON_SMGR] dropping connection to page server due to error
2023-05-24 05:28:44.962 GMT [339] LOG:  [NEON_SMGR] dropping connection to page server due to error
2023-05-24 05:28:44.962 GMT [337] LOG:  [NEON_SMGR] dropping connection to page server due to error
2023-05-24 05:28:44.962 GMT [340] LOG:  [NEON_SMGR] dropping connection to page server due to error
2023-05-24 05:28:37.268 GMT [352] LOG:  [NEON_SMGR] libpagestore: connected to 'host=pageserver-99.us-east-2.aws.neon.build port=6400'

So it looks like the issue here is that a pageserver stop+start takes some time, and if the compute tries to connect to the pageserver during that window, the user still receives an error.

knizhnik commented 1 year ago

There is a limit on how long the compute tries to reconnect to the pageserver: if the pageserver is not back within 5 seconds, the compute gives up and returns an error. Maybe 5 seconds is too small, but infinite waiting is also not acceptable.
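
Roughly what I mean, as a sketch (the constants and the connect helper are illustrative, not the actual neon code):

    #include "postgres.h"
    #include "utils/timestamp.h"

    /* Illustrative names; not the actual neon constants or functions. */
    #define RECONNECT_DEADLINE_USEC   (5 * USECS_PER_SEC)
    #define RECONNECT_INTERVAL_USEC   (100 * 1000)       /* 100 ms between attempts */

    extern bool try_connect_pageserver(void);             /* hypothetical */

    /*
     * Keep retrying the connection until the total deadline expires,
     * then give up with a user-visible error.
     */
    static void
    reconnect_with_deadline(void)
    {
        TimestampTz start = GetCurrentTimestamp();

        for (;;)
        {
            if (try_connect_pageserver())
                return;

            if (GetCurrentTimestamp() - start > RECONNECT_DEADLINE_USEC)
                ereport(ERROR,
                        (errmsg("[NEON_SMGR] could not establish connection to pageserver")));

            pg_usleep(RECONNECT_INTERVAL_USEC);
        }
    }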

arssher commented 1 year ago

Just bumped into this again while importing data into staging pageserver.

SSL SYSCALL error: EOF detected
ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "pageserver-0.eu-west-1.aws.neon.build" (10.10.23.6), port 6400 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?
CONTEXT:  COPY DocumentChunk, line 337420

vadim2404 commented 7 months ago

@knizhnik , @save-buffer

Considering that you've already implemented reconnection, I hope this issue is gone now. Can you confirm?

hlinnaka commented 4 months ago

I'm closing this, as we implemented retries and the original issue was fixed by that.

There's still the question of whether we retry for long enough. Arguably we should retry much longer, e.g. with exponential backoff if the pageserver is down, to avoid aborting user queries for pageserver outages that last a few minutes or more. The counter-argument is that a user might prefer to fail quickly with a clear error message, rather than block for several minutes. But that's a different issue.

save-buffer commented 4 months ago

with exponential backoff if the pageserver is down

We should already be doing exponential backoff:

        /*
         * If we did other tasks between reconnect attempts, then we won't
         * need to wait as long as a full delay.
         */
        if (us_since_last_attempt < shard->delay_us)
        {
            pg_usleep(shard->delay_us - us_since_last_attempt);
        }

        /* update the delay metric */
        shard->delay_us = Min(shard->delay_us * 2, MAX_RECONNECT_INTERVAL_USEC);

And I believe we'll just keep spinning, retrying at MAX_RECONNECT_INTERVAL_USEC (though how does statement_timeout play into this?)
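
Regarding statement_timeout: as far as I understand, it only kicks in if the wait is interruptible, i.e. if we check for interrupts between sleeps. A sketch of what that could look like (not the actual neon code; the helper name is made up):

    #include "postgres.h"
    #include "miscadmin.h"

    /*
     * Illustrative sketch: sleep between reconnect attempts in short slices
     * and service interrupts in between, so that statement_timeout / query
     * cancel can abort an otherwise endless retry loop (the timeout handler
     * sets the interrupt flags, and CHECK_FOR_INTERRUPTS() turns them into
     * an error here).
     */
    static void
    interruptible_backoff_sleep(long sleep_us)
    {
        while (sleep_us > 0)
        {
            long chunk = Min(sleep_us, 10 * 1000L);   /* 10 ms slices */

            CHECK_FOR_INTERRUPTS();                   /* throws on timeout/cancel */
            pg_usleep(chunk);
            sleep_us -= chunk;
        }
    }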