neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech

walproposer ResetConnection takes a lot of time #7534

Open petuhovskiy opened 2 months ago

petuhovskiy commented 2 months ago

We had a long sync-safekeepers on shutdown:

PG:2024-04-25 17:21:55.252 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG:  [WP] received AcceptorGreeting from safekeeper safekeeper-1.us-east-1.aws.neon.tech:6401, term=893
PG:2024-04-25 17:21:55.251 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG:  [WP] connected with node safekeeper-1.us-east-1.aws.neon.tech:6401
PG:2024-04-25 17:21:55.250 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG:  [WP] connecting with node safekeeper-1.us-east-1.aws.neon.tech:6401
PG:2024-04-25 17:21:50.245 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG:  [WP] connecting with node safekeeper-2.us-east-1.aws.neon.tech:6401
PG:2024-04-25 17:21:50.241 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG:  [WP] connecting with node safekeeper-0.us-east-1.aws.neon.tech:6401
2024-04-25T17:21:50.217283Z  INFO syncing safekeepers on shutdown

Most of the time was spent connecting to the nodes: the ~5-second gap falls between the "connecting with node" lines for safekeeper-2 and safekeeper-1, which are emitted from ResetConnection. More specifically, we have:

void
WalProposerStart(WalProposer *wp)
{

    /* Initiate connections to all safekeeper nodes */
    for (int i = 0; i < wp->n_safekeepers; i++)
    {
        ResetConnection(&wp->safekeeper[i]);
    }

    WalProposerLoop(wp);
}

static void
ResetConnection(Safekeeper *sk)
{
    ...
    wp_log(LOG, "connecting with node %s:%s", sk->host, sk->port);
    ...
}

There are 3 possible explanations (a quick standalone name-resolution check for the first one is sketched after this list):

  1. something with the network
  2. something with the client (CPU/RAM/resource exhaustion)
  3. a bug where ResetConnection can get blocked
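
For the network hypothesis, one cheap check from the compute host is to time plain name resolution of a safekeeper endpoint. A minimal standalone sketch (host and port are taken from the log lines above; everything else is illustrative, not walproposer code):

#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

int
main(void)
{
    /* endpoint copied from the walproposer log lines above */
    const char *host = "safekeeper-1.us-east-1.aws.neon.tech";
    const char *port = "6401";

    struct addrinfo  hints;
    struct addrinfo *res = NULL;
    struct timespec  t0, t1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int rc = getaddrinfo(host, port, &hints, &res);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;

    if (rc != 0)
        fprintf(stderr, "getaddrinfo failed: %s\n", gai_strerror(rc));
    else
        freeaddrinfo(res);

    /* anything close to the 5 s gap would point at DNS rather than walproposer */
    printf("resolving %s took %.1f ms\n", host, ms);
    return rc;
}

If resolution is consistently fast, the client-resource and code-bug explanations become more interesting.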

I looked at safekeeper logs:

2024-04-25T17:21:55.256282Z  INFO {cid=562809 ttid=XXX/YYY}: got query "START_WAL_PUSH"
2024-04-25T17:21:52.080006Z  INFO {cid=562810 ttid=QQQ/WWW}: got query "TIMELINE_STATUS"

2024-04-25T17:21:55.254134Z  INFO {cid=564739 ttid=XXX/YYY}: got query "START_WAL_PUSH"
2024-04-25T17:21:52.080974Z  INFO {cid=564740 ttid=QQQ/WWW}: got query "TIMELINE_STATUS"

2024-04-25T17:21:55.251300Z  INFO {cid=562411 ttid=XXX/YYY}: got query "START_WAL_PUSH"
2024-04-25T17:21:55.247237Z  INFO {cid=562410 ttid=QQQ/WWW}: got query "TIMELINE_STATUS"

I fetched the logs for the adjacent cids and concluded that 2 of the 3 safekeepers had accepted the walproposer connection before 17:21:52 (on those two, the next cid was already serving a query at that time).

Steps to reproduce

unknown

Expected result

sync-safekeepers should be quick (<100 ms)

Actual result

syncing safekeepers took more than 5 seconds

Logs, links

jcsp commented 2 months ago

Next steps: add more logs to diagnose it
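
A minimal sketch of what such a log could look like, assuming the INSTR_TIME_* macros from Postgres' portability/instr_time.h are usable here; the timing variables and the extra wp_log line are hypothetical additions, not existing code:

#include "portability/instr_time.h"

void
WalProposerStart(WalProposer *wp)
{
    /* Initiate connections to all safekeeper nodes */
    for (int i = 0; i < wp->n_safekeepers; i++)
    {
        instr_time  start_t;
        instr_time  elapsed_t;

        INSTR_TIME_SET_CURRENT(start_t);
        ResetConnection(&wp->safekeeper[i]);
        INSTR_TIME_SET_CURRENT(elapsed_t);
        INSTR_TIME_SUBTRACT(elapsed_t, start_t);

        /* a stuck ResetConnection would show up directly in the compute log */
        wp_log(LOG, "ResetConnection to %s:%s took %.0f ms",
               wp->safekeeper[i].host, wp->safekeeper[i].port,
               INSTR_TIME_GET_MILLISEC(elapsed_t));
    }

    WalProposerLoop(wp);
}

With something like that in place, a repeat of this incident would show directly which ResetConnection call ate the 5 seconds.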

arssher commented 2 months ago

Very likely it is a DNS resolution hiccup; it is the only blocking place here, as far as I'm aware.
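
That matches the documented libpq behaviour: PQconnectStart / PQconnectStartParams avoid blocking only when hostaddr is supplied so that no DNS query has to be made during connection startup. If DNS turns out to be the culprit, one possible mitigation (a sketch only, not existing walproposer code; the helper name and parameter set are made up) is to resolve the safekeeper hosts up front and pass the result as hostaddr:

#include <libpq-fe.h>

/*
 * Sketch: start a safekeeper connection with a pre-resolved IP passed as
 * "hostaddr", so libpq skips the name lookup when starting the connection.
 * The remaining connection options the walproposer normally sets are elided.
 */
static PGconn *
start_safekeeper_conn(const char *host, const char *hostaddr, const char *port)
{
    const char *keywords[] = {"host", "hostaddr", "port", NULL};
    const char *values[]   = {host, hostaddr, port, NULL};

    /* returns immediately; the caller drives the handshake with PQconnectPoll */
    return PQconnectStartParams(keywords, values, 0);
}

The trade-off is that the pre-resolved addresses would have to be refreshed whenever a safekeeper endpoint changes.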