Open petuhovskiy opened 2 months ago
We had a long sync-safekeepers on shutdown:
PG:2024-04-25 17:21:55.252 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG: [WP] received AcceptorGreeting from safekeeper safekeeper-1.us-east-1.aws.neon.tech:6401, term=893 PG:2024-04-25 17:21:55.251 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG: [WP] connected with node safekeeper-1.us-east-1.aws.neon.tech:6401 PG:2024-04-25 17:21:55.250 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG: [WP] connecting with node safekeeper-1.us-east-1.aws.neon.tech:6401 PG:2024-04-25 17:21:50.245 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG: [WP] connecting with node safekeeper-2.us-east-1.aws.neon.tech:6401 PG:2024-04-25 17:21:50.241 GMT ttid=XXX/YYY sqlstate=00000 [1289] LOG: [WP] connecting with node safekeeper-0.us-east-1.aws.neon.tech:6401 2024-04-25T17:21:50.217283Z INFO syncing safekeepers on shutdown
Most of the time has been spent in connecting to nodes. More specifically, we have:
void WalProposerStart(WalProposer *wp) { /* Initiate connections to all safekeeper nodes */ for (int i = 0; i < wp->n_safekeepers; i++) { ResetConnection(&wp->safekeeper[i]); } WalProposerLoop(wp); } static void ResetConnection(Safekeeper *sk) { ... wp_log(LOG, "connecting with node %s:%s", sk->host, sk->port); ... }
There are 3 possible explanations:
ResetConnection
I looked at safekeeper logs:
2024-04-25T17:21:55.256282Z INFO {cid=562809 ttid=XXX/YYY}: got query "START_WAL_PUSH" 2024-04-25T17:21:52.080006Z INFO {cid=562810 ttid=QQQ/WWW}: got query "TIMELINE_STATUS" 2024-04-25T17:21:55.254134Z INFO {cid=564739 ttid=XXX/YYY}: got query "START_WAL_PUSH" 2024-04-25T17:21:52.080974Z INFO {cid=564740 ttid=QQQ/WWW}: got query "TIMELINE_STATUS" 2024-04-25T17:21:55.251300Z INFO {cid=562411 ttid=XXX/YYY}: got query "START_WAL_PUSH" 2024-04-25T17:21:55.247237Z INFO {cid=562410 ttid=QQQ/WWW}: got query "TIMELINE_STATUS"
I fetched logs for adjacent cid, and made a conclusion that 2/3 safekeepers accepted connection before 17:21:52.
unknown
sync safekeepers should be quick (<100ms)
syncing safekeepers took more than 5 seconds
Next steps: add more logs to diagnose it
Very likely it is DNS resolution hiccup, it is the only blocking place here as far as I'm aware.
We had a long sync-safekeepers on shutdown:
Most of the time has been spent in connecting to nodes. More specifically, we have:
There are 3 possible explanations:
ResetConnection
can get blockedI looked at safekeeper logs:
I fetched logs for adjacent cid, and made a conclusion that 2/3 safekeepers accepted connection before 17:21:52.
Steps to reproduce
unknown
Expected result
sync safekeepers should be quick (<100ms)
Actual result
syncing safekeepers took more than 5 seconds
Logs, links