More problems with pageserver<>safekeeper reconnections

kelvich commented 1 year ago

Citing @arssher :

Interesting stuff happening:

Still many 'safekeeper did not set events for 3s' logs on the pageserver. But now on Arthur's dashboard I see that safekeepers push data fast enough (6ms for a batch of several thousand timelines), and on average the number of pulls on the pageserver matches it. So it looks like a latency issue, not a throughput one: the walreceiver manager sometimes gets stuck somewhere. Another hint in this direction is in the logs: immediately after removing a safekeeper as 'not sending updates', walreceiver always registers it again, and these log records are often even interleaved in loki.
Still a huge number of reconnections after the 2s timeout.
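
To illustrate the latency vs throughput point: a rough sketch (not Neon code) of how a consumer can keep up on average and still trip a "no updates for N seconds" check because of a single stall. The trace, the `analyze_gaps` helper and the `GapReport` type are all made up for illustration; only the 3s threshold comes from the log message above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GapReport:
    mean_interval_ms: float  # average time between processed updates
    max_gap_ms: int          # longest silence between two updates
    stalls: int              # number of gaps longer than the timeout


def analyze_gaps(event_times_ms: Sequence[int], timeout_ms: int) -> GapReport:
    """Summarize gaps between consecutive update timestamps (hypothetical helper)."""
    gaps = [b - a for a, b in zip(event_times_ms, event_times_ms[1:])]
    if not gaps:
        raise ValueError("need at least two events")
    return GapReport(
        mean_interval_ms=sum(gaps) / len(gaps),
        max_gap_ms=max(gaps),
        stalls=sum(1 for g in gaps if g > timeout_ms),
    )


# Made-up trace: an update is processed every 100 ms, except for one 3500 ms stall.
times = [i * 100 for i in range(50)]
times += [times[-1] + 3500 + i * 100 for i in range(50)]

# 3000 ms mirrors the 'did not set events for 3s' threshold (assumed here).
report = analyze_gaps(times, timeout_ms=3000)
print(report)
```

In this trace the mean interval stays well under the timeout, so an averaged dashboard looks healthy, while the single long gap is enough to get the safekeeper dropped and then immediately re-registered.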

shanyp commented 1 year ago

@kelvich from the description, this is a bug, not a new feature

koivunej commented 1 year ago

Is @arssher or @petuhovskiy looking into this, or do they have this on their list?

arssher commented 1 year ago

I'm looking.

neondatabase / neon