There are still many 'safekeeper did not set events for 3s' logs on the pageserver. However, Arthur's dashboard now shows that safekeepers push data fast enough (6ms for a batch of several thousand timelines), and on average the number of pulls on the pageserver matches it. So this looks like a latency issue rather than a throughput one: the walreceiver manager sometimes gets stuck somewhere. Another hint in this direction comes from the logs: immediately after removing a safekeeper for 'not sending updates', walreceiver always registers it again, and these log records are often even interleaved in Loki.
There is also still a huge number of reconnections after the 2s timeout.
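
To illustrate the latency-versus-throughput point above, here is a minimal sketch (not taken from the pageserver code). It assumes we have a list of timestamps at which updates from one safekeeper were observed. The timestamps and the helper names `mean_rate` and `stalls` are hypothetical; only the 3s threshold comes from the log message. Two streams with the same average rate can differ in whether any single gap crosses the threshold.

```python
from typing import List

# Threshold taken from the 'did not set events for 3s' log message.
STALL_THRESHOLD_S = 3.0


def mean_rate(timestamps: List[float]) -> float:
    """Average number of updates per second over the observed window."""
    span = timestamps[-1] - timestamps[0]
    return (len(timestamps) - 1) / span


def stalls(timestamps: List[float], threshold: float = STALL_THRESHOLD_S) -> List[float]:
    """Gaps between consecutive updates that exceed the threshold."""
    return [b - a for a, b in zip(timestamps, timestamps[1:]) if b - a > threshold]


# Hypothetical arrival times in seconds, made up for illustration.
steady = [float(i) for i in range(11)]  # one update per second
bursty = [0.0, 0.5, 1.0, 1.5, 2.0, 6.0, 6.5, 7.0, 8.0, 9.0, 10.0]  # one 4s stall

for name, ts in (("steady", steady), ("bursty", bursty)):
    print(name, "rate:", mean_rate(ts), "stalls:", stalls(ts))
```

Both streams average one update per second, so a throughput dashboard would show them as equal, but only the second one would trip a 3s inactivity check.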
Citing @arssher:
Interesting stuff happening:
https://neondb.slack.com/archives/C039YKBRZB4/p1683137321047759