stepashka opened this issue 10 months ago
@arssher , will your fixes help with first two items?
> slot may disappear on restart (hard to reproduce; only occurred once)
I still don't know what that was. Stas tested manually and observed it once. There is a known path by which a slot might be lost, but it is highly unlikely (the endpoint would have to be killed before the logical message is committed to safekeepers), and Stas's case wasn't like that. This needs more testing and a reproduction.
> in one case replication wasn't able to read WAL (hard to reproduce)
A more accurate description would be: "if a slot is lagging, replication on compute start might fail until the whole WAL tail is downloaded". We merged a cap on the maximum allowed lag, but to really fix this we need to bring on-demand WAL download from safekeepers to logical walsenders. We recently merged the core patch: https://github.com/neondatabase/neon/pull/5948 but using it in logical walsenders is a separate step. It shouldn't be hard, but I haven't started on it.
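For context on the lag cap mentioned above: a slot's lag is the byte distance between the server's current WAL position and the slot's `restart_lsn`. A minimal sketch of that computation, assuming LSNs in the usual Postgres `hi/lo` hex format; the cap value here is illustrative, not Neon's actual limit:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN like '1/16B3740' to an absolute byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def slot_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL held back by the slot: current position minus slot position."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

# Hypothetical 1 GiB cap, standing in for the merged max-allowed-lag limit.
MAX_ALLOWED_LAG = 1 << 30

lag = slot_lag_bytes("1/16B3740", "0/16B3740")
print(lag, lag > MAX_ALLOWED_LAG)  # → 4294967296 True
```

On a live server the two LSNs would come from `pg_current_wal_lsn()` and the slot's row in `pg_replication_slots`.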
@arssher will you work on on-demand WAL download in walsenders? Is it part of this epic [will it block announcing GA for logical replication]?
@kelvich
> slot may disappear on restart (hard to reproduce; only occurred once)

I think nobody has been able to reproduce it so far. A reasonable question: is it part of the epic's scope?
Renamed to "outbound" logical replication. When this Epic was started, it was "only" logical replication, but now there are two distinct types of replication.
Discussion with Stas: improve pageserver performance first
This week:
@tristan957 there is already monitoring on the storage side for aux v2
there is a total aux size metric per timeline
This week:
@tristan957 added tests for LR metrics in compute. PG v17 had broken metrics. Plan: get the list of metrics from AWS.
https://github.com/neondatabase/cloud/issues/17261 failed due to ENOSPC; something else is going on there. Nikita is helping with it.
@tristan957 any progress with this?
So the original problem was that the test was passing at first, then regressed and started failing with ENOSPC. That was already known at the beginning of September.
The plan we discussed was
What was the item we got stuck at? Or am I missing something and there is some additional context here?
> So the original problem was that the test was passing at first, then regressed and started failing with ENOSPC. That was already known at the beginning of September.
I think there was some miscommunication between Stas and myself here. The current problem is that the publisher endpoint will not even start at the moment, which is different from the ENOSPC issue we were previously running into.
> @tristan957 any progress with this?
Last week, I was talking to Nikita K. about what is going on here, because the endpoint was seemingly stuck. He and I came to the conclusion that the compute was failing to retrieve the basebackup from the pageserver due to some AUX file issues. After talking to the storage team, we determined that we should wait for Chi to come back from vacation to get his thoughts.
This week:
- `neon.logical_replication_max_snap_files` for automatic slot removal
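A hypothetical sketch of the policy behind a setting like `neon.logical_replication_max_snap_files`: once a logical slot has accumulated more snapshot files than the limit, it is dropped automatically so it cannot pin WAL and snapshot files forever. The `Slot` type and `slots_to_drop` helper are stand-ins for illustration, not real Neon code:

```python
from dataclasses import dataclass

@dataclass
class Slot:
    name: str
    snap_files: int  # snapshot files attributable to this slot

def slots_to_drop(slots: list[Slot], max_snap_files: int) -> list[str]:
    """Names of slots over the limit; a negative limit disables the check."""
    if max_snap_files < 0:
        return []
    return [s.name for s in slots if s.snap_files > max_snap_files]

print(slots_to_drop([Slot("a", 5), Slot("b", 300)], 256))  # → ['b']
```

The negative-value escape hatch mirrors the common Postgres GUC convention of using -1 to disable a limit.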
DoD
logical replication is no longer in beta on Neon, and wal_level = logical can be enabled by default on all projects on the Neon platform
Follow-ups (out of scope):
Other related tasks and Epics
https://neondb.slack.com/archives/C04DGM6SMTM/p1703091242312799