neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: move outbound logical replication out of Beta #6213

Open stepashka opened 10 months ago

stepashka commented 10 months ago

DoD (Definition of Done)

Logical replication is no longer in Beta on Neon, and `wal_level = logical` can be enabled by default on all projects on the Neon platform.
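For reference, a minimal sketch of what should work out of the box once `wal_level = logical` is the default (the table, publication, and slot names below are illustrative, not taken from this epic):

```sql
-- Confirm the server is ready for outbound logical replication
SHOW wal_level;                       -- expected: logical

-- Publish a table and create a logical slot for a downstream consumer
CREATE TABLE IF NOT EXISTS lr_smoke (id int PRIMARY KEY, payload text);
CREATE PUBLICATION lr_smoke_pub FOR TABLE lr_smoke;
SELECT pg_create_logical_replication_slot('lr_smoke_slot', 'pgoutput');

-- The slot should now be visible and retaining WAL from its restart_lsn
SELECT slot_name, plugin, restart_lsn FROM pg_replication_slots;
```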

### Tasks & bugs to fix
- [ ] https://github.com/neondatabase/neon/issues/6370
- [ ] https://github.com/neondatabase/neon/issues/6371
- [ ] https://github.com/neondatabase/neon/pull/6221
- [ ] https://github.com/neondatabase/neon/issues/6182
- [ ] https://github.com/neondatabase/neon/issues/6229
- [ ] https://github.com/neondatabase/cloud/pull/9015
- [ ] https://github.com/neondatabase/neon/issues/6257
- [ ] https://github.com/neondatabase/neon/issues/7593
- [x] Check that AUX v2 is default for all new tenants on pageserver
- [ ] https://github.com/neondatabase/neon/issues/8349
- [ ] https://github.com/neondatabase/cloud/issues/15226
- [ ] https://github.com/neondatabase/neon/issues/6626
- [x] Figure out observability of lagging publisher (slot retaining a lot of WAL); see the query sketch after this list
- [ ] https://github.com/neondatabase/neon/issues/5885
- [ ] https://github.com/neondatabase/neon/issues/8931
- [ ] https://github.com/neondatabase/cloud/issues/17261
- [ ] https://github.com/neondatabase/neon/issues/8619
- [ ] Logical slots are copied to the replica and prevent WAL truncation, see https://github.com/neondatabase/neon/pull/9425#discussion_r1804820659
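
For the lagging-publisher observability item above, a sketch of a per-slot check that can be run on the publisher itself, independent of any pageserver-side metrics (`wal_status` requires Postgres 13+):

```sql
-- How much WAL each slot is forcing the publisher to retain
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```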

Follow-ups (out of scope):

Other related tasks and Epics

https://neondb.slack.com/archives/C04DGM6SMTM/p1703091242312799

vadim2404 commented 9 months ago

@arssher, will your fixes help with the first two items?

arssher commented 9 months ago

slot may disappear on restart (hard to reproduce, only occurred once)

I still don't know what that was. Stas tested manually and observed this once. There is a known path by which a slot might be lost (the endpoint being killed before the logical message is committed to the safekeepers), but it is highly unlikely, and Stas's case wasn't like that. We need more testing and a reproduction.

in one case replication wasn't able to read WAL (hard to reproduce)

A more accurate description would be: 'if a slot is lagging, replication on compute start might fail until the whole WAL tail is downloaded'. We merged a cap on the maximum allowed lag, but to really fix this we need to bring on-demand WAL download from the safekeepers to the logical walsenders. We recently merged the core patch https://github.com/neondatabase/neon/pull/5948, but using it in the logical walsenders is a separate step. It shouldn't be hard, but I haven't started on that.
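A rough, hand-rolled way to manufacture the lagging-slot scenario described above for testing; the table and slot names are made up, and the row count may need tuning to exceed whatever lag cap is configured:

```sql
-- Create a logical slot that nothing consumes, then push WAL past it
SELECT pg_create_logical_replication_slot('lagging_slot', 'pgoutput');
CREATE TABLE IF NOT EXISTS lag_filler (payload text);
INSERT INTO lag_filler SELECT repeat('x', 1000) FROM generate_series(1, 100000);

-- The gap between the current insert position and the slot's restart_lsn
-- is the WAL tail a restarted compute would have to download
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'lagging_slot';
```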

vadim2404 commented 9 months ago

@arssher, will you work on on-demand WAL download in walsenders? Is it part of this epic (i.e., will it block announcing GA for logical replication)?

@kelvich

slot may disappear on restart (hard to reproduce, only occurred once)

I think nobody has been able to reproduce it so far. A reasonable question: is it part of the epic's scope?
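
A manual check for the disappearing-slot report, assuming (as suggested above) that slot persistence depends on a logical WAL message reaching the safekeepers; the slot and message prefix names are made up:

```sql
-- Before restarting the endpoint
SELECT pg_create_logical_replication_slot('restart_check', 'pgoutput');
SELECT pg_logical_emit_message(true, 'restart_check', 'flush me');  -- emit a committed logical message
SELECT pg_switch_wal();                                             -- nudge the WAL out

-- After the endpoint comes back
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'restart_check';   -- expect exactly one row
```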

andreasscherbaum commented 7 months ago

Renamed to "outbound" logical replication. When this Epic was started, it was "only" logical replication, but now it's two different types of replication.

andreasscherbaum commented 7 months ago

Discussion with Stas: improve pageserver performance first

ololobus commented 3 months ago

This week:

ololobus commented 3 months ago

This week:

ololobus commented 2 months ago

This week:

skyzh commented 2 months ago

@tristan957 there is already monitoring on the storage side for aux v2

https://github.com/neondatabase/neon/blob/6949b45e1795816507f5025a474e15d718e07456/pageserver/src/metrics.rs#L588-L595

there is a total aux size metric per timeline

ololobus commented 1 month ago

This week:

ololobus commented 1 month ago

This week:

ololobus commented 1 month ago

This week:

ololobus commented 1 month ago

This week:

ololobus commented 3 weeks ago

This week:

kelvich commented 1 week ago

@tristan957 added tests for LR metrics in compute. PG v17 had broken metrics. Plan: get the list of metrics from AWS.
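
The exact compute metrics are not listed here; as a sketch, these are the stock Postgres views that outbound logical replication metrics are typically derived from (`pg_stat_replication_slots` exists only in Postgres 14+):

```sql
-- Per-walsender state as seen from the publisher
SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;

-- Per-slot logical decoding statistics (spill/stream counters)
SELECT slot_name, spill_txns, spill_bytes, stream_txns, stream_bytes, total_txns, total_bytes
FROM pg_stat_replication_slots;
```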

https://github.com/neondatabase/cloud/issues/17261 failed due to ENOSPC; something else is going on there. Nikita is helping with it.

ololobus commented 4 days ago

https://github.com/neondatabase/cloud/issues/17261 failed due to ENOSPC; something else is going on there. Nikita is helping with it.

@tristan957 any progress with this?

So the original problem was that the test was passing at first, then regressed and started failing with ENOSPC. That was already known at the beginning of September.

The plan we discussed was:

  1. Repro this failure on staging (just manually rerun the test)
  2. Watch disk usage and figure out what is eating the disk space (see the query sketch after this list)
  3. Is it valid disk usage? Then we have two options:
     3.1. It's something we want to fix -> investigate further and discuss/propose the fix
     3.2. Yes, it's OK -> bump the compute size and add the fixed-size compute flag
  4. Start running the test again
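
For step 2, a sketch of in-database checks that can narrow down what is eating the disk (OS-level tools would still be needed for anything outside the data directory):

```sql
-- Total size of pg_wal on disk
SELECT pg_size_pretty(sum(size)) AS wal_dir_size FROM pg_ls_waldir();

-- Largest relations in the current database
SELECT c.relname, pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
WHERE c.relkind IN ('r', 'i', 'm')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```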

Which item did we get stuck at? Or am I missing something and we have some additional context here?

tristan957 commented 4 days ago

So the original problem was that the test was passing at first, then regressed and started failing with ENOSPC. That was already known at the beginning of September.

I think there was some miscommunication between Stas and myself here. The current problem is that the publisher endpoint will not even start at the moment, which is different from the ENOSPC issue we were previously running into.

@tristan957 any progress with this?

Last week, I was talking to Nikita K. about what is going on here because the endpoint was seemingly stuck. He and I came to the conclusion that the compute was failing to retrieve the basebackup from the pageserver due to some AUX file issues. After talking to the storage team, we determined that we should wait for Chi to come back from vacation to get his thoughts.

ololobus commented 3 days ago

This week: