neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: move outbound logical replication out of Beta #6213

Open stepashka opened 10 months ago

stepashka commented 10 months ago

DoD (Definition of Done)

Logical replication is no longer in Beta on Neon, and `wal_level = logical` can be enabled by default on all projects on the Neon platform.
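For reference, a minimal sketch of what should work out of the box once `wal_level = logical` is the default (the table, publication, and slot names below are illustrative, not taken from this epic):

```sql
-- Confirm the server is ready for outbound logical replication
SHOW wal_level;                       -- expected: logical

-- Publish a table and create a logical slot for a downstream consumer
CREATE TABLE IF NOT EXISTS lr_smoke (id int PRIMARY KEY, payload text);
CREATE PUBLICATION lr_smoke_pub FOR TABLE lr_smoke;
SELECT pg_create_logical_replication_slot('lr_smoke_slot', 'pgoutput');

-- The slot should now be visible and retaining WAL from its restart_lsn
SELECT slot_name, plugin, restart_lsn FROM pg_replication_slots;
```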

### Tasks & bugs to fix
- [ ] https://github.com/neondatabase/neon/issues/6370
- [ ] https://github.com/neondatabase/neon/issues/6371
- [ ] https://github.com/neondatabase/neon/pull/6221
- [ ] https://github.com/neondatabase/neon/issues/6182
- [ ] https://github.com/neondatabase/neon/issues/6229
- [ ] https://github.com/neondatabase/cloud/pull/9015
- [ ] https://github.com/neondatabase/neon/issues/6257
- [ ] https://github.com/neondatabase/neon/issues/7593
- [x] Check that AUX v2 is default for all new tenants on pageserver
- [ ] https://github.com/neondatabase/neon/issues/8349
- [ ] https://github.com/neondatabase/cloud/issues/15226
- [ ] https://github.com/neondatabase/neon/issues/6626
- [x] Figure out observability of lagging publisher (slot retaining a lot of WAL); see the query sketch after this list
- [ ] https://github.com/neondatabase/neon/issues/5885
- [ ] https://github.com/neondatabase/neon/issues/8931
- [ ] https://github.com/neondatabase/cloud/issues/17261
- [ ] https://github.com/neondatabase/neon/issues/8619
- [ ] Logical slots are copied to the replica and prevent WAL truncation, see https://github.com/neondatabase/neon/pull/9425#discussion_r1804820659
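
For the lagging-publisher observability item above, a sketch of a per-slot check that can be run on the publisher itself, independent of any pageserver-side metrics (`wal_status` requires Postgres 13+):

```sql
-- How much WAL each slot is forcing the publisher to retain
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```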

Follow-ups (out of scope):

Other related tasks and Epics

https://neondb.slack.com/archives/C04DGM6SMTM/p1703091242312799

vadim2404 commented 9 months ago

@arssher, will your fixes help with the first two items?

arssher commented 9 months ago

slot may disappear on restart (hard to reproduce, only occurred once)

I still don't know what that was. Stas tested manually and observed this once. There is a known path by which a slot might be lost (the endpoint being killed before the logical message is committed to the safekeepers), but it is highly unlikely, and Stas's case wasn't like that. We need more testing and a reproduction.

in one case replication wasn't able to read WAL (hard to reproduce)

A more accurate description would be: 'if a slot is lagging, replication on compute start might fail until the whole WAL tail is downloaded'. We merged a cap on the maximum allowed lag, but to really fix this we need to bring on-demand WAL download from the safekeepers to the logical walsenders. We recently merged the core patch https://github.com/neondatabase/neon/pull/5948, but using it in the logical walsenders is a separate step. It shouldn't be hard, but I haven't started on that.
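A rough, hand-rolled way to manufacture the lagging-slot scenario described above for testing; the table and slot names are made up, and the row count may need tuning to exceed whatever lag cap is configured:

```sql
-- Create a logical slot that nothing consumes, then push WAL past it
SELECT pg_create_logical_replication_slot('lagging_slot', 'pgoutput');
CREATE TABLE IF NOT EXISTS lag_filler (payload text);
INSERT INTO lag_filler SELECT repeat('x', 1000) FROM generate_series(1, 100000);

-- The gap between the current insert position and the slot's restart_lsn
-- is the WAL tail a restarted compute would have to download
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'lagging_slot';
```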

vadim2404 commented 9 months ago

@arssher, will you work on on-demand WAL download in walsenders? Is it part of this epic (i.e., will it block announcing GA for logical replication)?

@kelvich

slot may disappear on restart (hard to reproduce, only occurred once)

I think nobody has been able to reproduce it so far. A reasonable question: is it part of the epic's scope?
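
A manual check for the disappearing-slot report, assuming (as suggested above) that slot persistence depends on a logical WAL message reaching the safekeepers; the slot and message prefix names are made up:

```sql
-- Before restarting the endpoint
SELECT pg_create_logical_replication_slot('restart_check', 'pgoutput');
SELECT pg_logical_emit_message(true, 'restart_check', 'flush me');  -- emit a committed logical message
SELECT pg_switch_wal();                                             -- nudge the WAL out

-- After the endpoint comes back
SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'restart_check';   -- expect exactly one row
```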

andreasscherbaum commented 7 months ago

Renamed to "outbound" logical replication. When this Epic was started, it was "only" logical replication, but now it's two different types of replication.

andreasscherbaum commented 7 months ago

Discussion with Stas: improve pageserver performance first

ololobus commented 3 months ago

This week:

ololobus commented 3 months ago

This week:

ololobus commented 2 months ago

This week:

skyzh commented 2 months ago

@tristan957 there is already monitoring on the storage side for aux v2

https://github.com/neondatabase/neon/blob/6949b45e1795816507f5025a474e15d718e07456/pageserver/src/metrics.rs#L588-L595

there is a total aux size metric per timeline

ololobus commented 1 month ago

This week:

ololobus commented 1 month ago

This week:

ololobus commented 1 month ago

This week:

ololobus commented 1 month ago

This week:

ololobus commented 3 weeks ago

This week:

kelvich commented 1 week ago

@tristan957 added tests for LR metrics in compute. PG v17 had broken metrics. Plan: get the list of metrics from AWS.
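
The exact compute metrics are not listed here; as a sketch, these are the stock Postgres views that outbound logical replication metrics are typically derived from (`pg_stat_replication_slots` exists only in Postgres 14+):

```sql
-- Per-walsender state as seen from the publisher
SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;

-- Per-slot logical decoding statistics (spill/stream counters)
SELECT slot_name, spill_txns, spill_bytes, stream_txns, stream_bytes, total_txns, total_bytes
FROM pg_stat_replication_slots;
```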

https://github.com/neondatabase/cloud/issues/17261 failed due to ENOSPC; something else is going on there. Nikita is helping with it.

ololobus commented 4 days ago

https://github.com/neondatabase/cloud/issues/17261 failed due to ENOSPC; something else is going on there. Nikita is helping with it.

@tristan957 any progress with this?

So the original problem was that the test was passing at first, then regressed and started failing with ENOSPC. That was already known at the beginning of September.

The plan we discussed was:

  1. Repro this failure on staging (just manually rerun the test)
  2. Watch disk usage and figure out what is eating the disk space (see the query sketch after this list)
  3. Is it valid disk usage? Then we have two options:
     3.1. It's something we want to fix -> investigate further and discuss/propose the fix
     3.2. Yes, it's OK -> bump the compute size and add the fixed-size compute flag
  4. Start running the test again
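
For step 2, a sketch of in-database checks that can narrow down what is eating the disk (OS-level tools would still be needed for anything outside the data directory):

```sql
-- Total size of pg_wal on disk
SELECT pg_size_pretty(sum(size)) AS wal_dir_size FROM pg_ls_waldir();

-- Largest relations in the current database
SELECT c.relname, pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
WHERE c.relkind IN ('r', 'i', 'm')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```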

Which item did we get stuck at? Or am I missing something and we have some additional context here?

tristan957 commented 4 days ago

So the original problem was that the test was passing at first, then regressed and started failing with ENOSPC. That was already known at the beginning of September.

I think there was some miscommunication between Stas and myself here. The current problem is that the publisher endpoint will not even start at the moment, which is different from the ENOSPC issue we were previously running into.

@tristan957 any progress with this?

Last week, I was talking to Nikita K. about what is going on here because the endpoint was seemingly stuck. He and I came to the conclusion that the compute was failing to retrieve the basebackup from the pageserver due to some AUX file issues. After talking to the storage team, we determined that we should wait for Chi to come back from vacation to get his thoughts.

ololobus commented 3 days ago

This week: