neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

slot may disappear on restart (hard to reproduce, only occurred once) #6370

Closed vadim2404 closed 8 months ago

save-buffer commented 8 months ago

Status update:

I asked kelvich how to reproduce this. He said to set max_slot_wal_keep_size to -1, let the slot lag by 8GB, and restart compute. Currently we set max_slot_wal_keep_size to 1024 on compute startup, so I'm not even able to set it to -1. I tried setting it to -1 in a preview environment but haven't been able to reproduce the issue.

    // Right now we download all the WAL files between the slot position and the current
    // WAL position. If the slot is lagging too much, we can download a lot of WAL files
    // and delay the compute startup. So we limit the number of WAL files we download.
    //
    // TODO: remove that once we roll out on-demand WAL download
    pgSettings["max_slot_wal_keep_size"] = "1024"
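
To make the effect of this default concrete, here is a minimal sketch of how the cap relates to slot lag. It assumes stock Postgres semantics for max_slot_wal_keep_size (a value without a unit is in megabytes, and -1 means no limit). The function name `exceedsKeepSize` is hypothetical and not part of the compute startup code.

```go
package main

// exceedsKeepSize reports whether a slot that lags by lagBytes is past the
// max_slot_wal_keep_size cap.
//
// Assumptions (stock Postgres semantics, not taken from the Neon code):
//   - keepSizeMB is in megabytes, as for a setting written without a unit
//   - any negative value (e.g. -1) means "no limit"
func exceedsKeepSize(keepSizeMB int64, lagBytes uint64) bool {
	if keepSizeMB < 0 {
		// Unlimited: the slot may retain any amount of WAL.
		return false
	}
	return lagBytes > uint64(keepSizeMB)*1024*1024
}
```

Under these assumptions, `exceedsKeepSize(1024, 8<<30)` is true while `exceedsKeepSize(-1, 8<<30)` is false, which is why the suggested repro (8GB of lag with the limit disabled) cannot be set up while the startup default of 1024 is in place.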

save-buffer commented 8 months ago

OK, I tried it again in a preview environment with the new pageserver and max_slot_wal_keep_size at -1. Once again I let the slot lag by 8GB of WAL and restarted compute 3 times. Each time I run select * from pg_replication_slots; I get:

    # | slot_name         | plugin   | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase | conflicting
    1 | neon_replication  | wal2json | logical   | 16389  | neondb   | f         | f      |            |      | 1029         | 0/25C1400   | 0/2822140           | extended   |               | f         | f
    2 | wal_proposer_slot |          | physical  |        |          | f         | f      |            |      |              | 2/4EC7C68   |                     | reserved   |               | f         |

So my slot isn't disappearing; not 100% sure what to do here so I'll close this issue for now and we can keep an eye on it.
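
As a sanity check on the "lagged by 8GB" figure, the distance between two LSNs from the query output can be computed by hand. The sketch below parses the textual `X/Y` LSN form into a 64-bit byte position and subtracts. It rests on two assumptions: the default 16 MiB WAL segment size, and that the restart_lsn of wal_proposer_slot is a rough stand-in for the current WAL position (a real check would compare against pg_current_wal_lsn()). The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// walSegmentSize is the default Postgres WAL segment size (assumption:
// the cluster was not initialized with a different segment size).
const walSegmentSize = 16 * 1024 * 1024

// parseLSN converts a textual LSN such as "2/4EC7C68" into a byte position.
// The part before the slash is the high 32 bits, the part after is the low
// 32 bits, both in hexadecimal.
func parseLSN(s string) (uint64, error) {
	hiStr, loStr, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("malformed LSN %q: missing '/'", s)
	}
	hi, err := strconv.ParseUint(hiStr, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("malformed LSN %q: %w", s, err)
	}
	lo, err := strconv.ParseUint(loStr, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("malformed LSN %q: %w", s, err)
	}
	return hi<<32 | lo, nil
}

// lagBytes returns how far the LSN `from` is behind the LSN `to`, in bytes.
// It returns 0 if `from` is not behind `to`.
func lagBytes(from, to string) (uint64, error) {
	f, err := parseLSN(from)
	if err != nil {
		return 0, err
	}
	t, err := parseLSN(to)
	if err != nil {
		return 0, err
	}
	if t < f {
		return 0, nil
	}
	return t - f, nil
}

// segmentsSpanned returns the number of WAL segments needed to cover lag
// bytes, rounding up.
func segmentsSpanned(lag uint64) uint64 {
	return (lag + walSegmentSize - 1) / walSegmentSize
}
```

Calling `lagBytes("0/25C1400", "2/4EC7C68")` from a `main` function or a test gives a distance of slightly more than 8 GiB between the logical slot's restart_lsn and the wal_proposer_slot position, consistent with the lag described above. `segmentsSpanned` then estimates how many WAL files that distance covers.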

andreasscherbaum commented 7 months ago

It's released; we could not reproduce the bug.