Hot standby GUCs mismatch problem

ololobus commented 1 month ago

Even after implementing GUC syncing between primary and replica, we still have edge cases (e.g. the primary is resized up after the replica has started), and they hit us a few times a day.

One of the recent examples: https://neondb.slack.com/archives/C04DGM6SMTM/p1726517989824779

We discussed this with Heikki during the offsite. I think we should just change this code to emit a warning and still try to continue the WAL redo. It should probably be guarded by a GUC so it can be turned off easily if needed. The worst case is probably a crash of the redo process, but then the compute will restart, which is what we wanted anyway :) There could be other edge cases, though.
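To make the proposal concrete, here is a minimal sketch of a "warn instead of fail" check. It is not the Postgres code: `check_required_parameters`, `InsufficientParameterError`, and the `strict` flag (standing in for the proposed on/off GUC) are hypothetical names, and the GUC list contains only the two parameters named in this thread.

```python
import logging

log = logging.getLogger("standby_guc_check")

# Only the GUCs named in this thread; the real set is an assumption here.
CRITICAL_GUCS = ("max_worker_processes", "max_prepared_transactions")


class InsufficientParameterError(Exception):
    """Hypothetical error raised when the replica value is too small."""


def check_required_parameters(primary, replica, strict=True):
    """Compare replica GUC values against the primary's.

    Returns the names of GUCs whose replica value is smaller than the
    primary's. With strict=True the first mismatch raises; with
    strict=False (the proposed behaviour) it is only logged as a warning
    and the caller continues with WAL redo.
    """
    too_small = []
    for name in CRITICAL_GUCS:
        if replica[name] < primary[name]:
            msg = (
                f"{name} = {replica[name]} on replica is lower than "
                f"{primary[name]} on primary"
            )
            if strict:
                raise InsufficientParameterError(msg)
            log.warning(msg)
            too_small.append(name)
    return too_small
```

With `strict=False` the caller gets the list of mismatched GUCs back and keeps replaying; whether redo then survives is exactly the open question raised above.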

knizhnik commented 1 month ago

I did some experiments and noticed that the problem is actually not so easy to reproduce. I tried starting a replica for a primary that has larger values of the critical parameters (max_worker_processes, max_prepared_transactions, ...) and it started normally. So it looks like Postgres checks this only when new GUC values are received through WAL. But none of these GUCs can be changed online: they require a server restart.
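A toy model of that observation, assuming the check really does run only when a parameter-change record arrives through WAL. The `Replica` class and the record layout are hypothetical and exist only to illustrate the timing; they do not mirror Postgres internals.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    settings: dict
    mismatches: list = field(default_factory=list)

    def replay(self, record):
        # In this model only a parameter-change record carries the
        # primary's GUC values, so only it can trigger the comparison.
        if record["kind"] != "parameter_change":
            return
        for name, primary_value in record["values"].items():
            if self.settings[name] < primary_value:
                self.mismatches.append(name)


# The replica starts with a smaller value than the primary: nothing is checked.
replica = Replica(settings={"max_worker_processes": 8})
replica.replay({"kind": "heap_insert"})

# The primary restarts with new settings and the change arrives via WAL.
replica.replay(
    {"kind": "parameter_change", "values": {"max_worker_processes": 16}}
)
```

After the second `replay` call, `replica.mismatches` contains `max_worker_processes`, which corresponds to the point where the running replica would hit the check.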

So the problem arises if we restart the primary with new parameters and do not restart the replica. There are three possible solutions:

1) Force a replica restart whenever the primary is restarted.
2) Ignore GUC updates when replaying WAL on the replica.
3) Just disarm the check that the values of these critical GUCs are not smaller on the replica than on the primary.

Option 3 seems to be the easiest choice, but it requires patching Postgres core (and all the Postgres submodules in our repo). I do not expect any trouble with it: since Postgres already lets a replica start when the primary has larger values of these GUCs, the replica should not be dealing with any structures whose size depends on them.

ololobus commented 1 month ago

Yes, I primarily thought about option 3.

neondatabase / neon

Hot standby GUCs mismatch problem #9023