Open ololobus opened 1 month ago
I did some experiments and noticed that it is actually not so easy to reproduce the problem.
I tried starting a replica for a primary that has larger values of the critical parameters (`max_worker_processes`, `max_prepared_transactions`, ...), and it started normally. So it looks like Postgres checks them only when new GUC values are received through WAL. But none of these GUCs can be changed online: they require a server restart.
So the problem arises if we restart the primary with new parameters and do not restart the replica. There are three possible solutions:
3) seems to be the easiest choice, but it requires patching Postgres core (and all the Postgres submodules in our repo). I do not expect any trouble with it: since Postgres already allows the primary to start with larger values of these GUCs than the replica, the replica should not depend on any structures whose size is derived from these GUCs.
Yes, I primarily thought about option 3.
Even after implementing GUC syncing between primary and replica, we still have edge cases (the primary is resized up after the replica has started), and this hits us a few times a day.
One of the recent examples: https://neondb.slack.com/archives/C04DGM6SMTM/p1726517989824779
We discussed that with Heikki during the offsite. I think we should just change this code to emit a warning and still try to continue the WAL redo. It should probably be guarded by a GUC so that it can be turned off easily if needed. The worst case is probably a crash of the redo process, but then the compute will restart, so that's what we wanted anyway :) There could be some other edge cases, though.
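
To make the proposed behavior concrete, here is a minimal Python sketch of the check, not the actual Postgres code. It assumes the replica compares its own values of the critical parameters against the values received from the primary through WAL. The `strict` flag stands in for the GUC mentioned above; its name, the function name, and the exception type are all hypothetical, and the parameter list contains only the two settings named in this thread.

```python
import logging

log = logging.getLogger("redo_sketch")

# Only the parameters named in this thread; the real list in Postgres is longer.
CRITICAL_PARAMS = ("max_worker_processes", "max_prepared_transactions")


class InsufficientParameterError(RuntimeError):
    """Hypothetical stand-in for the error that currently stops WAL redo."""


def check_required_parameters(replica, primary, strict=True):
    """Compare replica settings with the values received from the primary.

    Returns the names of parameters whose replica value is lower than the
    primary's. With strict=True (current behavior in this sketch) the first
    such parameter raises an error. With strict=False (the proposed behavior,
    controlled by a hypothetical GUC) a warning is logged and redo continues.
    """
    too_small = [name for name in CRITICAL_PARAMS if replica[name] < primary[name]]
    for name in too_small:
        message = (
            f"{name} = {replica[name]} is lower than on the primary, "
            f"where its value was {primary[name]}"
        )
        if strict:
            raise InsufficientParameterError(message)
        log.warning(message)
    return too_small
```

With `strict=False`, a replica that was not restarted after the primary was resized up would log one warning per undersized parameter and keep replaying WAL; if redo later crashes, the compute restart picks up the new values anyway.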