openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Enable Synchronous PostgreSQL Replication for a same site server #1012

Open Firefishy opened 9 months ago

Firefishy commented 9 months ago

PostgreSQL can be configured to replicate commits synchronously across instances. The default configuration for postgres follower instances is asynchronous.

An introductory article on synchronous replication: https://www.crunchydata.com/blog/synchronous-replication-in-postgresql

The synchronous configuration is very flexible, with its own DSL to describe the commit requirement(s). https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-SYNCHRONOUS-STANDBY-NAMES

It is likely not a good idea to have a single specifically named instance as the synchronous secondary, as that would require both the leader and the synchronous follower to be online and in sync for a commit to complete. A better option would likely be something like: "FIRST 1 (srv1, srv2, srv3)"

grischard commented 9 months ago

Would FIRST() mean that if A and B are in one data centre, C and D are in another, and A is the main server, it will sync write to B most of the time, and if there is a problem with B it will gracefully degrade to sync writes to whichever of C and D is faster?

tomhughes commented 9 months ago

That's the theory, yes - it will wait for sync from the first server in the list which is currently connected.
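
The priority behaviour described here can be illustrated with a small model. This is only a sketch of the documented FIRST semantics, not PostgreSQL code: the function name `sync_standbys_first` is hypothetical, and the standby names B, C and D are taken from the question above. It assumes that FIRST falls back by list order rather than by speed, so with B unavailable the next listed standby (C) is chosen even if D would respond faster. Waiting for whichever standby replies first is what the quorum form, ANY, does.

```python
def sync_standbys_first(num_sync, priority_list, connected):
    """Model of FIRST num_sync (...): pick the highest-priority connected standbys.

    priority_list is ordered as written in synchronous_standby_names;
    connected is the set of standbys currently streaming from the leader.
    """
    # Keep list order so that earlier names always win over later ones.
    candidates = [name for name in priority_list if name in connected]
    return candidates[:num_sync]


priority = ["B", "C", "D"]  # A is the leader in the example above

# All standbys healthy: commits wait for B only.
print(sync_standbys_first(1, priority, {"B", "C", "D"}))  # ['B']

# B is down: commits wait for C, the next name in the list.
print(sync_standbys_first(1, priority, {"C", "D"}))  # ['C']
```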

mmd-osm commented 9 months ago

Isn't it a bit dangerous to include other data centers in the list? Back in October we saw up to 4 hours of replication delay across data centers: https://prometheus.openstreetmap.org/d/ST-7bi5Gz/api-database?orgId=1&from=1698144474400&to=1698207987385&viewPanel=8

tomhughes commented 9 months ago

Well that was due to a specific failure, and karm was running then so it wouldn't have been an issue.

If you don't include them and karm fails, nothing will be able to commit at all, as there won't be any replica that counts towards meeting the quota.
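
The trade-off in this comment can be sketched as a simple availability check. This is a rough model, assuming that a commit waits until `num_sync` listed standbys are connected. The function `commit_can_complete` and the names `remote1` and `remote2` are hypothetical placeholders for servers in the other data centre; only karm comes from the thread.

```python
def commit_can_complete(num_sync, standby_names, connected):
    """Return True if enough listed standbys are connected to acknowledge a commit."""
    available = [name for name in standby_names if name in connected]
    return len(available) >= num_sync


# karm has failed; only the (hypothetical) remote standbys are still connected.
connected = {"remote1", "remote2"}

# Only karm listed: no standby counts towards the requirement, so commits block.
print(commit_can_complete(1, ["karm"], connected))  # False

# Remote standbys listed after karm: commits continue, at the cost of
# waiting on a cross-site replica.
print(commit_can_complete(1, ["karm", "remote1", "remote2"], connected))  # True
```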