Most GHES customers run in HA mode, and many of them (especially larger customers) have geo-replication and multiple HA replicas in their configuration. However, we have seen many issues with setup, configuration, teardown, and failover when multiple replicas are involved. For example, GHES HA failover does not work when the primary is unavailable. This hurts the customer experience and lowers the perceived reliability of the system.
Intended Outcome
The work planned for this quarter will make the HA configuration scripts much more reliable, improve error handling, and handle failover scenarios more gracefully.
How will it work?
Improve connection checks and reliability of ghe-repl-* commands
Better error handling for offline replicas
Reliable teardown if the primary or any of the replicas is not accessible
Prevent ghe-config-apply from running if a replica is not fully set up
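
As an illustration of the connection-check item above, here is a minimal sketch of a pre-flight reachability probe with retries. It is not the implementation used by the ghe-repl-* commands. The function names (is_reachable, check_nodes), the host names, and the retry defaults are hypothetical, and port 122 is assumed here to be the administrative SSH port.

```python
import socket
import time


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_nodes(nodes, port=122, retries=3, delay=1.0, probe=is_reachable):
    """Probe each node up to `retries` times; return (online, offline) lists.

    `probe` is injectable so the retry logic can be exercised without a network.
    Port 122 is an assumption for this sketch.
    """
    online, offline = [], []
    for host in nodes:
        ok = False
        for attempt in range(retries):
            if probe(host, port):
                ok = True
                break
            if attempt < retries - 1:
                time.sleep(delay)  # brief pause before the next attempt
        (online if ok else offline).append(host)
    return online, offline
```

A setup or teardown script could call check_nodes(["primary.example", "replica1.example"]) first and then report the offline nodes clearly, or skip them, instead of failing partway through.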