Closed ryanemerson closed 1 month ago
@ryanemerson - I see that the metric vendor_jgroups_site_view_status
is now available in our cluster. It is present on all Infinispan nodes (so all of them are site masters, then?) and it is 1
all the time (even when we take the second site offline while setting up our data). This surprises me a bit, though I may not fully understand the meaning of that metric.
Adding a comment here for interested parties who were not present for our discussion yesterday.
The vendor_jgroups_site_view_status metric represents the status of the JGroups site view. It returns 0 if a site is unreachable, 1 if it's reachable, and 2 if its status is unknown. Marking an Infinispan site offline has no impact on this metric, as that is implemented at a higher level within Infinispan and does not change the JGroups site view.
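For anyone wiring this metric into Prometheus, a minimal alert expression could look like the sketch below. The label name and value are assumptions for illustration; check the labels your deployment actually exposes:

```promql
# Fires when the JGroups site view reports the remote site as unreachable (0).
# The "site" label is hypothetical; adjust to your deployment's labels.
vendor_jgroups_site_view_status{site="site-b"} == 0
```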
In order for us to support Active/Active deployments we need to update the following items in the Keycloak HA guide:
We need to introduce Active/Active equivalents of the two guides above.
We should also add the following procedures:
Update the existing https://www.keycloak.org/high-availability/introduction page to link to dedicated Active/Passive and Active/Active overview pages, each with links to architecture-specific Concepts, Building blocks, and Operational procedures. Many of the building blocks will be reusable, e.g. Deploy Keycloak for HA with the Keycloak Operator
Add the required Active/Active guides
Only include "Multi-site Deployments", "Active/Passive Overview" and "Active/Active Overview" thumbnails at https://www.keycloak.org/guides#high-availability
I've updated the crossdc-tests and associated actions so that the functional tests are executed against both Active/Active and Active/Passive deployments. Because the two deployment types have different semantics and not all tests apply to both, I have created two tag annotations to control which tests are triggered: @ActiveActive and @ActivePassive. For example, FailoverTest#logoutUserWithFailoverTest will fail with Active/Active clusters as it expects a failover to occur from an Active to a Passive cluster.
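As a self-contained sketch of the idea (not the PR's actual code, which builds on JUnit 5's @Tag mechanism), a deployment-type annotation can gate tests like this; the class and method names below are hypothetical:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hand-rolled stand-in for a deployment-type tag; the real suite uses
// JUnit 5 tags rather than this reflection check.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface ActiveActive {}

@ActiveActive
class LogoutFailoverTest {} // illustrative test class, not from the PR

public class TagDemo {
    // Returns true when a test class carries the Active/Active tag,
    // i.e. when it should run against an Active/Active deployment.
    static boolean runsForActiveActive(Class<?> testClass) {
        return testClass.isAnnotationPresent(ActiveActive.class);
    }

    public static void main(String[] args) {
        System.out.println(runsForActiveActive(TagDemo.class));            // prints "false"
        System.out.println(runsForActiveActive(LogoutFailoverTest.class)); // prints "true"
    }
}
```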
Thanks for the review @pruivo. My intention was to add the TODO parts today, I just pushed the "WIP" commit so that I had a backup.
Operational guides added for Take Site Offline and Bring Site Online, as well as a building block to Deploy an AWS Lambda to guard against Split-Brain.
We still need to add operational guides on how to synchronize site state, but I think we first need to decide how users should do that, as they could have conflicting state: there's a window during split-brain where both sites are active (before the split is detected and the STONITH Lambda fires) /cc @pruivo.
The Protostream changes will not land today, so merging this one.
Resolves keycloak/keycloak#29303
Changes
User alert routing enabled on ROSA clusters
PrometheusRule used to trigger AWS Lambda webhook in the event of a split-brain so that only a single site remains in the global accelerator endpoints
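As a rough sketch of such a rule (the name, labels, and threshold below are illustrative assumptions, not the actual rule in this PR), a PrometheusRule firing on the JGroups site-view metric might look like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: site-offline-rule  # hypothetical name
spec:
  groups:
    - name: cluster
      rules:
        - alert: SiteOffline
          # Label selector is an assumption; match your deployment's labels.
          expr: vendor_jgroups_site_view_status{site="site-b"} == 0
          for: 1m
          labels:
            severity: critical
```

Alertmanager can then route alerts with this label to a webhook receiver that invokes the Lambda.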
Global Accelerator scripts refactored to use OpenTofu when creating AWS resources
Task created to deploy/undeploy Active/Active
Task created to simulate split-brain scenarios
'active-active' flag added to GH actions to differentiate between active/passive and active/active deployments
Global Accelerator Provisioning
The global accelerator provisioning uses a hybrid approach for creating AWS resources. The NLBs required for the accelerator endpoints are created via Kubernetes LoadBalancer services in each cluster, as this is much simpler than explicitly provisioning an NLB for each site using OpenTofu. Consequently, the OpenTofu accelerator module simply references these existing NLBs via data sources so that we can add them to the accelerator endpoint group.
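As an illustrative sketch of that hybrid approach (resource names and tag keys are assumptions, not the module's actual code), the data-source lookup plus endpoint registration could look like:

```hcl
# Look up the NLB that the Kubernetes LoadBalancer Service created for one
# site. The tag key/value here are hypothetical; use whatever tags your
# Services apply to their NLBs.
data "aws_lb" "site_a" {
  tags = {
    "kubernetes.io/service-name" = "keycloak/accelerator-loadbalancer"
  }
}

# Register the existing NLB as an endpoint in the accelerator's
# endpoint group instead of provisioning it with OpenTofu.
resource "aws_globalaccelerator_endpoint_group" "this" {
  listener_arn = aws_globalaccelerator_listener.this.id

  endpoint_configuration {
    endpoint_id = data.aws_lb.site_a.arn
    weight      = 50
  }
}
```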
Testing
Inspect the AWS Global Accelerator console and ensure that the endpoint group contains two endpoints, one for each site.
Simulate a split-brain scenario:
Navigate to the OpenShift Console and ensure an alert was fired: go to Observe -> Alerting and apply the "user" filter. A "SiteOffline" alert should have fired.
Inspect the AWS Global Accelerator console and ensure that the endpoint group now only contains a single endpoint.
TODO
Still missing: