Closed ryanemerson closed 1 month ago
@ryanemerson - I see that the metric vendor_jgroups_site_view_status
is now available in our cluster. It is present on all Infinispan nodes (so all of them are site masters, then?) and it is 1
all the time (even when we take the second site offline while setting up our data). This surprises me a bit, though I may not fully understand the meaning of that metric.
Adding a comment here for interested parties who were not present for our discussion yesterday.
The vendor_jgroups_site_view_status metric represents the status of the JGroups site view. It returns 0 if a site is unreachable, 1 if it's reachable, and 2 if its status is unknown. Marking an Infinispan site offline has no impact on this metric, as that is implemented at a higher level within Infinispan and does not change the JGroups site view.
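For anyone wiring this metric into Prometheus, a minimal alert expression could look like the sketch below. The label name and value are assumptions for illustration; check the labels your deployment actually exposes:

```promql
# Fires when the JGroups site view reports the remote site as unreachable (0).
# The "site" label is hypothetical; adjust to your deployment's labels.
vendor_jgroups_site_view_status{site="site-b"} == 0
```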
In order for us to support Active/Active deployments we need to update the following items in the Keycloak HA guide:
We need to introduce Active/Active equivalents of the two guides above.
We should also add the following procedures:
Update the existing https://www.keycloak.org/high-availability/introduction page to link to dedicated Active/Passive and Active/Active overview pages, each with links to architecture-specific Concepts, Building blocks, and Operational procedures. Many of the building blocks will be reusable, e.g. Deploy Keycloak for HA with the Keycloak Operator
Add the required Active/Active guides
Only include "Multi-site Deployments", "Active/Passive Overview" and "Active/Active Overview" thumbnails at https://www.keycloak.org/guides#high-availability
I've updated the crossdc-tests and associated actions so that the functional tests are executed against both Active/Active and Active/Passive deployments. Because the two deployment types have different semantics and not all tests apply to both, I have created two tag annotations to control which tests are triggered: @ActiveActive and @ActivePassive. For example, FailoverTest#logoutUserWithFailoverTest will fail with Active/Active clusters as it expects a failover to occur from an Active to a Passive cluster.
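As a self-contained sketch of the idea (not the PR's actual code, which builds on JUnit 5's @Tag mechanism), a deployment-type annotation can gate tests like this; the class and method names below are hypothetical:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hand-rolled stand-in for a deployment-type tag; the real suite uses
// JUnit 5 tags rather than this reflection check.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface ActiveActive {}

@ActiveActive
class LogoutFailoverTest {} // illustrative test class, not from the PR

public class TagDemo {
    // Returns true when a test class carries the Active/Active tag,
    // i.e. when it should run against an Active/Active deployment.
    static boolean runsForActiveActive(Class<?> testClass) {
        return testClass.isAnnotationPresent(ActiveActive.class);
    }

    public static void main(String[] args) {
        System.out.println(runsForActiveActive(TagDemo.class));            // prints "false"
        System.out.println(runsForActiveActive(LogoutFailoverTest.class)); // prints "true"
    }
}
```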
Thanks for the review @pruivo. My intention was to add the TODO parts today, I just pushed the "WIP" commit so that I had a backup.
Operational guides added for Take Site Offline and Bring Site Online, as well as a building block to Deploy an AWS Lambda to guard against Split-Brain.
We still need to add operational guides on how to synchronize site state, but I think we first need to decide how users should do that, as they could have conflicting state: there's a window during split-brain where both sites are active (before the split is detected and the STONITH Lambda fires) /cc @pruivo.
The Protostream changes will not land today, so merging this one.
Resolves keycloak/keycloak#29303
Changes
User alert routing enabled on ROSA clusters
PrometheusRule used to trigger AWS Lambda webhook in the event of a split-brain so that only a single site remains in the global accelerator endpoints
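As a rough sketch of such a rule (the name, labels, and threshold below are illustrative assumptions, not the actual rule in this PR), a PrometheusRule firing on the JGroups site-view metric might look like:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: site-offline-rule  # hypothetical name
spec:
  groups:
    - name: cluster
      rules:
        - alert: SiteOffline
          # Label selector is an assumption; match your deployment's labels.
          expr: vendor_jgroups_site_view_status{site="site-b"} == 0
          for: 1m
          labels:
            severity: critical
```

Alertmanager can then route alerts with this label to a webhook receiver that invokes the Lambda.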
Global Accelerator scripts refactored to use OpenTofu when creating AWS resources
Task created to deploy/undeploy Active/Active
Task created to simulate split-brain scenarios
'active-active' flag added to GH actions to differentiate between active/passive and active/active deployments
Global Accelerator Provisioning
The global accelerator provisioning uses a hybrid approach for creating AWS resources. The NLBs required for the accelerator endpoints are created via Kubernetes LoadBalancer services in each cluster, as this is much simpler than explicitly provisioning an NLB for each site using OpenTofu. Consequently, the OpenTofu accelerator module simply references these existing NLBs via data sources so that we can add them to the accelerator endpoint group.
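As an illustrative sketch of that hybrid approach (resource names and tag keys are assumptions, not the module's actual code), the data-source lookup plus endpoint registration could look like:

```hcl
# Look up the NLB that the Kubernetes LoadBalancer Service created for one
# site. The tag key/value here are hypothetical; use whatever tags your
# Services apply to their NLBs.
data "aws_lb" "site_a" {
  tags = {
    "kubernetes.io/service-name" = "keycloak/accelerator-loadbalancer"
  }
}

# Register the existing NLB as an endpoint in the accelerator's
# endpoint group instead of provisioning it with OpenTofu.
resource "aws_globalaccelerator_endpoint_group" "this" {
  listener_arn = aws_globalaccelerator_listener.this.id

  endpoint_configuration {
    endpoint_id = data.aws_lb.site_a.arn
    weight      = 50
  }
}
```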
Testing
Inspect the AWS Global Accelerator console and ensure that the endpoint group contains two endpoints, one for each site.
Simulate a split-brain scenario:
Navigate to the OpenShift Console and ensure an alert was fired: go to Observe -> Alerting and apply the "user" filter. A "SiteOffline" alert should have fired.
Inspect the AWS Global Accelerator console and ensure that the endpoint group now only contains a single endpoint.
TODO
Still missing: