keycloak / keycloak-benchmark

Keycloak Benchmark
https://www.keycloak.org/keycloak-benchmark/
Apache License 2.0

Gossip Router stability issues #473

Closed pruivo closed 1 year ago

pruivo commented 1 year ago

We found some connection issues between the Infinispan pods and Gossip Routers. This leads to some partitioned states where some messages go through and others don't.

Multiple Gossip Router instances

The current implementation deploys one Gossip Router pod in each cluster. This provides load balancing, since each Infinispan pod uses both Gossip Routers concurrently in a random fashion, and HA in case one of the Gossip Router pods fails.

During tests, it was observed that some pods lost connection to one of the Gossip Routers leading to unstable communication.

I propose to use a single Gossip Router globally to avoid unstable connections.

Multiple IPs for the same Gossip Router instance

Gossip Router expects a single connection for each Infinispan pod and we found some issues in the past (see JGRP-2722).

The Gossip Router is only able to use one connection to forward messages, meaning that the second connection is for receiving only. Also, the Gossip Router triggers a SUSPECT event when a connection is abruptly closed, which leads to the "suspected" pod being removed from the view. TL;DR: if one connection abruptly closes, the pod is removed from the view although there is a functional connection available. The pod is able to send data through the second connection, but the Gossip Router is unable to send any data to it.
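The behaviour described above can be sketched with a small toy model. This is not JGroups or Gossip Router code: the class `ToyRouter`, the flag `suspect_only_when_no_connection_left`, and the pod and connection names are all hypothetical, chosen only to contrast "suspect on any abrupt close" with "suspect only when no connection is left".

```python
from dataclasses import dataclass, field


@dataclass
class ToyRouter:
    """Toy model of a router tracking connections per pod (not JGroups code)."""

    # Hypothetical switch: False models the behaviour described in the issue,
    # True models one possible fix.
    suspect_only_when_no_connection_left: bool = False
    connections: dict = field(default_factory=dict)  # pod -> set of connection ids
    view: set = field(default_factory=set)           # pods currently in the view

    def connect(self, pod: str, conn_id: str) -> None:
        """Register a connection and make sure the pod is in the view."""
        self.connections.setdefault(pod, set()).add(conn_id)
        self.view.add(pod)

    def abrupt_close(self, pod: str, conn_id: str) -> bool:
        """Drop one connection; return True if the pod was removed from the view."""
        self.connections.get(pod, set()).discard(conn_id)
        still_connected = bool(self.connections.get(pod))
        if self.suspect_only_when_no_connection_left and still_connected:
            return False  # another connection is alive, so do not suspect
        self.view.discard(pod)  # SUSPECT: pod leaves the view
        return True


def pod_stays_in_view(fixed: bool) -> bool:
    """One pod, two connections, one of them closes abruptly."""
    router = ToyRouter(suspect_only_when_no_connection_left=fixed)
    router.connect("pod-a", "conn-1")
    router.connect("pod-a", "conn-2")
    router.abrupt_close("pod-a", "conn-1")
    return "pod-a" in router.view
```

With `fixed=False` the pod drops out of the view even though `conn-2` is still open, which is the situation reported here; with `fixed=True` it stays in the view.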

Fixing this in Gossip Router may improve the stability.

Disable SUSPECT events

In the same context as above, it is possible to disable SUSPECT events when a connection is closed. This may be worth considering. Failure detection would then be based on heartbeats (FD_ALLx).
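As a rough illustration of heartbeat-only failure detection, the sketch below ignores connection-closed events and suspects a member only after its heartbeats have been missing for longer than a timeout. It is a simplified model, not the FD_ALLx implementation; `HeartbeatDetector`, its methods, and the timeout value are assumptions made for the example.

```python
class HeartbeatDetector:
    """Simplified heartbeat-based failure detector (illustrative only)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen = {}  # member -> time of the last heartbeat, in seconds

    def heartbeat(self, member: str, now: float) -> None:
        """Record a heartbeat received from a member."""
        self.last_seen[member] = now

    def connection_closed(self, member: str, now: float) -> None:
        """Deliberately a no-op: a closed connection alone raises no suspicion."""

    def suspects(self, now: float) -> list:
        """Members whose last heartbeat is older than the timeout."""
        return sorted(
            m for m, seen in self.last_seen.items() if now - seen > self.timeout_s
        )


detector = HeartbeatDetector(timeout_s=10.0)
detector.heartbeat("pod-a", now=0.0)
detector.heartbeat("pod-b", now=0.0)
detector.connection_closed("pod-a", now=1.0)  # ignored by design
detector.heartbeat("pod-a", now=5.0)          # pod-a is still alive
```

The trade-off in this model is that a genuinely dead member is only detected after the timeout expires, rather than immediately when its connection closes.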

Skupper to the rescue?

Skupper (Red Hat Application Interconnect) allows two OCP clusters to connect. It is an alternative to using OCP Router to connect pods to the Gossip Router. If it is available in ROSA and it provides a single IP address for the Gossip Router, it may improve the stability of the cross-site connection.

ahus1 commented 1 year ago

@pruivo - reading about Skupper, would this then be "if you want to use cross-DC with Kubernetes, you'll need to use Skupper"?

This might increase the complexity of the setup, and not all organizations would "allow" such a technology in their cluster (the same as they might not "allow" a service mesh).

Let's discuss later today if the use of this tool justifies those costs.

pruivo commented 1 year ago

> @pruivo - reading about Skupper, would this then be "if you want to use cross-DC with Kubernetes, you'll need to use Skupper"?

No! Skupper and Submariner are tools that we want to support in the Infinispan operator. See JDG-6364 (internal).

ahus1 commented 1 year ago

Between the Infinispan pods and the gossip router, there are multiple instances which could terminate a TCP connection, and then the pod would need to (seamlessly) reconnect.

belaban commented 1 year ago

Wrt the connection between pods and the GossipRouter: you can enable heartbeats [1], so the connection will be left open even when there is no regular traffic.

[1] https://issues.redhat.com/browse/JGRP-2634
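To illustrate why heartbeats can keep such a connection open, here is a toy model. It assumes that some intermediary between pod and router closes connections that stay idle longer than a fixed timeout; that is an assumption for the example, as the thread does not state why connections were terminated. The function name and all numbers are hypothetical.

```python
from typing import Optional, Sequence


def connection_survives(
    idle_timeout_s: int,
    duration_s: int,
    traffic_times_s: Sequence[int] = (),
    heartbeat_interval_s: Optional[int] = None,
) -> bool:
    """Return True if no idle gap exceeds the intermediary's idle timeout.

    The connection opens at t=0 and is observed until t=duration_s.
    """
    # Regular traffic inside the observation window.
    events = {t for t in traffic_times_s if 0 <= t <= duration_s}
    # Optional periodic heartbeats.
    if heartbeat_interval_s is not None:
        events.update(range(heartbeat_interval_s, duration_s + 1, heartbeat_interval_s))

    last_activity = 0
    for t in sorted(events) + [duration_s]:
        if t - last_activity > idle_timeout_s:
            return False  # the intermediary would have closed the connection
        last_activity = t
    return True
```

In this model a heartbeat interval shorter than the idle timeout keeps the connection alive during quiet periods, while a longer interval does not help.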

pruivo commented 1 year ago

Linking some related GH issues for the Infinispan operator:
https://github.com/infinispan/infinispan-operator/issues/1856
https://github.com/infinispan/infinispan-operator/issues/1857

ahus1 commented 1 year ago

Current status: Wait for the next version of the Infinispan Operator, then retry with 2 Routers. Running with one Router gives a stable setup.