oracle / coherence

Oracle Coherence Community Edition
https://coherence.community
Universal Permissive License v1.0

Thoughts on running Coherence cluster over multiple availability zones in the cloud... #128

Closed javafanboy closed 2 months ago

javafanboy commented 2 months ago

In some older Coherence documentation I have seen recommendations against running a Coherence cluster over geographically distributed sites unless there is 10 Gbit bandwidth and "extremely low latency" between them.

In the cloud (at least on AWS), availability zones ARE geographically distributed, but they are connected by low-latency dedicated fiber and sit within a limited "metro area" so that databases can replicate synchronously. In my experience the latency between them is in the "low single digit" ms range.

Is this considered "extremely low" from a Coherence point of view or not, and is it REALLY a requirement to go as high as 10 Gbit? There are, after all, still quite few smaller EC2 family members (VMs), and very rarely container instances, that go that high. More typical figures for small EC2 instances are at best ~5 Gbit or even lower, and for "serverless" (= managed) containers (like ECS Fargate) less than 1 Gbit sustained (bursting can be much higher, of course).

We have done some tests running Coherence over 3 AZs on smaller EC2 instances (max ~5 Gbit) and so far it seems to have worked OK, but we have not pushed the cluster that hard in performance tests yet...

An alternative that REALLY results in low latency is of course to use "placement groups" and deploy the whole cluster in one AZ (this can basically get us down to "same rack" latency), but that is MUCH less attractive from an availability point of view. When there are 3+ sufficiently geographically separated and in every imaginable way independent sites available, you really want to take advantage of them if possible...

What experiences can others share? Are you running Coherence over multiple AZs in the cloud and, if so, how has it worked out?
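For anyone wanting to check the "low single digit" ms figure between their own AZs, a minimal sketch of a probe follows. It times TCP connects, which take roughly one network round trip, and reports the median. Everything here is an assumption for illustration: the class name `ConnectLatencyProbe`, the sample count, and the timeout are hypothetical, and this has nothing to do with Coherence's own transport or tooling.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;

// Hypothetical helper, not part of Coherence: times TCP connects to a peer.
final class ConnectLatencyProbe {

    /** Times one TCP connect (roughly one network round trip), in nanoseconds. */
    static long connectNanos(String host, int port, int timeoutMillis) throws IOException {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return System.nanoTime() - start;
        }
    }

    /** Returns the median of the given nanosecond samples, in milliseconds. */
    static double medianMillis(long[] nanos) {
        long[] sorted = nanos.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        double median = (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
        return median / 1_000_000.0;
    }

    public static void main(String[] args) throws IOException {
        String host = args[0];                 // a listening peer in another AZ
        int port = Integer.parseInt(args[1]);  // any open TCP port on that peer
        long[] samples = new long[20];         // sample count is an arbitrary choice
        for (int i = 0; i < samples.length; i++) {
            samples[i] = connectNanos(host, port, 2000);
        }
        System.out.printf("median connect time to %s:%d = %.3f ms%n",
                host, port, medianMillis(samples));
    }
}
```

Run it from a node in one AZ against a listening port on a node in another AZ, then again within the same AZ, and compare the two medians. The median is used rather than the mean so that an occasional slow connect does not dominate the result.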

mgamanho commented 2 months ago

Oracle Cloud has the same concept with ADs (Availability Domains). From a Coherence standpoint, this is fine. The risk you run with the higher latency of such a setup (we call it a "stretch cluster") is that failures may lead to "split-brain", in which cluster nodes form islands if they don't receive heartbeats on time or the TCP ring is broken.

But in our experience with ADs, this has not been much of an issue and failures are dealt with quite cleanly. You can't expect "same rack" levels of performance, but overall it is quite decent and I imagine AWS will perform similarly.
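To make the "islands" idea concrete, here is a toy model, not Coherence's actual membership or death-detection protocol. It assumes two nodes stay in the same island whenever a chain of links connects them and every link in that chain delivers heartbeats within a timeout. The class `HeartbeatIslands`, the latency figures, and the 500 ms timeout are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Toy model only: groups nodes by which peers they can still hear in time.
final class HeartbeatIslands {

    /** Builds a latency matrix from each node's zone: intra-zone vs cross-zone latency. */
    static double[][] matrix(int[] zoneOfNode, double intraMillis, double crossMillis) {
        int n = zoneOfNode.length;
        double[][] latency = new double[n][n];
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                latency[a][b] = (zoneOfNode[a] == zoneOfNode[b]) ? intraMillis : crossMillis;
            }
        }
        return latency;
    }

    /** Returns the islands: groups of nodes linked by heartbeats arriving within the timeout. */
    static List<List<Integer>> islands(double[][] latencyMillis, double timeoutMillis) {
        int n = latencyMillis.length;
        boolean[] seen = new boolean[n];
        List<List<Integer>> result = new ArrayList<>();
        for (int start = 0; start < n; start++) {
            if (seen[start]) {
                continue;
            }
            List<Integer> island = new ArrayList<>();
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(start);
            seen[start] = true;
            while (!stack.isEmpty()) {
                int node = stack.pop();
                island.add(node);
                for (int peer = 0; peer < n; peer++) {
                    // A link counts only if heartbeats arrive in time in both directions.
                    if (!seen[peer]
                            && latencyMillis[node][peer] <= timeoutMillis
                            && latencyMillis[peer][node] <= timeoutMillis) {
                        seen[peer] = true;
                        stack.push(peer);
                    }
                }
            }
            Collections.sort(island);
            result.add(island);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] zones = {0, 0, 1, 1};  // nodes 0,1 in one zone; nodes 2,3 in another
        double timeout = 500.0;      // hypothetical heartbeat timeout in ms
        System.out.println(islands(matrix(zones, 0.2, 2.0), timeout));
        // prints [[0, 1, 2, 3]]
        System.out.println(islands(matrix(zones, 0.2, Double.POSITIVE_INFINITY), timeout));
        // prints [[0, 1], [2, 3]]
    }
}
```

In the first call the cross-zone latency is far below the timeout, so all four nodes form one cluster. In the second the cross-zone link is modeled as down (infinite latency), and each zone becomes its own island, which is the split-brain situation described above.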

javafanboy commented 2 months ago

Thanks - that is in line with our early results!

thegridman commented 2 months ago

We have actually run Coherence clusters across Kubernetes clusters in different regions:

https://medium.com/oracledevs/multi-kubernetes-cluster-connectivity-with-oke-and-cilium-for-stateful-workloads-on-oracle-cloud-763da3139843

As with your tests, the cluster was probably not under really heavy load, but it does work.

javafanboy commented 2 months ago

Thanks for the additional input - I was aware people had run Coherence on k8s but was not sure whether it was in the cloud / over multiple AZs or on-premises / private cloud...