liqotech / liqo

Enable dynamic and seamless Kubernetes multi-cluster topologies
https://liqo.io
Apache License 2.0

Fail to start a distributed DB using unidirectional out-of-band peering #2386

Open rmedina97 opened 6 months ago

rmedina97 commented 6 months ago

What happened:

I attempted to deploy the Liqo example of a stateful application in my hierarchical architecture, which comprises one consumer cluster and two provider clusters. However, only the first pod, db-mariadb-galera-0, starts successfully. The second pod fails to connect to the first and enters a CrashLoopBackOff state. Both pods were scheduled on the provider clusters.

What you expected to happen:

I expected the pods to be able to communicate with each other.

How to reproduce it (as minimally and precisely as possible):

Create three clusters using k3s (with different pod and service CIDRs), peer them with Liqo as one consumer and two providers, and install the example Helm chart.

Anything else we need to know?:

I found a workaround: the entire DB starts only if there is a running pod in the consumer cluster; otherwise, only the first pod starts. Bidirectional peering between every pair of clusters also resolves the issue, but I would prefer to keep the hierarchical structure. I first noticed this problem with the Percona XtraDB operator (another distributed DB application) running three pods. If the pod in the consumer cluster is deleted and rescheduled to a provider cluster, it enters CrashLoopBackOff again, while the other running pods continue to work normally.

Environment:

aleoli commented 6 months ago

Hi @RiccardoStud! For better reproducibility, how did you install the MariaDB Galera cluster? Did you use an operator or a Helm chart? If so, please indicate which one.

rmedina97 commented 6 months ago

Hi @aleoli! I used the Helm chart from the Liqo guide, running the command: helm install db bitnami/mariadb-galera -n liqo-demo -f manifests/values.yaml. I only changed the namespace name to match mine. For additional context, when I deploy the chart using only two of my clusters (one provider and one consumer), it works normally.

fra98 commented 6 months ago

Hi @rmedina97. I reproduced your deployment and can confirm that it does not work with this specific topology. By design, pods on different leaf clusters in Liqo cannot communicate directly using their original IPs; instead, those IPs are remapped onto the external CIDR of the originating cluster. The deployment could still work in some cases:

Please note that a redesigned network will be merged soon, and we will then test distributed DB scenarios again.
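
As a rough illustration of the remapping described above, the following Python sketch models a translation table that assigns each foreign pod IP an address from an external CIDR. This is only a toy model under stated assumptions: the class name ExternalCIDRMapper, the first-free allocation policy, and all addresses are hypothetical and are not taken from Liqo's code or from this deployment.

```python
import ipaddress


class ExternalCIDRMapper:
    """Toy model: assigns one external-CIDR address per foreign pod IP.

    Hypothetical helper for illustration only; not Liqo's implementation.
    """

    def __init__(self, external_cidr: str):
        # Iterator over the usable host addresses of the external CIDR.
        self._free = iter(ipaddress.ip_network(external_cidr).hosts())
        # Original pod IP -> remapped address.
        self._table = {}

    def remap(self, original_ip: str) -> str:
        ip = ipaddress.ip_address(original_ip)
        if ip not in self._table:
            # Assumed policy: take the next unused address.
            self._table[ip] = next(self._free)
        return str(self._table[ip])


# Hypothetical addresses, chosen only for the example.
mapper = ExternalCIDRMapper("10.70.0.0/16")
pod_on_other_leaf = "10.42.1.15"
remapped = mapper.remap(pod_on_other_leaf)
print(pod_on_other_leaf, "->", remapped)
```

In this model, the pod is reachable from the other leaf cluster only through the remapped address, so a peer that is handed the pod's original IP has no route to it.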

rmedina97 commented 6 months ago

Thanks for the comprehensive answer. I will adopt one of the suggested solutions for now.