Open jonathon2nd opened 2 years ago
Hi @jonathon2nd, thank you for trying Liqo!
Based on the tunnel-operator logs:
E0204 21:56:50.468163 1 tunnel-operator.go:467] an error occurred while creating iptables handler: cannot create Liqo default chains: cannot retrieve chains in table -> nat : running [/sbin/iptables -t nat -S --wait]: exit status 3: modprobe: can't change directory to /lib/modules: No such file or directory iptables v1.8.7 (legacy): can't initialize iptables table nat: Table does not exist (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded.
it seems that the iptables module is not loaded on the host where the Liqo-Gateway is running. I would suggest loading the iptables module on the host and checking if the problem is resolved.
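A minimal sketch of loading the module and persisting it across reboots on a systemd host. The file name is an assumption, and CONF_DIR is overridable purely so the snippet can run without root for illustration; on a real host it would be /etc/modules-load.d.

```shell
# Sketch: persist ip_tables via modules-load.d (systemd hosts) and load it now.
# CONF_DIR defaults to a local directory so this runs without root; on a real
# host, set CONF_DIR=/etc/modules-load.d and run as root.
CONF_DIR="${CONF_DIR:-./modules-load.d}"
mkdir -p "$CONF_DIR"
echo ip_tables > "$CONF_DIR/liqo-iptables.conf"   # loaded on every boot
# modprobe -- ip_tables                           # load immediately (needs root)
cat "$CONF_DIR/liqo-iptables.conf"
```

Persisting the module matters because, as discussed later in this thread, a manual modprobe alone does not survive node reboots or re-provisioning.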
After manually running modprobe -- ip_tables on a host to test, the pod runs. But then the controller-manager does not seem to start running right. Was wondering what you thought of that, @alacuku
Thank you!
Probably some of the configuration generated by the k3s provider is wrong. I would suggest uninstalling Liqo and then installing it again.
First, generate the values.yaml file for the helm chart using liqoctl:
liqoctl install k3s --generate-name --only-output-values
Then, in the values.yaml file, change the values for podCIDR, serviceCIDR, and apiServerURL. Here you have the full values of the helm chart: https://doc.liqo.io/installation/chart_values/.
After you have changed the file, you can install Liqo with:
helm install -n liqo --create-namespace liqo liqo/liqo -f values.yaml --dependency-update
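As a sketch, the relevant values can be sanity-checked before running helm install. The heredoc below is only a stand-in for the liqoctl-generated file; the keys mirror the ones quoted later in this thread (podCIDR, serviceCIDR, apiServer.address) and may differ between Liqo chart versions.

```shell
# Stand-in for the liqoctl-generated values.yaml (keys as quoted in this
# thread; verify against your generated file before editing).
VALUES="${VALUES:-./values.yaml}"
cat > "$VALUES" <<'EOF'
podCIDR: 10.42.0.0/16
serviceCIDR: 10.43.0.0/16
apiServer:
  address: https://rancher.example.com/k8s/clusters/c-ds39s
EOF
# Eyeball the three values that must match the target cluster.
grep -E 'podCIDR|serviceCIDR|address' "$VALUES"
```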
@jonathon2nd keep an eye on this issue #1094
Thanks @alacuku ! I will keep an eye on the issue.
So looking at the values, both of these look fine.
podCIDR: 10.42.0.0/16
serviceCIDR: 10.43.0.0/16
I am not sure about the other one:
apiServer:
address: https://rancher.example.com/k8s/clusters/c-ds39s
This directs to the cluster I am deploying to, but I am not sure if Liqo is happy with going through Rancher. I will play around with modifying this.
EDIT:
In addition, the Controller-Manager finally fails with this error.
W0207 21:34:23.095086 1 local-resource-monitor.go:259] No notifier is configured, an update will be lost
E0207 21:37:54.813517 1 main.go:384] unable to start the liqo storage provisioner: Post "https://10.43.0.1:443/api/v1/namespaces": context canceled
It looks like it is able to derive the API URL for the cluster (https://10.43.0.1:443/api/v1/namespaces), so I am not sure why it fails, as this is a valid URL. Curling the URL from a test box shows that it is up:
root@test-7488587f55-k2fk2:/# curl https://10.43.0.1:443/api/v1/namespaces
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
root@test-7488587f55-k2fk2:/# curl https://10.43.0.1:443/api/v1/namespaces -k
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}
Another thing I have noticed is that Calico on the workers starts to freak out when Liqo is installed; we are not in BGP mode, but we are using IPVS mode.
Tue, Feb 8 2022 11:41:24 am | bird: KIF: Invalid interface address 240.1.0.123 for liqo.vxlan
Tue, Feb 8 2022 11:41:26 am | bird: KIF: Invalid interface address 240.1.0.123 for liqo.vxlan
Adding the configuration as specified by https://doc.liqo.io/installation/advanced/ does not change the error or the result.
Also, I have tested changing the address in values.yaml to one of the master IPs; no change in the Controller-Manager error.
EDIT: Welp, adding the change for BGP results in the controller-manager working. I saw the spamming of the first set of lines and assumed it was failing.
W0208 18:47:55.871650 1 local-resource-monitor.go:259] No notifier is configured, an update will be lost
W0208 18:47:55.871730 1 local-resource-monitor.go:259] No notifier is configured, an update will be lost
W0208 18:47:55.871779 1 local-resource-monitor.go:259] No notifier is configured, an update will be lost
W0208 18:47:55.871818 1 local-resource-monitor.go:259] No notifier is configured, an update will be lost
W0208 18:47:55.871837 1 local-resource-monitor.go:259] No notifier is configured, an update will be lost
W0208 18:47:55.871859 1 local-resource-monitor.go:259] No notifier is configured, an update will be lost
I0208 18:49:20.273580 1 main.go:401] starting manager as controller manager
I0208 18:49:20.274344 1 controller.go:810] Starting provisioner controller liqo.io/storage_liqo-controller-manager-764df8dbf9-dt6x5_b1886094-1426-43eb-91bb-a9caa04611cf!
I0208 18:49:20.375105 1 controller.go:859] Started provisioner controller liqo.io/storage_liqo-controller-manager-764df8dbf9-dt6x5_b1886094-1426-43eb-91bb-a9caa04611cf!
For some reason, calico is still complaining
bird: KIF: Invalid interface address 240.1.0.123 for liqo.vxlan
bird: KIF: Invalid interface address 240.1.0.123 for liqo.vxlan
So, to get a working install, for the record:
- modprobe -- ip_tables
For some reason the terraform automation for that step fails; we have not looked into it yet. So I am still confused. I had tried the calico configuration change before, when installing with liqoctl, and it did not work. The helm install must be different in some way, with generating the values first and then installing with them unmodified. Will do some more testing and hopefully clarify more.
Thanks again for the help. It is very much appreciated.
I have two clusters now, and I am attempting to get cluster1 to peer to cluster2, as described in https://doc.liqo.io/usage/peering/
I run generate-add-command on cluster2, then run the resulting command on cluster1, and things seem to be happy:
$ liqoctl add cluster cluster2 --auth-url <REDACTED> --id baa2587f-b46b-4450-9b65-92cef9168680 --token E
I0208 16:56:52.166325 492840 handler.go:53] * Initializing...
I0208 16:56:52.863653 492840 handler.go:64] * Processing Cluster Addition...
Hooray! You have correctly added the cluster cluster2 and activated an outgoing peering towards it.
You can now:
* Check the status of the peering to see when it is completely established.
Every field of the foreigncluster (but IncomingPeering) should be in "Established":
kubectl get foreignclusters cluster2
* Check if the virtual node is correctly created (this should take less than ~30s):
kubectl get nodes liqo-baa2587f-b46b-4450-9b65-92cef9168680
* Ready to go! Let's deploy a simple cross-cluster application using Liqo:
kubectl create ns liqo-demo # Let's create a demo namespace
kubectl label ns liqo-demo liqo.io/enabled=true # Enable Liqo offloading on this namespace (Check out https://doc.liqo.io/usage for more details).
kubectl apply -n liqo-demo -f https://get.liqo.io/app.yaml # Deploy a sample application in the namespace to trigger the offloading.
* For more information about Liqo have a look to: https://doc.liqo.io
However, when I view the ForeignCluster on cluster1, things seem to be hung up.
apiVersion: discovery.liqo.io/v1alpha1
kind: ForeignCluster
metadata:
creationTimestamp: "2022-02-08T23:56:53Z"
finalizers:
- crdReplicator.liqo.io
generation: 2
labels:
discovery.liqo.io/cluster-id: baa2587f-b46b-4450-9b65-92cef9168680
managedFields:
- apiVersion: discovery.liqo.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
.: {}
v:"crdReplicator.liqo.io": {}
manager: crd-replicator
operation: Update
time: "2022-02-08T23:56:53Z"
- apiVersion: discovery.liqo.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:spec:
f:clusterIdentity:
f:clusterID: {}
f:clusterName: {}
f:status:
.: {}
f:peeringConditions: {}
f:tenantNamespace:
.: {}
f:local: {}
f:remote: {}
manager: liqo-controller-manager
operation: Update
time: "2022-02-08T23:56:53Z"
- apiVersion: discovery.liqo.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:labels:
.: {}
f:discovery.liqo.io/cluster-id: {}
f:spec:
.: {}
f:clusterIdentity: {}
f:foreignAuthUrl: {}
f:incomingPeeringEnabled: {}
f:insecureSkipTLSVerify: {}
f:outgoingPeeringEnabled: {}
manager: liqoctl
operation: Update
time: "2022-02-08T23:56:53Z"
name: cluster2
resourceVersion: "111296"
uid: be1c92f8-79a4-46b9-b7b4-e8fed309d08c
spec:
clusterIdentity:
clusterID: baa2587f-b46b-4450-9b65-92cef9168680
clusterName: cluster2
foreignAuthUrl: <REDACTED>
incomingPeeringEnabled: Auto
insecureSkipTLSVerify: true
outgoingPeeringEnabled: "Yes"
status:
peeringConditions:
- lastTransitionTime: "2022-02-08T23:56:53Z"
message: This ForeignCluster seems to be processable
reason: ForeignClusterProcesssable
status: Success
type: ProcessForeignClusterStatus
- lastTransitionTime: "2022-02-08T23:56:53Z"
message: The Identity has been correctly accepted by the remote cluster
reason: IdentityAccepted
status: Established
type: AuthenticationStatus
- lastTransitionTime: "2022-02-08T23:56:53Z"
message: The remote cluster has not created a ResourceOffer in the Tenant Namespace
liqo-tenant-baa2587f-b46b-4450-9b65-92cef9168680 yet
reason: ResourceRequestPending
status: Pending
type: OutgoingPeering
- lastTransitionTime: "2022-02-08T23:56:53Z"
message: The NetworkConfig has not been found in the Tenant Namespace liqo-tenant-baa2587f-b46b-4450-9b65-92cef9168680
reason: NetworkConfigNotFound
status: None
type: NetworkStatus
- lastTransitionTime: "2022-02-08T23:56:53Z"
message: No ResourceRequest found in the Tenant Namespace liqo-tenant-baa2587f-b46b-4450-9b65-92cef9168680
reason: NoResourceRequest
status: None
type: IncomingPeering
tenantNamespace:
local: liqo-tenant-baa2587f-b46b-4450-9b65-92cef9168680
remote: liqo-tenant-001b1127-104f-4264-b43d-ece5da1bc014
I have not seen any pods with errors, so I am having a hard time figuring out where the hang-up is. Does the above look familiar, @alacuku?
Could you check the logs of the liqo-crd-replicator pod? It seems that the resources are not replicated between the two clusters.
ah I missed it.
hmmm, strange. The URL does load. Seems like we are having some networking troubles; looking into it.
Trace[1235590107]: [30.001642952s] [30.001642952s] END
E0209 15:15:43.540711 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://rancher.example.com/k8s/clusters/c-ff8vz/apis/discovery.liqo.io/v1alpha1/namespaces/liqo-tenant-001b1127-104f-4264-b43d-ece5da1bc014/resourcerequests?labelSelector=liqo.io%2ForiginID%3D001b1127-104f-4264-b43d-ece5da1bc014%2Cliqo.io%2Freplicated%3Dtrue&limit=500&resourceVersion=0": dial tcp: i/o timeout
I0209 15:16:59.225195 1 trace.go:205] Trace[430396588]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167 (09-Feb-2022 15:16:29.223) (total time: 30001ms):
Trace[430396588]: [30.001265665s] [30.001265665s] END
E0209 15:16:59.225239 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://rancher.example.com/k8s/clusters/c-ff8vz/apis/net.liqo.io/v1alpha1/namespaces/liqo-tenant-001b1127-104f-4264-b43d-ece5da1bc014/networkconfigs?labelSelector=liqo.io%2ForiginID%3D001b1127-104f-4264-b43d-ece5da1bc014%2Cliqo.io%2Freplicated%3Dtrue&limit=500&resourceVersion=0": dial tcp: i/o timeout
I0209 15:17:06.813907 1 trace.go:205] Trace[150682120]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167 (09-Feb-2022 15:16:36.812) (total time: 30001ms):
Trace[150682120]: [30.001282293s] [30.001282293s] END
E0209 15:17:06.813952 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://rancher.example.com/k8s/clusters/c-ff8vz/apis/discovery.liqo.io/v1alpha1/namespaces/liqo-tenant-001b1127-104f-4264-b43d-ece5da1bc014/resourcerequests?labelSelector=liqo.io%2ForiginID%3D001b1127-104f-4264-b43d-ece5da1bc014%2Cliqo.io%2Freplicated%3Dtrue&limit=500&resourceVersion=0": dial tcp: i/o timeout
...
E0209 16:10:32.050980 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://rancher.example.com/k8s/clusters/c-ff8vz/apis/net.liqo.io/v1alpha1/namespaces/liqo-tenant-001b1127-104f-4264-b43d-ece5da1bc014/networkconfigs?labelSelector=liqo.io%2ForiginID%3D001b1127-104f-4264-b43d-ece5da1bc014%2Cliqo.io%2Freplicated%3Dtrue&limit=500&resourceVersion=0": dial tcp: i/o timeout
I0209 16:10:46.553130 1 trace.go:205] Trace[171691195]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167 (09-Feb-2022 16:10:16.551) (total time: 30001ms):
Trace[171691195]: [30.001121086s] [30.001121086s] END
E0209 16:10:46.553177 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://rancher.example.com/k8s/clusters/c-ff8vz/apis/discovery.liqo.io/v1alpha1/namespaces/liqo-tenant-001b1127-104f-4264-b43d-ece5da1bc014/resourcerequests?labelSelector=liqo.io%2ForiginID%3D001b1127-104f-4264-b43d-ece5da1bc014%2Cliqo.io%2Freplicated%3Dtrue&limit=500&resourceVersion=0": dial tcp: i/o timeout
I0209 16:11:48.288526 1 trace.go:205] Trace[1206982569]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167 (09-Feb-2022 16:11:18.287) (total time: 30001ms):
Trace[1206982569]: [30.001021664s] [30.001021664s] END
E0209 16:11:48.288583 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://rancher.example.com/k8s/clusters/c-ff8vz/apis/net.liqo.io/v1alpha1/namespaces/liqo-tenant-001b1127-104f-4264-b43d-ece5da1bc014/networkconfigs?labelSelector=liqo.io%2ForiginID%3D001b1127-104f-4264-b43d-ece5da1bc014%2Cliqo.io%2Freplicated%3Dtrue&limit=500&resourceVersion=0": dial tcp: i/o timeout
I0209 16:12:08.293354 1 trace.go:205] Trace[742106364]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167 (09-Feb-2022 16:11:41.778) (total time: 26514ms):
Trace[742106364]: [26.51435703s] [26.51435703s] END
E0209 16:12:08.293402 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://rancher.example.com/k8s/clusters/c-ff8vz/apis/discovery.liqo.io/v1alpha1/namespaces/liqo-tenant-001b1127-104f-4264-b43d-ece5da1bc014/resourcerequests?labelSelector=liqo.io%2ForiginID%3D001b1127-104f-4264-b43d-ece5da1bc014%2Cliqo.io%2Freplicated%3Dtrue&limit=500&resourceVersion=0": dial tcp: lookup rancher.example.com on 10.43.0.10:53: read udp 10.42.239.221:34725->10.43.0.10:53: i/o timeout
Hi @jonathon2nd.
Can you confirm that https://rancher.example.com points to the API server of the remote cluster?
The following log line:
E0209 16:12:08.293402 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.4/tools/cache/reflector.go:167: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: Get "https://rancher.example.com/k8s/clusters/c-ff8vz/apis/discovery.liqo.io/v1alpha1/namespaces/liqo-tenant-001b1127-104f-4264-b43d-ece5da1bc014/resourcerequests?labelSelector=liqo.io%2ForiginID%3D001b1127-104f-4264-b43d-ece5da1bc014%2Cliqo.io%2Freplicated%3Dtrue&limit=500&resourceVersion=0": dial tcp: lookup rancher.example.com on 10.43.0.10:53: read udp 10.42.239.221:34725->10.43.0.10:53: i/o timeout
suggests that something is wrong with the DNS resolution of the remote API server.
I would suggest continuing on the Slack channel, then coming back here to update the issue after we have a solution. What do you think?
That sounds good to me @alacuku. :smile:
Current state: wanted to make a consolidated post to document where I am at, with the assistance of the Liqo devs :clap:
During testing we upgraded to Rancher 2.6.3. I have deployed two RKE k8s 1.21.9 clusters with calico.
These nodes are created with terraform; in the play a few things happen: modprobe -- ip_tables is run, and firewalld is disabled and stopped.
Add to the cluster yaml at creation
kube-controller:
extra_args:
cluster-signing-cert-file: "/etc/kubernetes/ssl/kube-ca.pem"
cluster-signing-key-file: "/etc/kubernetes/ssl/kube-ca-key.pem"
Make the change to calico that Liqo requires: https://doc.liqo.io/installation/advanced/
- name: IP_AUTODETECTION_METHOD
value: skip-interface=liqo.*
Restart all nodes.
I then get the kubeconfigs for each cluster and remove the Rancher management cluster entry from them. I do not want Rancher cluster downtime to affect Liqo on the two downstream clusters.
Then I generate the values for the install with liqoctl install k3s -n cluster1 --only-output-values, and install Liqo with helm: helm install -n liqo --create-namespace liqo liqo/liqo -f values.yaml --dependency-update
At this point I install Liqo on both clusters, generate an enrollment command from cluster2, and run it on cluster1. Everything is installed and set up properly. I then run through the demo, https://doc.liqo.io/gettingstarted/helloworld/test/. But curling the remote pod from a test pod on cluster1 fails:
root@test-59d74889d7-g7lhq:/# curl ${REMOTE_POD_IP}
curl: (28) Failed to connect to 10.41.114.5 port 80: Connection timed out
Ping to the local pod works, but ping to the remote pod fails:
root@test-59d74889d7-g7lhq:/# ping ${REMOTE_POD_IP}
PING 10.41.114.5 (10.41.114.5): 56 data bytes
^C--- 10.41.114.5 ping statistics ---
7 packets transmitted, 0 packets received, 100% packet loss
root@test-59d74889d7-g7lhq:/# ping ${LOCAL_POD_IP}
PING 10.42.239.204 (10.42.239.204): 56 data bytes
64 bytes from 10.42.239.204: icmp_seq=0 ttl=63 time=0.136 ms
64 bytes from 10.42.239.204: icmp_seq=1 ttl=63 time=0.148 ms
^C--- 10.42.239.204 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.136/0.142/0.148/0.000 ms
root@test-59d74889d7-g7lhq:/#
Upon looking at iptables, there is no forward rule for liqo, and the default policy is DROP.
Chain FORWARD (policy DROP 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
1442K 450M cali-FORWARD all -- any any anywhere anywhere /* cali:wUHhoiAYhphO9Mso */
248 20296 KUBE-FORWARD all -- any any anywhere anywhere /* kubernetes forwarding rules */
248 20296 KUBE-SERVICES all -- any any anywhere anywhere ctstate NEW /* kubernetes service portals */
248 20296 KUBE-EXTERNAL-SERVICES all -- any any anywhere anywhere ctstate NEW /* kubernetes externally-visible service portals */
248 20296 DOCKER-USER all -- any any anywhere anywhere
248 20296 DOCKER-ISOLATION-STAGE-1 all -- any any anywhere anywhere
0 0 ACCEPT all -- any docker0 anywhere anywhere ctstate RELATED,ESTABLISHED
0 0 DOCKER all -- any docker0 anywhere anywhere
0 0 ACCEPT all -- docker0 !docker0 anywhere anywhere
0 0 ACCEPT all -- docker0 docker0 anywhere anywhere
219 18292 ACCEPT all -- any any anywhere anywhere /* cali:S93hcgKJrXEqnTfs */ /* Policy explicitly accepted packet. */ mark match 0x10000/0x10000
24 1704 MARK all -- any any anywhere anywhere /* cali:mp77cMpurHhyjLrM */ MARK or 0x10000
We are confused as to how this could be blocking liqo traffic with an ACCEPT all -- any any anywhere anywhere rule in there.
So to test, I ran iptables -A FORWARD -i liqo.vxlan -j ACCEPT on all hosts. After this the demo worked:
root@test-59d74889d7-g7lhq:/# traceroute ${REMOTE_POD_IP}
traceroute to 10.41.114.5 (10.41.114.5), 30 hops max, 60 byte packets
1 10.1.0.125 (10.1.0.125) 1.243 ms 1.150 ms 1.124 ms
2 240.1.0.124 (240.1.0.124) 1.862 ms 1.789 ms 1.710 ms
3 169.254.100.1 (169.254.100.1) 1181.756 ms 1181.675 ms 1181.631 ms
4 10-41-114-5.liqo-demo.liqo-demo.svc.cluster.local (10.41.114.5) 1181.544 ms 1181.491 ms 1181.396 ms
5 10-41-114-5.liqo-demo.liqo-demo.svc.cluster.local (10.41.114.5) 1251.120 ms 1251.044 ms 1250.968 ms
6 10-41-114-5.liqo-demo.liqo-demo.svc.cluster.local (10.41.114.5) 1250.884 ms 1241.800 ms 1241.548 ms
root@test-59d74889d7-g7lhq:/# ping ${REMOTE_POD_IP}
PING 10.41.114.5 (10.41.114.5): 56 data bytes
64 bytes from 10.41.114.5: icmp_seq=0 ttl=59 time=3.099 ms
64 bytes from 10.41.114.5: icmp_seq=1 ttl=59 time=2.464 ms
^C--- 10.41.114.5 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 2.464/2.782/3.099/0.318 ms
root@test-59d74889d7-g7lhq:/# curl ${REMOTE_POD_IP}
<!DOCTYPE html>
<html>
<head>
....
</body>
</html>
root@test-59d74889d7-g7lhq:/#
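For provisioning, the manual rule above can be applied idempotently so repeated runs do not stack duplicate rules. This is only a sketch using iptables -C (check) before -A (append); it is shown as a dry run via an overridable wrapper, since the real commands need root, and it also covers the liqo.host interface that comes up later in this thread.

```shell
# Dry-run sketch: set IPT=iptables (as root, on a real host) to apply for real.
# In dry-run mode only the echoed check commands appear, since echo exits 0.
IPT="${IPT:-echo iptables}"
rules=$(
  for ifc in liqo.vxlan liqo.host; do
    # -C exits non-zero if the rule is absent; only then append it with -A.
    $IPT -C FORWARD -i "$ifc" -j ACCEPT 2>/dev/null \
      || $IPT -A FORWARD -i "$ifc" -j ACCEPT
  done
)
printf '%s\n' "$rules"
```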
However, when we deploy a more complex app, such as postgres-ha, networking does not work: postgres pods on cluster1 cannot reach pods on cluster2.
Going to test out a couple changes, will update when done.
Alright, we got a working setup.
Like I mentioned before, I tested out using changes from Submariner for Liqo when I was trying to get anything working. Change cluster2 CIDRs: https://submariner.io/getting-started/quickstart/managed-kubernetes/rancher/ Add IP pools: https://submariner.io/operations/deployment/calico/
Run another iptables command: iptables -A FORWARD -i liqo.host -j ACCEPT
So for cluster2:
services:
kube-api:
service_cluster_ip_range: 10.45.0.0/16
kube-controller:
cluster_cidr: 10.44.0.0/16
service_cluster_ip_range: 10.45.0.0/16
kubelet:
cluster_domain: cluster.local
cluster_dns_server: 10.45.0.10
And ran these yamls on the appropriate cluster
Run on cluster2
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
name: svccluster1
spec:
cidr: 100.43.0.0/16
natOutgoing: false
disabled: true
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
name: podcluster1
spec:
cidr: 10.42.0.0/16
natOutgoing: false
disabled: true
Run on cluster1
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
name: svccluster2
spec:
cidr: 100.45.0.0/16
natOutgoing: false
disabled: true
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
name: podcluster2
spec:
cidr: 10.44.0.0/16
natOutgoing: false
disabled: true
After that, and setting up two fresh clusters, everything works (mostly). Not only does the demo fully work, but regular apps also work!
Redis: the Redis master fails over across the clusters without issue. Also notice that the headless service contains a list of IPs for the pods, which are identical between the two clusters. This is required (I suspect) for FA/HA to work properly for Redis.
postgres-ha: This also deployed correctly. I wrote data into a table and then killed the master; it failed over to another node on the other cluster, and the data was replicated.
The only issue left that we can see at the moment: nodeports are not quite working right. The nodeports are correct on cluster1, and they show up on cluster2, but seem to flip between what I assign them and random ones. However, I also see that the IPs are flipping as well between cluster1 and cluster2.
Another update: the CIDR change for cluster2 is not necessary. During testing I redid the clusters with terraform, and I added the two iptables rules to the conf:
- modprobe -- ip_tables
- iptables -A FORWARD -i liqo.host -j ACCEPT
- iptables -A FORWARD -i liqo.vxlan -j ACCEPT
Even though those interfaces do not exist at VM creation, it does not error, and when the interfaces are added the rules work.
So, I redid the clusters and left the CIDRs alone, and I have not encountered any other issues. The port is still flipping on cluster2, but my Redis Sentinel deployment is working and FA/HA is working.
There is also the minor issue of the calico and cattle-node-agent workloads attempting to deploy onto the liqo-* node. This is not great, because it blocks redeploys of those workloads as long as that pod is stuck. For now I have applied the following to those workloads to stop that, but this is temporary at best, because a Rancher upgrade on these clusters will remove the change.
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
...
- key: liqo.io/type
operator: NotIn
values:
- virtual-node
Otherwise it looks like this
Also, regarding 'Liqo does not support multi-level affinities/node-selection': https://liqo-io.slack.com/archives/C014Z7F25KP/p1642730194018100
This was a bit of a blocker for DBs, as we want pods spread out across as many hosts as possible without crowding. To solve this for our test Redis, I added the following to the redis values.
For worker nodes on cluster1 I added zone:a, then for the liqo-* node AND all workers on cluster2 I added zone:b.
The topologyKey: zone ensures that a close-to-equal number of pods is deployed on both the main and foreign clusters. The topologyKey: kubernetes.io/hostname with ScheduleAnyway is the special sauce that allows pods to pile up on what cluster1 sees as a single node, liqo-*. Then, when the pods land on cluster2, they are scheduled as desired.
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app.kubernetes.io/name: redis
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app.kubernetes.io/name: redis
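A hypothetical sketch of how the zone labels assumed by these constraints could be applied. The worker node names are placeholders; the virtual node name follows the one shown earlier in this thread. Shown as a dry run via an echo wrapper, since the real commands need a live cluster.

```shell
# Dry-run sketch: set KUBECTL=kubectl to label nodes on a real cluster.
KUBECTL="${KUBECTL:-echo kubectl}"
labels=$(
  # cluster1 workers get zone=a (node names are placeholders)
  for n in cluster1-worker1 cluster1-worker2; do
    $KUBECTL label node "$n" zone=a --overwrite
  done
  # the liqo-* virtual node gets zone=b; on cluster2, its own workers
  # would be labeled zone=b the same way
  $KUBECTL label node liqo-baa2587f-b46b-4450-9b65-92cef9168680 zone=b --overwrite
)
printf '%s\n' "$labels"
```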
Is your feature request related to a problem? Please describe. I would like to be able to deploy Liqo to a Rancher managed, k8s RKE cluster.
Describe the solution you'd like I would like RKE to be a supported option when installing Liqo, similar to k3s https://doc.liqo.io/installation/?provider=K3s
Describe alternatives you've considered The only install option that functions at all is k3s. When attempting kind or kubeadm, nothing happens.
Additional context The first two options do not do anything.
k3s does install, but the gateway fails to start
When the above is happening on a cluster with Calico, calico gives the following errors. Providing in case they are insightful. The errors occur when applying https://doc.liqo.io/installation/advanced/#calico or not on a fresh cluster.
I have tested with and without IPVS (which we use), and I have tried both flannel and calico (we use calico). I would also like to mention that we use Rocky Linux; I am unsure if that plays a part in the issues I am encountering.
Thank you for developing Liqo. It looks very exciting, and like something we would very much love to use for building out our multi-cluster infra.