submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

Gateway connections broken after suspend/resume of the host running the cluster vm #2571

Closed nirs closed 7 months ago

nirs commented 1 year ago

What happened: Running minikube clusters connected via Submariner using the kvm2 driver (each cluster is a VM). After the host running the minikube VMs is suspended and resumed, the Submariner gateway connection is broken, as shown by:

$ subctl show all --context dr1
GATEWAY   CLUSTER   REMOTE IP       NAT   CABLE DRIVER   SUBNETS        STATUS   RTT avg.     
dr2       dr2       192.168.122.2   no    vxlan          242.1.0.0/16   error    805.702µs    

CLUSTER   ENDPOINT IP       PUBLIC IP       CABLE DRIVER   TYPE     
dr1       192.168.122.207   109.186.6.184   vxlan          local    
dr2       192.168.122.2     109.186.6.184   vxlan          remote   

NODE   HA STATUS   SUMMARY                                  
dr1    active      0 connections out of 1 are established   

    Discovered network details via Submariner:
        Network plugin:  kindnet
        Service CIDRs:   [10.96.0.0/12]
        Cluster CIDRs:   [10.244.0.0/16]
        Global CIDR:     242.0.0.0/16

COMPONENT                       REPOSITORY           VERSION   
submariner-gateway              quay.io/submariner   0.15.1    
submariner-routeagent           quay.io/submariner   0.15.1    
submariner-globalnet            quay.io/submariner   0.15.1    
submariner-operator             quay.io/submariner   0.15.1    
submariner-lighthouse-agent     quay.io/submariner   0.15.1    
submariner-lighthouse-coredns   quay.io/submariner   0.15.1
$ subctl diagnose all --context dr1
 ✓ Checking Submariner support for the Kubernetes version
 ✓ Kubernetes version "v1.26.3" is supported

 ✓ Globalnet deployment detected - checking if globalnet CIDRs overlap
 ✓ Clusters do not have overlapping globalnet CIDRs
 ✓ Checking DaemonSet "submariner-gateway"
 ✓ Checking DaemonSet "submariner-routeagent"
 ✓ Checking DaemonSet "submariner-globalnet"
 ✓ Checking DaemonSet "submariner-metrics-proxy"
 ✓ Checking Deployment "submariner-lighthouse-agent"
 ✓ Checking Deployment "submariner-lighthouse-coredns"
 ✓ Checking the status of all Submariner pods
 ✓ Checking if gateway metrics are accessible from non-gateway nodes
 ✓ Skipping this check as it's a single node cluster
 ✓ Checking if globalnet metrics are accessible from non-gateway nodes
 ✓ Skipping this check as it's a single node cluster

 ✓ Checking Submariner support for the CNI network plugin
 ✓ The detected CNI network plugin ("kindnet") is supported
 ✗ Checking gateway connections
 ✗ Connection to cluster "dr2" is not established. Connection details:
{
  "status": "error",
  "statusMessage": "Failed to successfully ping the remote endpoint IP \"242.1.255.254\"",
  "endpoint": {
    "cluster_id": "dr2",
    "cable_name": "submariner-cable-dr2-192-168-122-2",
    "hostname": "dr2",
    "subnets": [
      "242.1.0.0/16"
    ],
    "private_ip": "192.168.122.2",
    "public_ip": "109.186.6.184",
    "nat_enabled": true,
    "backend": "vxlan",
    "backend_config": {
      "natt-discovery-port": "4490",
      "preferred-server": "false",
      "udp-port": "4500"
    }
  },
  "usingIP": "192.168.122.2",
  "latencyRTT": {
    "last": "251.438µs",
    "min": "174.751µs",
    "average": "805.702µs",
    "max": "285.334872ms",
    "stdDev": "10.2885ms"
  }
}
 ✓ Checking Submariner support for the kube-proxy mode 
 ✓ The kube-proxy mode is supported
 ✓ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed
 ✓ Skipping this check as it's a single node cluster
 ✓ Checking Globalnet configuration
 ✓ Globalnet is properly configured and functioning

 ✓ Checking if services have been exported properly
 ✓ All services have been exported properly

Skipping inter-cluster firewall check as it requires two kubeconfigs. Please run "subctl diagnose firewall inter-cluster" command manually.

subctl version: v0.15.1

Same output when running on the other cluster (dr2).

The connection between the clusters does not heal itself.

What you expected to happen: The connection between the clusters should handle errors gracefully and heal itself after they occur.

How to reproduce it (as minimally and precisely as possible): Start 3 minikube clusters:

minikube start -p hub --driver kvm2 --network default --cni kindnet
minikube start -p dr1 --driver kvm2 --network default --cni kindnet
minikube start -p dr2 --driver kvm2 --network default --cni kindnet

Deploy the broker on the hub:

subctl deploy-broker --context hub --globalnet

Connect clusters dr1 and dr2 to the broker:

subctl join broker-info.subm --context dr1 --clusterid dr1 --cable-driver vxlan
subctl join broker-info.subm --context dr2 --clusterid dr2 --cable-driver vxlan

Wait until all deployments in the submariner-operator namespace are rolled out.
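The rollout wait can be scripted rather than checked by hand. The following is a minimal sketch, not part of the original report: the function name `wait_for_deployments` and the default timeout are assumptions for illustration, while `kubectl get -o name` and `kubectl rollout status` are standard kubectl commands.

```shell
# Wait for every Deployment in a namespace to finish rolling out.
# Usage: wait_for_deployments CONTEXT [NAMESPACE] [TIMEOUT]
# (hypothetical helper; namespace defaults to submariner-operator)
wait_for_deployments() {
    ctx=$1
    ns=${2:-submariner-operator}
    timeout=${3:-300s}
    # "kubectl get -o name" prints one "deployment.apps/<name>" per line.
    for deploy in $(kubectl get deployments --context "$ctx" -n "$ns" -o name); do
        # Stop at the first Deployment that fails to roll out in time.
        kubectl rollout status "$deploy" --context "$ctx" -n "$ns" \
            --timeout="$timeout" || return 1
    done
}
```

For example, `wait_for_deployments dr1 && wait_for_deployments dr2` would block until both joined clusters are rolled out.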

Wait until subctl show all returns exit code 0 - all connections are ok.

subctl show all --context hub
subctl show all --context dr1
subctl show all --context dr2
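Waiting for exit code 0 can be done with a small polling loop. This is a sketch under stated assumptions: the helper name `wait_for_connections`, the attempt count, and the delay are illustrative and not from the report; the only Submariner command it uses is the `subctl show all --context` invocation shown above.

```shell
# Poll "subctl show all" until it exits 0 or the attempts run out.
# Usage: wait_for_connections CONTEXT [ATTEMPTS] [DELAY_SECONDS]
# (hypothetical helper; defaults are arbitrary)
wait_for_connections() {
    ctx=$1
    attempts=${2:-30}
    delay=${3:-10}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Only the exit code matters here, so the output is discarded.
        if subctl show all --context "$ctx" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "connections on $ctx not established after $attempts attempts" >&2
    return 1
}
```

Running `wait_for_connections dr1` after a resume would also make the failure described in this issue visible as a non-zero exit instead of a hang.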

Test connectivity: I deployed nginx on both clusters, exported the service, accessed it from the other cluster, and then deleted the deployment.

Suspend the host running the VMs.

Wait 35 minutes (waiting 1 minute did not reproduce the issue).

Wake up the host.

Run subctl show all or subctl diagnose all again; they show the errors above.

Anything else we need to know?:

@aswinsuryan suggested deleting the gateway pods:

$ kubectl get pod -n submariner-operator --context dr1
NAME                                             READY   STATUS    RESTARTS   AGE
submariner-gateway-vsv49                         1/1     Running   0          101m
submariner-globalnet-h24wl                       1/1     Running   0          101m
submariner-lighthouse-agent-6d9cb95b8d-rbbjc     1/1     Running   0          101m
submariner-lighthouse-coredns-56b9d48584-9mfvk   1/1     Running   0          101m
submariner-lighthouse-coredns-56b9d48584-xwfgh   1/1     Running   0          101m
submariner-metrics-proxy-c5z88                   2/2     Running   0          101m
submariner-operator-ccc86dcd6-6bsgg              1/1     Running   0          101m
submariner-routeagent-njdhn                      1/1     Running   0          101m

$ kubectl delete pod submariner-gateway-vsv49 -n submariner-operator --context dr1
pod "submariner-gateway-vsv49" deleted

$ kubectl get pod -n submariner-operator --context dr1
NAME                                             READY   STATUS    RESTARTS   AGE
submariner-gateway-fjjjn                         1/1     Running   0          4s
submariner-globalnet-h24wl                       1/1     Running   0          101m
submariner-lighthouse-agent-6d9cb95b8d-rbbjc     1/1     Running   0          101m
submariner-lighthouse-coredns-56b9d48584-9mfvk   1/1     Running   0          101m
submariner-lighthouse-coredns-56b9d48584-xwfgh   1/1     Running   0          101m
submariner-metrics-proxy-c5z88                   2/2     Running   0          101m
submariner-operator-ccc86dcd6-6bsgg              1/1     Running   0          101m
submariner-routeagent-njdhn                      1/1     Running   0          101m

This did not change anything; subctl show all still shows an error:

$ subctl show all --context dr1
 ✓ Detecting broker(s)
 ✓ No brokers found

 ✓ Showing Connections
GATEWAY   CLUSTER   REMOTE IP       NAT   CABLE DRIVER   SUBNETS        STATUS   RTT avg.   
dr2       dr2       192.168.122.2   no    vxlan          242.1.0.0/16   error    0s         

 ✓ Showing Endpoints
CLUSTER   ENDPOINT IP       PUBLIC IP       CABLE DRIVER   TYPE     
dr1       192.168.122.207   109.186.6.184   vxlan          local    
dr2       192.168.122.2     109.186.6.184   vxlan          remote   

 ✓ Showing Gateways
NODE   HA STATUS   SUMMARY                                  
dr1    active      0 connections out of 1 are established   

 ✓ Showing Network details
    Discovered network details via Submariner:
        Network plugin:  kindnet
        Service CIDRs:   [10.96.0.0/12]
        Cluster CIDRs:   [10.244.0.0/16]
        Global CIDR:     242.0.0.0/16

 ✓ Showing versions
COMPONENT                       REPOSITORY           VERSION   
submariner-gateway              quay.io/submariner   0.15.1    
submariner-routeagent           quay.io/submariner   0.15.1    
submariner-globalnet            quay.io/submariner   0.15.1    
submariner-operator             quay.io/submariner   0.15.1    
submariner-lighthouse-agent     quay.io/submariner   0.15.1    
submariner-lighthouse-coredns   quay.io/submariner   0.15.1    
$ subctl show all --context dr2
 ✓ Detecting broker(s)
 ✓ No brokers found

 ✓ Showing Connections
GATEWAY   CLUSTER   REMOTE IP         NAT   CABLE DRIVER   SUBNETS        STATUS   RTT avg.      
dr1       dr1       192.168.122.207   no    vxlan          242.0.0.0/16   error    16.061827ms   

 ✓ Showing Endpoints
CLUSTER   ENDPOINT IP       PUBLIC IP       CABLE DRIVER   TYPE     
dr2       192.168.122.2     109.186.6.184   vxlan          local    
dr1       192.168.122.207   109.186.6.184   vxlan          remote   

 ✓ Showing Gateways
NODE   HA STATUS   SUMMARY                                  
dr2    active      0 connections out of 1 are established   

 ✓ Showing Network details
    Discovered network details via Submariner:
        Network plugin:  kindnet
        Service CIDRs:   [10.96.0.0/12]
        Cluster CIDRs:   [10.244.0.0/16]
        Global CIDR:     242.1.0.0/16

 ✓ Showing versions
COMPONENT                       REPOSITORY           VERSION   
submariner-gateway              quay.io/submariner   0.15.1    
submariner-routeagent           quay.io/submariner   0.15.1    
submariner-globalnet            quay.io/submariner   0.15.1    
submariner-operator             quay.io/submariner   0.15.1    
submariner-lighthouse-agent     quay.io/submariner   0.15.1    
submariner-lighthouse-coredns   quay.io/submariner   0.15.1    

Delete the gateway pod on the other cluster:

$ kubectl get pod -n submariner-operator --context dr2
NAME                                            READY   STATUS    RESTARTS   AGE
submariner-gateway-9lccs                        1/1     Running   0          103m
submariner-globalnet-47lcc                      1/1     Running   0          103m
submariner-lighthouse-agent-7666b5749f-w4qff    1/1     Running   0          103m
submariner-lighthouse-coredns-7fbf847d9-9b522   1/1     Running   0          103m
submariner-lighthouse-coredns-7fbf847d9-sgz46   1/1     Running   0          103m
submariner-metrics-proxy-wk8k7                  2/2     Running   0          103m
submariner-operator-ccc86dcd6-5fpvd             1/1     Running   0          103m
submariner-routeagent-fbc7r                     1/1     Running   0          103m

$ kubectl delete pod submariner-gateway-9lccs -n submariner-operator --context dr2
pod "submariner-gateway-9lccs" deleted

$ kubectl get pod -n submariner-operator --context dr2
NAME                                            READY   STATUS    RESTARTS   AGE
submariner-gateway-9khzt                        1/1     Running   0          7s
submariner-globalnet-47lcc                      1/1     Running   0          103m
submariner-lighthouse-agent-7666b5749f-w4qff    1/1     Running   0          103m
submariner-lighthouse-coredns-7fbf847d9-9b522   1/1     Running   0          103m
submariner-lighthouse-coredns-7fbf847d9-sgz46   1/1     Running   0          103m
submariner-metrics-proxy-wk8k7                  2/2     Running   0          103m
submariner-operator-ccc86dcd6-5fpvd             1/1     Running   0          103m
submariner-routeagent-fbc7r                     1/1     Running   0          103m

Now subctl shows that the clusters are connected again:

$ subctl show all --context dr2
 ✓ Detecting broker(s)
 ✓ No brokers found

 ✓ Showing Connections
GATEWAY   CLUSTER   REMOTE IP         NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.     
dr1       dr1       192.168.122.207   no    vxlan          242.0.0.0/16   connected   433.456µs    

 ✓ Showing Endpoints
CLUSTER   ENDPOINT IP       PUBLIC IP       CABLE DRIVER   TYPE     
dr2       192.168.122.2     109.186.6.184   vxlan          local    
dr1       192.168.122.207   109.186.6.184   vxlan          remote   

 ✓ Showing Gateways
NODE   HA STATUS   SUMMARY                               
dr2    active      All connections (1) are established   

 ✓ Showing Network details
    Discovered network details via Submariner:
        Network plugin:  kindnet
        Service CIDRs:   [10.96.0.0/12]
        Cluster CIDRs:   [10.244.0.0/16]
        Global CIDR:     242.1.0.0/16

 ✓ Showing versions
COMPONENT                       REPOSITORY           VERSION   
submariner-gateway              quay.io/submariner   0.15.1    
submariner-routeagent           quay.io/submariner   0.15.1    
submariner-globalnet            quay.io/submariner   0.15.1    
submariner-operator             quay.io/submariner   0.15.1    
submariner-lighthouse-agent     quay.io/submariner   0.15.1    
submariner-lighthouse-coredns   quay.io/submariner   0.15.1    

$ subctl show all --context dr1
 ✓ Detecting broker(s)
 ✓ No brokers found

 ✓ Showing Connections
GATEWAY   CLUSTER   REMOTE IP       NAT   CABLE DRIVER   SUBNETS        STATUS      RTT avg.     
dr2       dr2       192.168.122.2   no    vxlan          242.1.0.0/16   connected   425.475µs    

 ✓ Showing Endpoints
CLUSTER   ENDPOINT IP       PUBLIC IP       CABLE DRIVER   TYPE     
dr1       192.168.122.207   109.186.6.184   vxlan          local    
dr2       192.168.122.2     109.186.6.184   vxlan          remote   

 ✓ Showing Gateways
NODE   HA STATUS   SUMMARY                               
dr1    active      All connections (1) are established   

 ✓ Showing Network details
    Discovered network details via Submariner:
        Network plugin:  kindnet
        Service CIDRs:   [10.96.0.0/12]
        Cluster CIDRs:   [10.244.0.0/16]
        Global CIDR:     242.0.0.0/16

 ✓ Showing versions
COMPONENT                       REPOSITORY           VERSION   
submariner-gateway              quay.io/submariner   0.15.1    
submariner-routeagent           quay.io/submariner   0.15.1    
submariner-globalnet            quay.io/submariner   0.15.1    
submariner-operator             quay.io/submariner   0.15.1    
submariner-lighthouse-agent     quay.io/submariner   0.15.1    
submariner-lighthouse-coredns   quay.io/submariner   0.15.1    

And the connectivity test works again.
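The workaround above can be sketched as one helper that restarts the gateway on every cluster. Note that this swaps in `kubectl rollout restart` on the `submariner-gateway` DaemonSet instead of deleting pods by name; both recreate the gateway pod. The function name `restart_gateways` and the 120s timeout are assumptions. The report only shows recovery after both gateways were restarted, so the sketch restarts all given contexts; whether restarting dr2 alone would be enough was not tested.

```shell
# Restart the gateway DaemonSet in each given context and wait for each
# rollout to finish before moving on.
# Usage: restart_gateways CONTEXT...   (hypothetical helper)
restart_gateways() {
    for ctx in "$@"; do
        kubectl rollout restart daemonset/submariner-gateway \
            --context "$ctx" -n submariner-operator || return 1
        kubectl rollout status daemonset/submariner-gateway \
            --context "$ctx" -n submariner-operator --timeout=120s || return 1
    done
}
```

Calling `restart_gateways dr1 dr2` corresponds to the two manual `kubectl delete pod` steps shown earlier.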

Environment:

yboaron commented 1 year ago

@nirs, so after restarting the gateway pods on both clusters the connection recovered, right?

nirs commented 1 year ago

@nirs, so after restarting the gateway pods on both clusters the connection recovered, right?

Right

yboaron commented 1 year ago

I think I understand the root cause: it seems that the ip rules and ip route tables 100 and 150 don't exist on cluster dr2.
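One way to look at this state is to dump the policy-routing rules and the two tables from inside each minikube VM. This is only a sketch: it assumes the rules and tables live in the node's own network namespace and can therefore be read with `minikube ssh`, and the helper name `dump_policy_routing` is made up for illustration. It prints the raw `ip` output for manual comparison between dr1 and dr2 rather than deciding on its own whether a table is missing.

```shell
# Print "ip rule" and the given routing tables from a minikube node.
# Usage: dump_policy_routing PROFILE TABLE...   (hypothetical helper)
dump_policy_routing() {
    profile=$1
    shift
    echo "== $profile: ip rule show =="
    minikube ssh -p "$profile" -- ip rule show
    for table in "$@"; do
        echo "== $profile: ip route show table $table =="
        minikube ssh -p "$profile" -- ip route show table "$table"
    done
}
```

Running `dump_policy_routing dr1 100 150` and `dump_policy_routing dr2 100 150` before and after a suspend/resume cycle would show whether the entries disappear on one side.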

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
