submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

Submariner looks like it is in working condition, but ping from one pod is not going to other site pod #3169

Open BhavaniYalamanchili opened 2 weeks ago

BhavaniYalamanchili commented 2 weeks ago

ISSUE DESCRIPTION

Submariner looks healthy according to the show all and diagnose all commands. However, when we pinged a pod at the remote site from a pod at the local site, the ping failed with 100% packet loss.

Setup:

Site 1: OCP upgrade from 4.13 to 4.14 is in progress

Site 2: OCP 4.12

Submariner version: v0.16.3

Outputs collected:

show all and diagnose all command outputs for Site 1

sh-4.4$ /root/.local/bin/subctl show all --kubeconfig /tmp/local-kubeconfig
Cluster "local-config"
✓ Detecting broker(s)
NAMESPACE               NAME                COMPONENTS                        GLOBALNET   GLOBALNET CIDR   DEFAULT GLOBALNET SIZE   DEFAULT DOMAINS   
submariner-k8s-broker   submariner-broker   service-discovery, connectivity   no          242.0.0.0/8      65536                                      
 
✓ Showing Connections
GATEWAY                   CLUSTER   REMOTE IP      NAT   CABLE DRIVER   SUBNETS                        STATUS      RTT avg.
control-1-ru2.ydc-xxxxx   site2     10.56.232.29   no    libreswan      172.31.0.0/16, 10.132.0.0/14   connected   737.086µs
 
✓ Showing Endpoints
CLUSTER   ENDPOINT IP    PUBLIC IP      CABLE DRIVER   TYPE     
site1     10.56.226.29   10.56.226.29   libreswan      local    
site1     10.56.226.30   10.56.226.30   libreswan      local    
site1     10.56.226.31   10.56.226.31   libreswan      local    
site2     10.56.232.29   10.56.232.29   libreswan      remote   
 
✓ Showing Gateways
NODE                      HA STATUS   SUMMARY
control-1-ru2.odc-xxxxx   passive     There are no connections
control-1-ru3.odc-xxxxx   passive     There are no connections
control-1-ru4.odc-xxxxx   active      All connections (1) are established
 
✓ Showing Network details
    Discovered network details via Submariner:
        Network plugin:  OVNKubernetes
        Service CIDRs:   [172.30.0.0/16]
        Cluster CIDRs:   [10.128.0.0/14]
 
✓ Showing versions 
COMPONENT                       REPOSITORY           CONFIGURED   RUNNING                     ARCH    
submariner-gateway              quay.io/submariner   0.16.3       release-0.16-58e6b641a736   amd64   
submariner-routeagent           quay.io/submariner   0.16.3       release-0.16-58e6b641a736   amd64   
submariner-metrics-proxy        quay.io/submariner   0.16.3       release-0.16-03f30315107e   amd64   
submariner-operator             quay.io/submariner   0.16.3       release-0.16-133b18e07a09   amd64   
submariner-lighthouse-agent     quay.io/submariner   0.16.3       release-0.16-487f6296ce9b   amd64   
submariner-lighthouse-coredns   quay.io/submariner   0.16.3       release-0.16-487f6296ce9b   amd64   
 
 
sh-4.4$ /root/.local/bin/subctl diagnose all --kubeconfig /tmp/local-kubeconfig
Cluster "local-config"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.27.13+fd36fb9" is supported
 
✓ Non-Globalnet deployment detected - checking that cluster CIDRs do not overlap 
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking that gateway metrics are accessible from non-gateway nodes
 
✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("OVNKubernetes") is supported
✓ Checking OVN version 
✓ The ovn-nb database version 7.1.0 is supported
✓ Checking gateway connections
✓ Checking Submariner support for the kube-proxy mode
✓ Cluster is running with "OVNKubernetes" CNI which internally implements kube-proxy functionality
✓ Checking that firewall configuration allows intra-cluster VXLAN traffic
 
✓ Checking that services have been exported properly
 
 
 
Skipping inter-cluster firewall check as it requires two kubeconfigs. Please run "subctl diagnose firewall inter-cluster" command manually.

show all and diagnose all command outputs for Site 2

sh-4.4$ /root/.local/bin/subctl show all --kubeconfig /tmp/local-kubeconfig
Cluster "local-config"
✓ Detecting broker(s)
✓ No brokers found
 
✓ Showing Connections
GATEWAY                   CLUSTER   REMOTE IP      NAT   CABLE DRIVER   SUBNETS                        STATUS      RTT avg.
control-1-ru4.odc-xxxxx   site1     10.56.226.31   no    libreswan      172.30.0.0/16, 10.128.0.0/14   connected   853.339µs
 
✓ Showing Endpoints
CLUSTER   ENDPOINT IP    PUBLIC IP      CABLE DRIVER   TYPE     
site2     10.56.232.29   10.56.232.29   libreswan      local    
site1     10.56.226.31   10.56.226.31   libreswan      remote   
site2     10.56.232.30   10.56.232.30   libreswan      local    
site2     10.56.232.31   10.56.232.31   libreswan      local    
 
✓ Showing Gateways
NODE                      HA STATUS   SUMMARY
control-1-ru2.ydc-xxxxx   active      All connections (1) are established
control-1-ru3.ydc-xxxxx   passive     There are no connections
control-1-ru4.ydc-xxxxx   passive     There are no connections
 
✓ Showing Network details
    Discovered network details via Submariner:
        Network plugin:  OVNKubernetes
        Service CIDRs:   [172.31.0.0/16]
        Cluster CIDRs:   [10.132.0.0/14]
 
✓ Showing versions 
COMPONENT                       REPOSITORY           CONFIGURED   RUNNING                     ARCH    
submariner-gateway              quay.io/submariner   0.16.3       release-0.16-58e6b641a736   amd64   
submariner-routeagent           quay.io/submariner   0.16.3       release-0.16-58e6b641a736   amd64   
submariner-metrics-proxy        quay.io/submariner   0.16.3       release-0.16-03f30315107e   amd64   
submariner-operator             quay.io/submariner   0.16.3       release-0.16-133b18e07a09   amd64   
submariner-lighthouse-agent     quay.io/submariner   0.16.3       release-0.16-487f6296ce9b   amd64   
submariner-lighthouse-coredns   quay.io/submariner   0.16.3       release-0.16-487f6296ce9b   amd64   
 
 
sh-4.4$ /root/.local/bin/subctl diagnose all --kubeconfig /tmp/local-kubeconfig
Cluster "local-config"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.25.16+a4e782e" is supported
 
✓ Non-Globalnet deployment detected - checking that cluster CIDRs do not overlap 
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking that gateway metrics are accessible from non-gateway nodes
 
✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("OVNKubernetes") is supported
✓ Checking OVN version 
✓ The ovn-nb database version 7.0.4 is supported
✓ Checking gateway connections
✓ Checking Submariner support for the kube-proxy mode
✓ Cluster is running with "OVNKubernetes" CNI which internally implements kube-proxy functionality
✓ Checking that firewall configuration allows intra-cluster VXLAN traffic
 
✓ Checking that services have been exported properly
 
 
Skipping inter-cluster firewall check as it requires two kubeconfigs. Please run "subctl diagnose firewall inter-cluster" command manually.
 
BhavaniYalamanchili commented 2 weeks ago

The issue is now resolved after reinstalling Submariner.

However, this raises some questions:

  1. As the outputs show, every check was green, yet there was still a connectivity problem. We only found it by pinging between the pods. Is there another way to detect this, such as running additional commands?
  2. If connectivity is broken, why doesn't the error appear in the show all or diagnose all output?

vthapar commented 2 weeks ago

@BhavaniYalamanchili You can run subctl diagnose, which performs a much more thorough diagnostic. subctl show mostly provides high-level information about the deployment; it cannot troubleshoot end-to-end connectivity.

We also recommend running subctl verify, which runs the Submariner E2E tests and covers all scenarios, including the pod-to-pod datapath.
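For anyone landing here with the same symptom, the two checks mentioned in this thread (the inter-cluster firewall check that diagnose all skips with a single kubeconfig, and the E2E connectivity tests) could be wrapped roughly as below. This is a minimal sketch, not an official procedure: the context names site1 and site2 and the run and check_datapath helpers are assumptions made for illustration, and the flag names are the ones I recall from recent subctl releases, so confirm them with subctl verify --help and subctl diagnose firewall inter-cluster --help for your version.

```shell
#!/bin/sh
# Sketch only: "site1" and "site2" are assumed kubeconfig context names.
LOCAL_CTX="${LOCAL_CTX:-site1}"
REMOTE_CTX="${REMOTE_CTX:-site2}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: run both cross-cluster checks, stopping at the first failure.
check_datapath() {
  # Inter-cluster firewall check; it needs access to both clusters.
  run subctl diagnose firewall inter-cluster \
    --context "$LOCAL_CTX" --remotecontext "$REMOTE_CTX" || return 1
  # E2E connectivity tests, which exercise pod-to-pod traffic across clusters.
  run subctl verify --context "$LOCAL_CTX" --tocontext "$REMOTE_CTX" \
    --only connectivity --verbose
}
```

Calling check_datapath with DRY_RUN=1 only prints the two commands, which is a cheap way to review them before running them against real clusters. Note that subctl verify deploys temporary test pods in both clusters, so it is heavier than show or diagnose.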

BhavaniYalamanchili commented 1 week ago

Thanks @vthapar for the reply. We will try running this command and share the results, along with any issues we find.

dfarrell07 commented 1 day ago

@BhavaniYalamanchili Hope that helped. Can we close this issue?