submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

Help: cannot establish connectivity between two k3s clusters #1583

Closed: ccwalterhk closed this issue 2 years ago

ccwalterhk commented 3 years ago

Hi, I followed the instructions on submariner.io, but I cannot establish connectivity between two K3s clusters. When I run the connectivity verification, it gives me the output below (output-1).

output-1.txt

When I run the subctl show command, I get the output below (output-2).

output-2.txt

Below is my kubectl get node output (output-3).

output-3.txt

ccwalterhk commented 3 years ago

Additional info:

walter@uat6-server:~$ export KUBECONFIG=kubeconfig.cluster-a
walter@uat6-server:~$ kubectl config use-context cluster-a
Switched to context "cluster-a".

walter@uat6-server:~$ kubectl get namespace submariner-k8s-broker
NAME                    STATUS   AGE
submariner-k8s-broker   Active   15h

walter@uat6-server:~$ kubectl get crds | grep -iE 'submariner|multicluster.x-k8s.io'
submariners.submariner.io              2021-10-30T12:43:16Z
servicediscoveries.submariner.io       2021-10-30T12:43:16Z
brokers.submariner.io                  2021-10-30T12:43:16Z
serviceimports.multicluster.x-k8s.io   2021-10-30T12:43:31Z
serviceexports.multicluster.x-k8s.io   2021-10-30T12:43:31Z
clusters.submariner.io                 2021-10-30T12:43:33Z
endpoints.submariner.io                2021-10-30T12:43:33Z
gateways.submariner.io                 2021-10-30T12:43:33Z
clusterglobalegressips.submariner.io   2021-10-30T12:43:33Z
globalegressips.submariner.io          2021-10-30T12:43:33Z
globalingressips.submariner.io         2021-10-30T12:43:33Z

walter@uat6-server:~$ kubectl -n submariner-k8s-broker get clusters.submariner.io
No resources found in submariner-k8s-broker namespace.

ccwalterhk commented 3 years ago

walter@uat6-server:~$ kubectl get pod -n submariner-operator
NAME                                             READY   STATUS             RESTARTS   AGE
submariner-operator-745d8c89d8-lh9v8             1/1     Running            2          15h
submariner-routeagent-5vnnb                      1/1     Running            0          52m
submariner-routeagent-4fpfk                      1/1     Running            0          52m
submariner-lighthouse-agent-78cb477567-9g4p7     1/1     Running            0          52m
submariner-lighthouse-coredns-7744cbd5b7-lfw2w   1/1     Running            0          52m
submariner-lighthouse-coredns-7744cbd5b7-s6f6w   1/1     Running            0          52m
submariner-gateway-94wms                         0/1     CrashLoopBackOff   15         52m
walter@uat6-server:~$

ccwalterhk commented 3 years ago

walter@uat6-server:~$ kubectl describe pod submariner-gateway-94wms -n submariner-operator
Name:         submariner-gateway-94wms
Namespace:    submariner-operator
Priority:     0
Node:         uat9-server/192.168.1.72
Start Time:   Sun, 31 Oct 2021 03:48:13 +0000
Labels:       app=submariner-gateway
              controller-revision-hash=86c987c55f
              pod-template-generation=1
Annotations:
Status:       Running
IP:           192.168.1.72
IPs:
  IP:           192.168.1.72
Controlled By:  DaemonSet/submariner-gateway
Containers:
  submariner-gateway:
    Container ID:   containerd://a9c41ce64043f323acaeb7ccd05fd64264334e6f2fc02058808e5ef021ef8c67
    Image:          quay.io/submariner/submariner-gateway:0.11.0
    Image ID:       quay.io/submariner/submariner-gateway@sha256:32af57b3a61f4191fb2fb2d860f045f30a969054d40deaf658b18da04a42706d
    Ports:          4500/UDP, 4490/UDP
    Host Ports:     4500/UDP, 4490/UDP
    Command:        submariner.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Sun, 31 Oct 2021 04:39:57 +0000
      Finished:     Sun, 31 Oct 2021 04:39:59 +0000
    Ready:          False
    Restart Count:  15
    Environment:
      SUBMARINER_NAMESPACE:                      submariner-operator
      SUBMARINER_CLUSTERCIDR:                    10.44.0.0/24
      SUBMARINER_SERVICECIDR:                    10.45.0.0/16
      SUBMARINER_GLOBALCIDR:
      SUBMARINER_CLUSTERID:                      cluster-a
      SUBMARINER_COLORCODES:                     blue
      SUBMARINER_DEBUG:                          false
      SUBMARINER_NATENABLED:                     false
      SUBMARINER_BROKER:                         k8s
      SUBMARINER_CABLEDRIVER:
      BROKER_K8S_APISERVER:                      192.168.1.38:6443
      BROKER_K8S_APISERVERTOKEN:                 eyJhbGciOiJSUzI1NiIsImtpZCI6InZnN1pkcWxIVzdoejBBS1VKcFBGSWVySlppNHJ5RFcwTmFQcXpOZng1LVEifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJzdWJtYXJpbmVyLWs4cy1icm9rZXIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlY3JldC5uYW1lIjoiY2x1c3Rlci1jbHVzdGVyLWEtdG9rZW4tdzRnNWciLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiY2x1c3Rlci1jbHVzdGVyLWEiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiI3YWQxZTJlMS1iNDBkLTQ0N2EtYjhlYy0wMDhiZTVjMmFiZGYiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6c3VibWFyaW5lci1rOHMtYnJva2VyOmNsdXN0ZXItY2x1c3Rlci1hIn0.LMZ1Ju1MyQBXE_BhWRqFMKHrWjXGr0V6nFOslh_eJJSSmpfMJGiCQUe8RTQBnQ-9uqqXbd8PWnmgJuYYMJwH7wE50aW1dcdIt4cZ1eBeCOgC43JqmoMv88AUP4AezcIDxRDHzX4wTVUgZYKN97NBL5hUcB7XYCnqQOZ25D0A_xHG-8dGG9omeI3C9s_78VQl6-wPycttsZRbqnEfketgWyxvWJ0tAhW-hgS7_aNZcII3kp5Wm4HVHwVsFqNcp4T8NJZTdXkFOBmm_aZFCBcQULxtp8KS4Tlexj9zuSToSt5hdO7YeDqBkcdZW1LpiJoQzJT9hJGkqd-Xo8QA788fTA
      BROKER_K8S_REMOTENAMESPACE:                submariner-k8s-broker
      BROKER_K8S_CA:                             LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVIyZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQWpNU0V3SHdZRFZRUUREQmhyTTNNdGMyVnkKZG1WeUxXTmhRREUyTXpVMU56TTFOakF3SGhjTk1qRXhNRE13TURVMU9USXdXaGNOTXpFeE1ESTRNRFUxT1RJdwpXakFqTVNFd0h3WURWUVFEREJock0zTXRjMlZ5ZG1WeUxXTmhRREUyTXpVMU56TTFOakF3V1RBVEJnY3Foa2pPClBRSUJCZ2dxaGtqT1BRTUJCd05DQUFTUHF4K3F6am42Z3FrN2JLcGFQSlhVY2EycWZMMXgrRTB1OFJ3c1cyUzUKZFhkUVVmNGxBZStwUmcwQzZRell1OUZtaEVKb0prdU5uRTRCTmhsYTNiaXJvMEl3UURBT0JnTlZIUThCQWY4RQpCQU1DQXFRd0R3WURWUjBUQVFIL0JBVXdBd0VCL3pBZEJnTlZIUTRFRmdRVXhlQXhXVTVIR3E4ekNYdDBndVVjCjcvS2p3UVF3Q2dZSUtvWkl6ajBFQXdJRFJ3QXdSQUlnSnlDOXVSRFdwNzBUM1J3WWFDT3BpWHBFWndpMFBIWVEKUkQwdDBrS293RTBDSUM2RlY0UTRWM0hxczBKMTFHcGcyOXRvL2JGUEwzaEVYUm5Qb0hZeUM1bEUKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
      CE_IPSEC_PSK:                              GrDUpPL9nHLVevwxmHtkPvYCXdVz19E7cX728FjXRht1FD940VbgoTf11/6/+4qZ
      CE_IPSEC_DEBUG:                            false
      SUBMARINER_HEALTHCHECKENABLED:             true
      SUBMARINER_HEALTHCHECKINTERVAL:            1
      SUBMARINER_HEALTHCHECKMAXPACKETLOSSCOUNT:  5
      NODE_NAME:                                  (v1:spec.nodeName)
      POD_NAME:                                  submariner-gateway-94wms (v1:metadata.name)
      CE_IPSEC_IKEPORT:                          500
      CE_IPSEC_NATTPORT:                         4500
      CE_IPSEC_PREFERREDSERVER:                  false
      CE_IPSEC_FORCEENCAPS:                      false
    Mounts:
      /etc/ipsec.d from ipsecd (rw)
      /lib/modules from libmodules (ro)
      /var/lib/ipsec/nss from ipsecnss (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hnwdm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  ipsecd:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  ipsecnss:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  libmodules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  kube-api-access-hnwdm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              submariner.io/gateway=true
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason       Age                    From               Message
  ----     ------       ----                   ----               -------
  Normal   Scheduled    54m                    default-scheduler  Successfully assigned submariner-operator/submariner-gateway-94wms to uat9-server
  Warning  FailedMount  54m                    kubelet            MountVolume.SetUp failed for volume "kube-api-access-hnwdm" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulled       52m (x5 over 54m)      kubelet            Container image "quay.io/submariner/submariner-gateway:0.11.0" already present on machine
  Normal   Created      52m (x5 over 54m)      kubelet            Created container submariner-gateway
  Normal   Started      52m (x5 over 54m)      kubelet            Started container submariner-gateway
  Warning  BackOff      3m52s (x230 over 53m)  kubelet            Back-off restarting failed container

ccwalterhk commented 3 years ago

walter@uat6-server:~$ sudo kubectl logs submariner-gateway-gkfgw -n submariner-operator --kubeconfig=kubeconfig.cluster-b

SoftTissues commented 3 years ago

Hello, I ran into the same problem. Did you solve it?

sridhargaddam commented 3 years ago

Submariner performs a periodic health check between the gateway nodes, and it uses the CNI interface IP on the host for this. It looks like the CNI in the K3s environment is not creating an interface on the host with an IP from the Pod CIDR.

You can disable Submariner's health-check support in your deployment to avoid this issue. Please re-run the subctl join ... commands on your clusters and include --health-check=false as an argument.
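
For reference, a hedged sketch of the re-join (the broker-info.subm path and cluster ID are assumptions based on this thread; reuse whatever other arguments you passed originally):

# Sketch only: re-join cluster-a with the gateway health check disabled.
# broker-info.subm is the file produced by the broker deployment (path assumed).
export KUBECONFIG=kubeconfig.cluster-a
subctl join broker-info.subm --clusterid cluster-a --health-check=false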

dfarrell07 commented 2 years ago

Can we close this?

ccwalterhk commented 2 years ago

With --health-check=false, the gateway is running successfully. No more crashes.

Below is some output from cluster a:

walter@k3-subctl-m1:~$ subctl show all --kubeconfig kubeconfig.cluster-a
Cluster "default"
 ✓ Showing Connections
GATEWAY       CLUSTER    REMOTE IP     NAT  CABLE DRIVER  SUBNETS                       STATUS     RTT avg.  
k3-subctl-w2  cluster-b  192.168.1.93  no   libreswan     10.145.0.0/16, 10.144.0.0/24  connected            

 ✓ Showing Endpoints
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
cluster-a                     192.168.1.92    <my public IP>  libreswan           local           
cluster-b                     192.168.1.93    <my public IP>  libreswan           remote          

 ✓ Showing Gateways
NODE                            HA STATUS       SUMMARY                         
k3-subctl-w1                    active          All connections (1) are established

    Discovered network details via Submariner:
        Network plugin:  generic
        Service CIDRs:   [10.45.0.0/16]
        Cluster CIDRs:   [10.44.0.0/24]
 ✓ Showing Network details

COMPONENT                       REPOSITORY                                            VERSION         
submariner                      quay.io/submariner                                    0.11.0          
submariner-operator             quay.io/submariner                                    0.11.0          
service-discovery               quay.io/submariner                                    0.11.0          
 ✓ Showing versions

walter@k3-subctl-m1:~$ 

Below is some output from cluster b:

walter@k3-subctl-m1:~$ subctl show all --kubeconfig kubeconfig.cluster-b
Cluster "default"
 ✓ Showing Connections
GATEWAY       CLUSTER    REMOTE IP     NAT  CABLE DRIVER  SUBNETS                     STATUS     RTT avg.  
k3-subctl-w1  cluster-a  192.168.1.92  no   libreswan     10.45.0.0/16, 10.44.0.0/24  connected            

 ✓ Showing Endpoints
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
cluster-b                     192.168.1.93    <my public IP>  libreswan           local           
cluster-a                     192.168.1.92    <my public IP>  libreswan           remote          

 ✓ Showing Gateways
NODE                            HA STATUS       SUMMARY                         
k3-subctl-w2                    active          All connections (1) are established

    Discovered network details via Submariner:
        Network plugin:  generic
        Service CIDRs:   [10.145.0.0/16]
        Cluster CIDRs:   [10.144.0.0/24]
 ✓ Showing Network details

COMPONENT                       REPOSITORY                                            VERSION         
submariner                      quay.io/submariner                                    0.11.0          
submariner-operator             quay.io/submariner                                    0.11.0          
service-discovery               quay.io/submariner                                    0.11.0          
 ✓ Showing versions
ccwalterhk commented 2 years ago

However, when I try to ping from a pod in cluster a to a pod in cluster b, the ping does not get through.

This is the pod in cluster b:

walter@k3-subctl-m1:~$ sudo kubectl --kubeconfig kubeconfig.cluster-b run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
[sudo] password for walter: 
If you don't see a command prompt, try pressing enter.
bash-5.1# ifconfig
eth0      Link encap:Ethernet  HWaddr BE:46:42:C6:44:09  
          inet addr:10.144.1.9  Bcast:10.144.1.255  Mask:255.255.255.0
          inet6 addr: fe80::bc46:42ff:fec6:4409/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:13 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1074 (1.0 KiB)  TX bytes:628 (628.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

bash-5.1# 

This is the pod in cluster a:

walter@k3-subctl-m1:~$ sudo kubectl --kubeconfig kubeconfig.cluster-a run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
If you don't see a command prompt, try pressing enter.
bash-5.1# ifconfig
eth0      Link encap:Ethernet  HWaddr DA:E6:09:76:D1:8E  
          inet addr:10.44.1.14  Bcast:10.44.1.255  Mask:255.255.255.0
          inet6 addr: fe80::d8e6:9ff:fe76:d18e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:12 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:984 (984.0 B)  TX bytes:628 (628.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

bash-5.1# ping 10.144.19
PING 10.144.19 (10.144.0.19) 56(84) bytes of data.
^C
--- 10.144.19 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1029ms

bash-5.1# ping 10.44.1.14
PING 10.44.1.14 (10.44.1.14) 56(84) bytes of data.
64 bytes from 10.44.1.14: icmp_seq=1 ttl=64 time=0.089 ms
64 bytes from 10.44.1.14: icmp_seq=2 ttl=64 time=0.026 ms
^C
--- 10.44.1.14 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1014ms
rtt min/avg/max/mdev = 0.026/0.057/0.089/0.031 ms
bash-5.1# 
ccwalterhk commented 2 years ago

When I use the command below to verify the deployment, I get the following output.

walter@k3-subctl-m1:~$ KUBECONFIG=kubeconfig.cluster-a:kubeconfig.cluster-b subctl verify --kubecontexts cluster-a,cluster-b --only service-discovery,connectivity --verbose |more

Performing the following verifications: service-discovery, connectivity
Running Suite: Submariner E2E suite
===================================
Random Seed: 1640491770
Will run 37 of 41 specs

STEP: Creating kubernetes clients
STEP: Setting new cluster ID "cluster-b", previous cluster ID was "cluster-a"
STEP: Creating lighthouse clients
STEP: Creating submariner clients
[dataplane-globalnet] Basic TCP connectivity tests across overlapping clusters without discovery when a pod connects via TCP to the globalIP of a remote service when the pod is not on a gateway and the remote service is not on a gateway 
  should have sent the expected data from the pod to the other pod
  github.com/submariner-io/submariner@v0.11.0/test/e2e/dataplane/tcp_gn_pod_connectivity.go:35
STEP: Creating namespace objects with basename "dataplane-gn-conn-nd"
STEP: Generated namespace "e2e-tests-dataplane-gn-conn-nd-svx2r" in cluster "cluster-b" to execute the tests in
STEP: Creating namespace "e2e-tests-dataplane-gn-conn-nd-svx2r" in cluster "cluster-b"
STEP: Deleting namespace "e2e-tests-dataplane-gn-conn-nd-svx2r" on cluster "cluster-b"
STEP: Deleting namespace "e2e-tests-dataplane-gn-conn-nd-svx2r" on cluster "cluster-b"

• Failure in Spec Setup (BeforeEach) [0.043 seconds]
[dataplane-globalnet] Basic TCP connectivity tests across overlapping clusters without discovery
github.com/submariner-io/submariner@v0.11.0/test/e2e/dataplane/tcp_gn_pod_connectivity.go:28
  when a pod connects via TCP to the globalIP of a remote service [BeforeEach]
  github.com/submariner-io/submariner@v0.11.0/test/e2e/dataplane/tcp_gn_pod_connectivity.go:53
    when the pod is not on a gateway and the remote service is not on a gateway
    github.com/submariner-io/submariner@v0.11.0/test/e2e/dataplane/tcp_gn_pod_connectivity.go:60
      should have sent the expected data from the pod to the other pod
      github.com/submariner-io/submariner@v0.11.0/test/e2e/dataplane/tcp_gn_pod_connectivity.go:35

      Error creating namespace &Namespace{ObjectMeta:{e2e-tests-dataplane-gn-conn-nd-svx2r      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[e2e-framework:dataplane-gn-conn-nd] map[] [] []  []},Spec:NamespaceSpec{Finalizers:[],},Status:NamespaceStatus{Phase:,Conditions:[]NamespaceCondition{}
,},}
      Unexpected error:
          <*errors.StatusError | 0xc0005081e0>: {
              ErrStatus: {
                  TypeMeta: {Kind: "", APIVersion: ""},
                  ListMeta: {
                      SelfLink: "",
                      ResourceVersion: "",
                      Continue: "",
                      RemainingItemCount: nil,
                  },
                  Status: "Failure",
      ... (output truncated)

nyechiel commented 2 years ago

Closing old issues. Please re-open if this is still relevant.

ccwalterhk commented 2 years ago

There is some improvement after using --health-check=false: the gateway is up and running now. But the issue has not been fixed; a pod in cluster A cannot ping a pod in cluster B.

nyechiel commented 2 years ago

@ccwalterhk are you following https://submariner.io/getting-started/quickstart/k3s/?

ccwalterhk commented 2 years ago

Yes. I followed the exact steps at that URL. I even used the same pod and service CIDRs.

ccwalterhk commented 2 years ago

BTW, all the worker and master nodes are on the same segment, although I don't think this is a concern.

ccwalterhk commented 2 years ago
walter@k3-subctl-m1:~$ sudo kubectl --kubeconfig kubeconfig.cluster-a get node -o wide
NAME           STATUS   ROLES                  AGE   VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k3-subctl-m1   Ready    control-plane,master   46h   v1.22.5+k3s1   192.168.1.90   <none>        Ubuntu 20.04.3 LTS   5.4.0-91-generic   containerd://1.5.8-k3s1
k3-subctl-w1   Ready    <none>                 44h   v1.22.5+k3s1   192.168.1.92   <none>        Ubuntu 20.04.3 LTS   5.4.0-91-generic   containerd://1.5.8-k3s1
walter@k3-subctl-m1:~$ sudo kubectl --kubeconfig kubeconfig.cluster-b get node -o wide
NAME           STATUS   ROLES                  AGE   VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k3-subctl-m2   Ready    control-plane,master   46h   v1.22.5+k3s1   192.168.1.91   <none>        Ubuntu 20.04.3 LTS   5.4.0-91-generic   containerd://1.5.8-k3s1
k3-subctl-w2   Ready    <none>                 44h   v1.22.5+k3s1   192.168.1.93   <none>        Ubuntu 20.04.3 LTS   5.4.0-91-generic   containerd://1.5.8-k3s1
walter@k3-subctl-m1:~$ subctl show all --kubeconfig kubeconfig.cluster-a
Cluster "default"
 ✓ Showing Connections
GATEWAY       CLUSTER    REMOTE IP     NAT  CABLE DRIVER  SUBNETS                       STATUS     RTT avg.  
k3-subctl-w2  cluster-b  192.168.1.93  no   libreswan     10.145.0.0/16, 10.144.0.0/24  connected            

 ✓ Showing Endpoints
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
cluster-a                     192.168.1.92    58.182.182.152  libreswan           local           
cluster-b                     192.168.1.93    58.182.182.152  libreswan           remote          

 ✓ Showing Gateways
NODE                            HA STATUS       SUMMARY                         
k3-subctl-w1                    active          All connections (1) are established

    Discovered network details via Submariner:
        Network plugin:  generic
        Service CIDRs:   [10.45.0.0/16]
        Cluster CIDRs:   [10.44.0.0/24]
 ✓ Showing Network details

COMPONENT                       REPOSITORY                                            VERSION         
submariner                      quay.io/submariner                                    0.11.0          
submariner-operator             quay.io/submariner                                    0.11.0          
service-discovery               quay.io/submariner                                    0.11.0          
 ✓ Showing versions

walter@k3-subctl-m1:~$ subctl show all --kubeconfig kubeconfig.cluster-b
Cluster "default"
 ✓ Showing Connections
GATEWAY       CLUSTER    REMOTE IP     NAT  CABLE DRIVER  SUBNETS                     STATUS     RTT avg.  
k3-subctl-w1  cluster-a  192.168.1.92  no   libreswan     10.45.0.0/16, 10.44.0.0/24  connected            

 ✓ Showing Endpoints
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
cluster-b                     192.168.1.93    58.182.182.152  libreswan           local           
cluster-a                     192.168.1.92    58.182.182.152  libreswan           remote          

 ✓ Showing Gateways
NODE                            HA STATUS       SUMMARY                         
k3-subctl-w2                    active          All connections (1) are established

    Discovered network details via Submariner:
        Network plugin:  generic
        Service CIDRs:   [10.145.0.0/16]
        Cluster CIDRs:   [10.144.0.0/24]
 ✓ Showing Network details

COMPONENT                       REPOSITORY                                            VERSION         
submariner                      quay.io/submariner                                    0.11.0          
submariner-operator             quay.io/submariner                                    0.11.0          
service-discovery               quay.io/submariner                                    0.11.0          
 ✓ Showing versions

walter@k3-subctl-m1:~$ 
ccwalterhk commented 2 years ago

Output of subctl diagnose all:

walter@k3-subctl-m1:~$ subctl diagnose all --kubeconfig kubeconfig.cluster-a
Cluster "default"
 ✓ Checking Submariner support for the Kubernetes version
 ✓ Kubernetes version "v1.22.5+k3s1" is supported

 ✓ Checking Submariner support for the CNI network plugin
 ✓ The detected CNI network plugin ("generic") is supported

 ✓ Checking gateway connections
 ✓ All connections are established

 ✓ Checking Submariner pods
 ✓ All Submariner pods are up and running

 ✓ Non-Globalnet deployment detected - checking if cluster CIDRs overlap
 ✓ Clusters do not have overlapping CIDRs

 ✓ Checking Submariner support for the kube-proxy mode 
 ✓ The kube-proxy mode is supported

 ✗ Checking the firewall configuration to determine if the metrics port (8080) is allowed 
 ✗ The tcpdump output from the sniffer pod does not contain the client pod HostIP. Please check that your firewall configuration allows TCP/8080 traffic on the "k3-subctl-w1" node.

 ✓ Checking the firewall configuration to determine if VXLAN traffic is allowed 
 ✓ The firewall configuration allows VXLAN traffic

Skipping inter-cluster firewall check as it requires two kubeconfigs. Please run "subctl diagnose firewall inter-cluster" command manually.

walter@k3-subctl-m1:~$ 
ccwalterhk commented 2 years ago

Output of subctl diagnose firewall inter-cluster:

walter@k3-subctl-m1:~$ subctl diagnose firewall inter-cluster kubeconfig.cluster-a kubeconfig.cluster-b
 ✓ Checking if tunnels can be setup on the gateway node of cluster "cluster-a" 
 ✓ Tunnels can be established on the gateway node

walter@k3-subctl-m1:~$ subctl diagnose firewall inter-cluster  kubeconfig.cluster-b kubeconfig.cluster-a
 ✓ Checking if tunnels can be setup on the gateway node of cluster "cluster-b" 
 ✓ Tunnels can be established on the gateway node
ccwalterhk commented 2 years ago

Uploaded the subctl gather output for clusters a and b.

submariner-cluster-B-20211227124346.tar.gz submariner-cluster-A-20211227124333.tar.gz

sridhargaddam commented 2 years ago

It looks like the Submariner operator was not able to detect the CNI in the cluster and is using the "generic" route agent. May I know which CNI you are using in the cluster?

Also, I see that tunnels/connections are properly established between the gateway nodes. So is the connectivity issue you are seeing only present when the source or destination pod is on a non-gateway node?

ccwalterhk commented 2 years ago

I did not specify any CNI during the installation of K3s and Submariner. I believe it is using the K3s default, which is --flannel-backend=vxlan. Is there any command to confirm which CNI is being used?

I only have 1 worker node per cluster, so the pod is also located on the gateway node.

sridhargaddam commented 2 years ago

Looking at the logs, yes, you seem to be using the flannel CNI. So I think there is some (non-fatal) issue in the submariner-operator code, which is unable to detect the flannel CNI on K3s. Can you please report an issue on the "submariner-operator" repo?
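
As an aside, a hedged way to confirm the CNI on a k3s node: the flannel VXLAN backend creates a flannel.1 device, so its presence is a reasonable indicator. The device name and the unit path below are flannel/k3s defaults and are assumptions, not something Submariner reports:

# A flannel.1 device indicates the flannel VXLAN backend (default device name assumed)
ip -d link show flannel.1
# The k3s systemd unit records any explicit --flannel-backend flag (default install path assumed)
grep -i flannel /etc/systemd/system/k3s.service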

sridhargaddam commented 2 years ago

However, when I try to ping from a pod in cluster a to a pod in cluster b, the ping does not get through.

This is the pod in cluster a:


walter@k3-subctl-m1:~$ sudo kubectl --kubeconfig kubeconfig.cluster-a run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
If you don't see a command prompt, try pressing enter.
bash-5.1# ifconfig
eth0      Link encap:Ethernet  HWaddr DA:E6:09:76:D1:8E  
          inet addr:10.44.1.14  Bcast:10.44.1.255  Mask:255.255.255.0
          inet6 addr: fe80::d8e6:9ff:fe76:d18e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:12 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:984 (984.0 B)  TX bytes:628 (628.0 B)

bash-5.1# ping 10.144.19
PING 10.144.19 (10.144.0.19) 56(84) bytes of data.
^C
--- 10.144.19 ping statistics ---

The IP address of the pod in cluster-b is 10.144.1.9, but it looks like you are trying to ping 10.144.0.19. Are you sure the IP address you tried to ping belongs to a running pod in cluster-b?

sridhargaddam commented 2 years ago
  Error creating namespace &Namespace{ObjectMeta:{e2e-tests-dataplane-gn-conn-nd-svx2r      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[e2e-framework:dataplane-gn-conn-nd] map[] [] []  []},Spec:NamespaceSpec{Finalizers:[],},Status:NamespaceStatus{Phase:,Conditions:[]NamespaceCondition{},},}

The error does not seem to be related to Submariner; the e2e test code is getting failures when it tries to create a namespace. Please check that you are able to create namespaces in your cluster with the user account you are using to run the e2e tests.
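
A quick way to check that permission (standard kubectl; the kubeconfig name follows this thread):

kubectl --kubeconfig kubeconfig.cluster-b auth can-i create namespaces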

ccwalterhk commented 2 years ago

The IP address of the pod in cluster-b is 10.144.1.9, but it looks like you are trying to ping 10.144.0.19. Are you sure the IP address you tried to ping belongs to a running pod in cluster-b?

Sorry, that was a typo. But I have tried this multiple times; below is another test.

Pod in cluster b:

walter@k3-subctl-m1:~$ sudo kubectl --kubeconfig kubeconfig.cluster-b run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
[sudo] password for walter: 
If you don't see a command prompt, try pressing enter.
bash-5.1# ifconfig
eth0      Link encap:Ethernet  HWaddr 2A:32:00:05:EF:C9  
          inet addr:10.144.1.24  Bcast:10.144.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2832:ff:fe05:efc9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:13 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1074 (1.0 KiB)  TX bytes:628 (628.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Pod in cluster a:

walter@k3-subctl-m1:~$ sudo kubectl --kubeconfig kubeconfig.cluster-a run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
[sudo] password for walter: 
If you don't see a command prompt, try pressing enter.
bash-5.1# ifconfig
eth0      Link encap:Ethernet  HWaddr 7A:E6:8F:72:58:A8  
          inet addr:10.44.1.46  Bcast:10.44.1.255  Mask:255.255.255.0
          inet6 addr: fe80::78e6:8fff:fe72:58a8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:15 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1214 (1.1 KiB)  TX bytes:768 (768.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Cannot ping from cluster a to cluster b:

bash-5.1# ping 10.144.1.24
PING 10.144.1.24 (10.144.1.24) 56(84) bytes of data.

--- 10.144.1.24 ping statistics ---
8 packets transmitted, 0 received, 100% packet loss, time 7148ms
ccwalterhk commented 2 years ago

Looking at the logs, yes, you seem to be using the flannel CNI. So I think there is some (non-fatal) issue in the submariner-operator code, which is unable to detect the flannel CNI on K3s. Can you please report an issue on the "submariner-operator" repo?

Thanks. I can report an issue on the "submariner-operator" repo. For now, can you advise how to work around it?

sridhargaddam commented 2 years ago

Cannot ping from cluster a to cluster b:

bash-5.1# ping 10.144.1.24
PING 10.144.1.24 (10.144.1.24) 56(84) bytes of data.

--- 10.144.1.24 ping statistics ---
8 packets transmitted, 0 received, 100% packet loss, time 7148ms

While the ping is in progress, can you run "tcpdump -len -i any" in the netshoot pod of cluster-b to check whether the packets are received on the destination cluster, or whether the ICMP ping is not reaching the pod in cluster-b at all?

sridhargaddam commented 2 years ago

Looking at the logs, yes, you seem to be using the flannel CNI. So I think there is some (non-fatal) issue in the submariner-operator code, which is unable to detect the flannel CNI on K3s. Can you please report an issue on the "submariner-operator" repo?

Thanks. I can report an issue on the "submariner-operator" repo. For now, can you advise how to work around it?

For flannel, even the generic route agent works, hence I mentioned that it is "non-fatal" in my message. Looking at the logs of the Submariner pods, I could not find any errors in them, so we will have to debug this problem with tcpdump to figure out the issue.

One more observation from the subctl gather output logs: normally when subctl gather ... is run on a cluster, it also collects logs related to ipsec-status, ip-xfrm-state, etc. (screenshot omitted)

I'm not able to find these logs in the subctl gather output that you shared. Did you get any errors while running the "subctl gather ..." command?

ccwalterhk commented 2 years ago

While the ping is in progress, can you run "tcpdump -len -i any" in the netshoot pod of cluster-b to check whether the packets are received on the destination cluster, or whether the ICMP ping is not reaching the pod in cluster-b at all?

I have been watching the RX count in the netshoot pod on cluster-b. I don't think the packets are being received in cluster b.

bash-5.1# ifconfig
eth0      Link encap:Ethernet  HWaddr D2:D3:32:5C:81:F6  
          inet addr:10.144.1.25  Bcast:10.144.1.255  Mask:255.255.255.0
          inet6 addr: fe80::d0d3:32ff:fe5c:81f6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:**18** errors:0 dropped:0 overruns:0 frame:0
          TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1424 (1.3 KiB)  TX bytes:908 (908.0 B)

bash-5.1# ifconfig
eth0      Link encap:Ethernet  HWaddr D2:D3:32:5C:81:F6  
          inet addr:10.144.1.25  Bcast:10.144.1.255  Mask:255.255.255.0
          inet6 addr: fe80::d0d3:32ff:fe5c:81f6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:**18** errors:0 dropped:0 overruns:0 frame:0
          TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1424 (1.3 KiB)  TX bytes:908 (908.0 B)

bash-5.1# ifconfig
eth0      Link encap:Ethernet  HWaddr D2:D3:32:5C:81:F6  
          inet addr:10.144.1.25  Bcast:10.144.1.255  Mask:255.255.255.0
          inet6 addr: fe80::d0d3:32ff:fe5c:81f6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:**18** errors:0 dropped:0 overruns:0 frame:0
          TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1424 (1.3 KiB)  TX bytes:908 (908.0 B)
ccwalterhk commented 2 years ago

Did you get any errors while running the "subctl gather ..." command?

I used the syntax below to generate the logs. I don't get any errors at all.

subctl gather --kubeconfig kubeconfig.cluster-a

ccwalterhk commented 2 years ago

While the ping is in progress, can you run "tcpdump -len -i any" in the netshoot pod of cluster-b to check whether the packets are received on the destination cluster, or whether the ICMP ping is not reaching the pod in cluster-b at all?

Confirmed with tcpdump: nothing is arriving.

sridhargaddam commented 2 years ago

I used the syntax below to generate the logs. I don't get any errors at all.

subctl gather --kubeconfig kubeconfig.cluster-a

@vthapar FYI

sridhargaddam commented 2 years ago

Confirmed with tcpdump: nothing is arriving.

Okay, so that confirms the packets are not even reaching the destination cluster; they are probably being dropped on the source gateway node itself. Please try to check where they are getting dropped on the gateway node.

I'm a bit busy this week, and the other Submariner developers are off for the holidays. If you are unable to figure out the issue by next week, I can provide some support in investigating it.
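
A hedged starting point for that check (the pod IP is taken from the earlier test in this thread; run on the cluster-a gateway node, k3-subctl-w1, while the ping is in progress):

# Does the pod traffic reach the gateway node at all?
sudo tcpdump -len -i any host 10.144.1.24
# Is anything leaving over the IPsec NAT-T port the gateway uses (4500/UDP per the pod spec)?
sudo tcpdump -len -i any udp port 4500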

ccwalterhk commented 2 years ago

Okay, so that confirms the packets are not even reaching the destination cluster; they are probably being dropped on the source gateway node itself. Please try to check where they are getting dropped on the gateway node.

I'm a bit busy this week, and the other Submariner developers are off for the holidays. If you are unable to figure out the issue by next week, I can provide some support in investigating it.

Actually, thank you very much for the help. I found something interesting: the traffic goes out to the internet. I also don't know what 10.44.1.1 is.

walter@k3-subctl-m1:~$ sudo kubectl --kubeconfig kubeconfig.cluster-a run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
[sudo] password for walter: 
If you don't see a command prompt, try pressing enter.
bash-5.1# ping 10.144.1.27
PING 10.144.1.27 (10.144.1.27) 56(84) bytes of data.
^C
--- 10.144.1.27 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

bash-5.1# traceroute 10.144.1.27
traceroute to 10.144.1.27 (10.144.1.27), 30 hops max, 46 byte packets
 1  10.44.1.1 (10.44.1.1)  0.005 ms  0.004 ms  0.003 ms
 2  gateway.home.local (192.168.1.5)  0.631 ms  0.294 ms  0.343 ms
 3  183.90.58.1 (183.90.58.1)  2.483 ms  2.190 ms  2.368 ms
 4  183.90.44.101 (183.90.44.101)  2.295 ms  2.368 ms  2.339 ms
 5  *  *  *
 6  *^C
bash-5.1# 
sridhargaddam commented 2 years ago

I think the problem is that the cluster CIDR configured in Submariner does not match the actual value for your cluster. For example, from the subctl show output above, I can see the following:

For Cluster A:        
    Discovered network details via Submariner:
        Network plugin:  generic
        Service CIDRs:   [10.45.0.0/16]
        Cluster CIDRs:   [10.44.0.0/24]    <---- This is the issue

For Cluster B:
    Discovered network details via Submariner:
        Network plugin:  generic
        Service CIDRs:   [10.145.0.0/16]
        Cluster CIDRs:   [10.144.0.0/24]   <---- This is the issue

So, cluster-b is advertising its local clusterCIDR as 10.144.0.0/24 (i.e., the addresses 10.144.0.0 to 10.144.0.255). But when you schedule a pod in cluster-b, it gets an IP address like 10.144.1.x, so its traffic does not go over the Submariner IPsec tunnel. A similar thing is happening for cluster-a.

Did you explicitly specify --clustercidr during the subctl join ... operation, or was it auto-discovered by the submariner-operator?

In case it was auto-discovered, it means there is a bug in the submariner-operator auto-discovery code. You can do ONE of the following until it is fixed (a sketch of both options follows the list):

  1. Re-run "subctl join --clustercidr <correct-value> ..." on both clusters.
  2. Modify the Submariner CR on both clusters and edit the value of clusterCIDR to point to the correct CIDR.
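
A hedged sketch of both options for cluster-b (the broker-info.subm path and the CR name "submariner" are assumptions; mirror the same change on cluster-a with its own CIDR):

# Option 1: re-join with the Pod CIDR the cluster actually uses (10.144.0.0/16 per this thread)
subctl join broker-info.subm --kubeconfig kubeconfig.cluster-b --clusterid cluster-b \
    --clustercidr 10.144.0.0/16 --health-check=false

# Option 2: edit the clusterCIDR field of the Submariner CR in place (CR name assumed)
kubectl --kubeconfig kubeconfig.cluster-b -n submariner-operator edit submariner submariner
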
ccwalterhk commented 2 years ago

  1. Re-run "subctl join --clustercidr <correct-value> ..." on both clusters.

Thanks. You are right. After I used method 1 you proposed, it is pingable now. Very good.

Just FYI, I still need to use --health-check=false. If I remove --health-check=false, there is an error in the subctl show output.

walter@k3-subctl-m1:~$ subctl show all --kubeconfig kubeconfig.cluster-a
Cluster "default"
 ✓ Showing Connections
GATEWAY       CLUSTER    REMOTE IP     NAT  CABLE DRIVER  SUBNETS                       STATUS  RTT avg.  
k3-subctl-w2  cluster-b  192.168.1.93  no   libreswan     10.145.0.0/16, 10.144.0.0/24  error   0s        

 ✓ Showing Endpoints
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
cluster-a                     192.168.1.92    58.182.182.154  libreswan           local           
cluster-b                     192.168.1.93    58.182.182.154  libreswan           remote          

 ✓ Showing Gateways
NODE                            HA STATUS       SUMMARY                         
k3-subctl-w1                    active          0 connections out of 1 are established

    Discovered network details via Submariner:
        Network plugin:  generic
        Service CIDRs:   [10.45.0.0/16]
        Cluster CIDRs:   [10.44.0.0/16]
 ✓ Showing Network details

COMPONENT                       REPOSITORY                                            VERSION         
submariner                      quay.io/submariner                                    0.11.0          
submariner-operator             quay.io/submariner                                    0.11.0          
service-discovery               quay.io/submariner                                    0.11.0          
 ✓ Showing versions
ccwalterhk commented 2 years ago

### This is the output if I use --health-check=false

walter@k3-subctl-m1:~$ subctl show all --kubeconfig kubeconfig.cluster-a
Cluster "default"
 ✓ Showing Connections
GATEWAY       CLUSTER    REMOTE IP     NAT  CABLE DRIVER  SUBNETS                       STATUS     RTT avg.  
k3-subctl-w2  cluster-b  192.168.1.93  no   libreswan     10.145.0.0/16, 10.144.0.0/16  connected            

 ✓ Showing Endpoints
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
cluster-a                     192.168.1.92    58.182.182.154  libreswan           local           
cluster-b                     192.168.1.93    58.182.182.154  libreswan           remote          

 ✓ Showing Gateways
NODE                            HA STATUS       SUMMARY                         
k3-subctl-w1                    active          All connections (1) are established

    Discovered network details via Submariner:
        Network plugin:  generic
        Service CIDRs:   [10.45.0.0/16]
        Cluster CIDRs:   [10.44.0.0/16]
 ✓ Showing Network details

COMPONENT                       REPOSITORY                                            VERSION         
submariner                      quay.io/submariner                                    0.11.0          
submariner-operator             quay.io/submariner                                    0.11.0          
service-discovery               quay.io/submariner                                    0.11.0          
 ✓ Showing versions

walter@k3-subctl-m1:~$ subctl show all --kubeconfig kubeconfig.cluster-b
Cluster "default"
 ✓ Showing Connections
GATEWAY       CLUSTER    REMOTE IP     NAT  CABLE DRIVER  SUBNETS                     STATUS     RTT avg.  
k3-subctl-w1  cluster-a  192.168.1.92  no   libreswan     10.45.0.0/16, 10.44.0.0/16  connected            

 ✓ Showing Endpoints
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
cluster-b                     192.168.1.93    58.182.182.154  libreswan           local           
cluster-a                     192.168.1.92    58.182.182.154  libreswan           remote          

 ✓ Showing Gateways
NODE                            HA STATUS       SUMMARY                         
k3-subctl-w2                    active          All connections (1) are established

    Discovered network details via Submariner:
        Network plugin:  generic
        Service CIDRs:   [10.145.0.0/16]
        Cluster CIDRs:   [10.144.0.0/16]
 ✓ Showing Network details

COMPONENT                       REPOSITORY                                            VERSION         
submariner                      quay.io/submariner                                    0.11.0          
submariner-operator             quay.io/submariner                                    0.11.0          
service-discovery               quay.io/submariner                                    0.11.0          
 ✓ Showing versions

walter@k3-subctl-m1:~$ 
ccwalterhk commented 2 years ago

Thank you, @sridhargaddam. By going through this troubleshooting session, I learnt a lot about Submariner and Kubernetes.

I suggest updating https://submariner.io/getting-started/quickstart/k3s/ to include the cluster CIDR option in the join command.
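
For example, the documented join step might look something like this (CIDR value is the one used in this thread; k3s defaults would differ):

subctl join broker-info.subm --kubeconfig kubeconfig.cluster-a --clusterid cluster-a \
    --clustercidr 10.44.0.0/16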

sridhargaddam commented 2 years ago

Thank you, @sridhargaddam. By going through this troubleshooting session, I learnt a lot about Submariner and Kubernetes.

I suggest updating https://submariner.io/getting-started/quickstart/k3s/ to include the cluster CIDR option in the join command.

Glad to hear that. We welcome PRs :) In case you want to propose one, please feel free to send a PR to the following repo - https://github.com/submariner-io/submariner-website

Also, can you please report a bug on the "submariner-operator" repo with the following info? Thanks.

  1. Auto-discovery of CNI is failing on K3s
  2. Issue with discovery of ClusterCIDRs on K3s
ccwalterhk commented 2 years ago

Did you explicitly specify the --clustercidr during the subctl join ... operation or was it auto-discovered by "submariner-operator"?

Glad to hear that it's working now. May I know if the original issue was a wrong configuration, or an issue with the submariner-operator auto-discovery code?

It is the autodiscovery code.

ccwalterhk commented 2 years ago

Also, can you please report a bug on the "submariner-operator" repo with the following info? Thanks.

  1. Auto-discovery of CNI is failing on K3s
  2. Issue with discovery of ClusterCIDRs on K3s

I will report both issues. I will also take a look at how to do the PR.

nyechiel commented 2 years ago

Thanks @ccwalterhk for the feedback and @sridhargaddam for your support!