submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

The deployment is successful, but the applications between clusters cannot access each other #1237

Closed lemonit-eric-mao closed 3 years ago

lemonit-eric-mao commented 3 years ago
The deployment is successful, but the applications between clusters cannot access each other
Official Website Document

Preconditions

Name         Version
subctl       0.8.1
k8s          1.20.4
containerd   1.4.4
os           CentOS 7.9

Name         Mode
calico       VXLAN
kube-proxy   iptables

Cluster      Host IPs
Cluster 01   192.168.103.230 ~ 192.168.103.232
Cluster 02   192.168.103.233 ~ 192.168.103.235
Cluster 03   192.168.103.236 ~ 192.168.103.238

The k8s clusters use Calico as the network plugin.
## Download the installation file from the official website
wget https://docs.projectcalico.org/manifests/calico.yaml

Modify the calico.yaml file and add the following environment variables
......
      containers:
        - name: tigera-operator
          image: quay.io/tigera/operator:v1.15.1
          ......
          env:

            ## Modify
            # Disable IPIP
            - name: CALICO_IPV4POOL_IPIP
              value: "Never"
            # Enable VXLAN on the default IP pool
            - name: CALICO_IPV4POOL_VXLAN
              value: "Always"

            ## New
            # Specify the interface; set the "ens" prefix to match the
            # actual network interface names on your hosts
            - name: IP_AUTODETECTION_METHOD
              value: "interface=ens."
......
kubectl apply -f calico.yaml
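To confirm the variables were actually applied, one option is to read them back from the DaemonSet. This is a sketch: in a stock calico.yaml these variables live on the calico-node DaemonSet, so adjust the resource name if your manifest places them elsewhere.

## Read back the pool/autodetection variables from calico-node
kubectl -n kube-system get daemonset calico-node \
  -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' \
  | grep -E 'IPIP|VXLAN|IP_AUTODETECTION'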

## Cluster 01
[root@master01 ~]# calicoctl get ippool -o wide
NAME CIDR NAT IPIPMODE VXLANMODE DISABLED SELECTOR
default-ipv4-ippool 10.244.0.0/16 true Never Always false all()

[root@master01 ~]#

## Cluster 02
[root@master01 ~]# calicoctl get ippool -o wide
NAME CIDR NAT IPIPMODE VXLANMODE DISABLED SELECTOR
default-ipv4-ippool 10.245.0.0/16 true Never Always false all()

[root@master01 ~]#

## Cluster 03
[root@master01 ~]# calicoctl get ippool -o wide
NAME CIDR NAT IPIPMODE VXLANMODE DISABLED SELECTOR
default-ipv4-ippool 10.246.0.0/16 true Never Always false all()

[root@master01 ~]#



Official Installation Document
Official github document
Download and install
curl -Ls https://get.submariner.io | VERSION=v0.8.1 bash
export PATH=$PATH:~/.local/bin
echo export PATH=\$PATH:~/.local/bin >> ~/.profile

Qiniu Cloud Download
## subctl v0.8.1
wget http://qiniu.dev-share.top/subctl -P /usr/local/bin/ && chmod +x /usr/local/bin/subctl

## calicoctl v3.18.1
wget http://qiniu.dev-share.top/calicoctl -P /usr/local/bin/ && chmod +x /usr/local/bin/calicoctl

[Pull the k8s clusters' .kube/config](http://www.dev-share.top/2020/09/29/k8s-%e5%a4%9a%e9%9b%86%e7%be%a4%e5%88%87%e6%8d%a2/ "Pull the k8s clusters' .kube/config")
[root@master01 ~]# ./generate-kube-config.sh \
    cluster-01=192.168.103.230 \
    cluster-02=192.168.103.233 \
    cluster-03=192.168.103.236 \
    && source /etc/profile

[root@master01 ~]# ll
-rw------- 1 root root 5541 Apr 2 15:14 cluster-01
-rw------- 1 root root 5545 Apr 2 15:14 cluster-02
-rw------- 1 root root 5541 Apr 2 15:15 cluster-03
-rwxrwxrwx 1 root root 3005 Apr 2 15:09 generate-kube-config.sh
[root@master01 ~]#

Execute on cluster 01 master node

subctl deploy-broker --kubeconfig <PATH-TO-KUBECONFIG-BROKER> --service-discovery (optional: enables multi-cluster service discovery)

[root@master01 ~]# subctl deploy-broker \
                     --kubeconfig cluster-01 \
                     --service-discovery

 ✓ Deploying broker
 ✓ Creating broker-info.subm file
 ✓ A new IPsec PSK will be generated for broker-info.subm



Execute on cluster 01 master node, join the first k8s cluster

subctl join broker-info.subm --disable-nat (disable NAT for IPsec) --kubeconfig <PATH-TO-JOINING-CLUSTER> --clusterid <ID>

[root@master01 yaml]# subctl join \
                        broker-info.subm \
                        --disable-nat \
                        --kubeconfig cluster-01 \
                        --clusterid cluster-01

* broker-info.subm says broker is at: https://192.168.103.230:6443
? Which node should be used as the gateway? worker01
    Discovered network details:
        Network plugin: generic
        Service CIDRs: [10.96.0.0/16]
        Cluster CIDRs: [10.244.0.1/16]
 ✓ Discovering network details
 ✓ Validating Globalnet configurations
 ✓ Discovering multi cluster details
 ✓ Deploying the Submariner operator
 ✓ Created operator CRDs
 ✓ Created operator namespace: submariner-operator
 ✓ Created operator service account and role
 ✓ Created lighthouse service account and role
 ✓ Created Lighthouse service accounts and roles
 ✓ Deployed the operator successfully
 ✓ Creating SA for cluster
 ✓ Deploying Submariner
 ✓ Submariner is up and running



Execute on cluster 01 master node, join the second k8s cluster
## Join the second k8s cluster
[root@master01 yaml]# subctl join \
                        broker-info.subm \
                        --disable-nat \
                        --kubeconfig cluster-02 \
                        --clusterid cluster-02

* broker-info.subm says broker is at: https://192.168.103.230:6443
? Which node should be used as the gateway? worker01
    Discovered network details:
        Network plugin: generic
        Service CIDRs: [10.97.0.0/16]
        Cluster CIDRs: [10.245.0.0/16]
 ✓ Discovering network details
 ✓ Validating Globalnet configurations
 ✓ Discovering multi cluster details
 ✓ Deploying the Submariner operator
 ✓ Created operator CRDs
 ✓ Created operator namespace: submariner-operator
 ✓ Created operator service account and role
 ✓ Created lighthouse service account and role
 ✓ Created Lighthouse service accounts and roles
 ✓ Deployed the operator successfully
 ✓ Creating SA for cluster
 ✓ Deploying Submariner
 ✓ Submariner is up and running



Execute on cluster 01 master node, join the third k8s cluster
## Join the third k8s cluster
[root@master01 yaml]# subctl join \
                        broker-info.subm \
                        --disable-nat \
                        --kubeconfig cluster-03 \
                        --clusterid cluster-03

* broker-info.subm says broker is at: https://192.168.103.230:6443
? Which node should be used as the gateway? worker01
    Discovered network details:
        Network plugin: generic
        Service CIDRs: [10.98.0.0/16]
        Cluster CIDRs: [10.246.0.0/16]
 ✓ Discovering network details
 ✓ Validating Globalnet configurations
 ✓ Discovering multi cluster details
 ✓ Deploying the Submariner operator
 ✓ Created operator CRDs
 ✓ Created operator namespace: submariner-operator
 ✓ Created operator service account and role
 ✓ Created lighthouse service account and role
 ✓ Created Lighthouse service accounts and roles
 ✓ Deployed the operator successfully
 ✓ Creating SA for cluster
 ✓ Deploying Submariner
 ✓ Submariner is up and running
Deployment is complete. If there is any problem, follow the steps below to troubleshoot.






Official troubleshooting document
View and confirm all submariner configurations

# --------------------------------------------------------------------------------

[root@master01 ~]# subctl show all

Showing information for cluster "cluster-01":
Showing Network details
    Discovered network details:
        Network plugin: generic
        Service CIDRs: [10.96.0.0/16]
        Cluster CIDRs: [10.244.0.1/16]

Showing Endpoint details
CLUSTER ID ENDPOINT IP PUBLIC IP CABLE DRIVER TYPE
cluster-01 192.168.103.231 libreswan local
cluster-03 192.168.103.237 libreswan remote
cluster-02 192.168.103.234 libreswan remote

Showing Connection details
GATEWAY CLUSTER REMOTE IP CABLE DRIVER SUBNETS STATUS
worker01 cluster-03 192.168.103.237 libreswan 10.98.0.0/16, 10.246.0.0/16 connected
worker01 cluster-02 192.168.103.234 libreswan 10.97.0.0/16, 10.245.0.0/16 connected

Showing Gateway details
NODE HA STATUS SUMMARY
worker01 active All connections (2) are established

Showing version details
COMPONENT REPOSITORY VERSION
submariner quay.io/submariner 0.8.1
submariner-operator quay.io/submariner 0.8.1
service-discovery quay.io/submariner 0.8.1

# --------------------------------------------------------------------------------

Showing information for cluster "cluster-02":
Showing Network details
    Discovered network details:
        Network plugin: generic
        Service CIDRs: [10.97.0.0/16]
        Cluster CIDRs: [10.245.0.0/16]

Showing Endpoint details
CLUSTER ID ENDPOINT IP PUBLIC IP CABLE DRIVER TYPE
cluster-02 192.168.103.234 libreswan local
cluster-01 192.168.103.231 libreswan remote
cluster-03 192.168.103.237 libreswan remote

Showing Connection details
GATEWAY CLUSTER REMOTE IP CABLE DRIVER SUBNETS STATUS
worker01 cluster-01 192.168.103.231 libreswan 10.96.0.0/16, 10.244.0.1/16 connected
worker01 cluster-03 192.168.103.237 libreswan 10.98.0.0/16, 10.246.0.0/16 connected

Showing Gateway details
NODE HA STATUS SUMMARY
worker01 active All connections (2) are established

Showing version details
COMPONENT REPOSITORY VERSION
submariner quay.io/submariner 0.8.1
submariner-operator quay.io/submariner 0.8.1
service-discovery quay.io/submariner 0.8.1

# --------------------------------------------------------------------------------

Showing information for cluster "cluster-03":
Showing Network details
    Discovered network details:
        Network plugin: generic
        Service CIDRs: [10.98.0.0/16]
        Cluster CIDRs: [10.246.0.0/16]

Showing Endpoint details
CLUSTER ID ENDPOINT IP PUBLIC IP CABLE DRIVER TYPE
cluster-03 192.168.103.237 libreswan local
cluster-01 192.168.103.231 libreswan remote
cluster-02 192.168.103.234 libreswan remote

Showing Connection details
GATEWAY CLUSTER REMOTE IP CABLE DRIVER SUBNETS STATUS
worker01 cluster-01 192.168.103.231 libreswan 10.96.0.0/16, 10.244.0.1/16 connected
worker01 cluster-02 192.168.103.234 libreswan 10.97.0.0/16, 10.245.0.0/16 connected

Showing Gateway details
NODE HA STATUS SUMMARY
worker01 active All connections (2) are established

Showing version details
COMPONENT REPOSITORY VERSION
submariner quay.io/submariner 0.8.1
submariner-operator quay.io/submariner 0.8.1
service-discovery quay.io/submariner 0.8.1
[root@master01 ~]#



View and confirm cluster gateway
[root@master01 ~]# kubectl -n submariner-k8s-broker get clusters.submariner.io
NAME AGE
cluster-01 5m8s
cluster-02 5m8s
cluster-03 5m8s
[root@master01 ~]#
## Cluster 01
[root@master01 ~]# kubectl get node --selector=submariner.io/gateway=true -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
worker01 Ready <none> 6h42m v1.20.4 192.168.103.231 <none> CentOS Linux 7 (Core) 5.11.6-1.el7.elrepo.x86_64 containerd://1.4.4
[root@master01 ~]#

## Cluster 02
[root@master01 ~]# kubectl get node --selector=submariner.io/gateway=true -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
worker01 Ready <none> 6h20m v1.20.4 192.168.103.234 <none> CentOS Linux 7 (Core) 5.11.6-1.el7.elrepo.x86_64 containerd://1.4.4
[root@master01 ~]#

## Cluster 03
[root@master01 yaml]# kubectl get node --selector=submariner.io/gateway=true -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
worker01 Ready <none> 6h31m v1.20.4 192.168.103.237 <none> CentOS Linux 7 (Core) 5.11.6-1.el7.elrepo.x86_64 containerd://1.4.4
[root@master01 yaml]#

View and confirm all gateways
[root@master01 ~]# subctl show connections all

Showing information for cluster "cluster-01":
GATEWAY CLUSTER REMOTE IP CABLE DRIVER SUBNETS STATUS
worker01 cluster-03 192.168.103.237 libreswan 10.98.0.0/16, 10.246.0.0/16 connected
worker01 cluster-02 192.168.103.234 libreswan 10.97.0.0/16, 10.245.0.0/16 connected

Showing information for cluster "cluster-02":
GATEWAY CLUSTER REMOTE IP CABLE DRIVER SUBNETS STATUS
worker01 cluster-01 192.168.103.231 libreswan 10.96.0.0/16, 10.244.0.1/16 connected
worker01 cluster-03 192.168.103.237 libreswan 10.98.0.0/16, 10.246.0.0/16 connected

Showing information for cluster "cluster-03":
GATEWAY CLUSTER REMOTE IP CABLE DRIVER SUBNETS STATUS
worker01 cluster-01 192.168.103.231 libreswan 10.96.0.0/16, 10.244.0.1/16 connected
worker01 cluster-02 192.168.103.234 libreswan 10.97.0.0/16, 10.245.0.0/16 connected
[root@master01 ~]#

View and confirm the current cluster gateway details
[root@master01 ~]# kubectl describe Gateway -n submariner-operator
Name: worker01
Namespace: submariner-operator
Labels: <none>
Annotations: update-timestamp: 1617700363
API Version: submariner.io/v1
Kind: Gateway
Metadata:
  Creation Timestamp: 2021-04-06T08:38:27Z
  Generation: 364
  Managed Fields:
    API Version: submariner.io/v1
    Fields Type: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:update-timestamp:
      f:status:
        .:
        f:connections:
        f:haStatus:
        f:localEndpoint:
          .:
          f:backend:
          f:cable_name:
          f:cluster_id:
          f:healthCheckIP:
          f:hostname:
          f:nat_enabled:
          f:private_ip:
          f:public_ip:
          f:subnets:
        f:statusFailure:
        f:version:
    Manager: submariner-engine
    Operation: Update
    Time: 2021-04-06T08:38:27Z
  Resource Version: 6975
  UID: 51c1ba12-c039-48d9-8ad4-ed00f17570cc
Status:
  Connections:
    Endpoint:
      Backend: libreswan
      cable_name: submariner-cable-cluster-02-192-168-103-234
      cluster_id: cluster-02
      Health Check IP: 10.245.5.0
      Hostname: worker01
      nat_enabled: false # This should be false
      private_ip: 192.168.103.234
      public_ip:
      Subnets:
        10.97.0.0/16
        10.245.0.0/16
    Latency RTT:
      Average: 756.116µs
      Last: 652.209µs
      Max: 3.866329ms
      Min: 402.391µs
      Std Dev: 107.977µs
    Status: connected
    Status Message:
    Endpoint:
      Backend: libreswan
      cable_name: submariner-cable-cluster-03-192-168-103-237
      cluster_id: cluster-03
      Health Check IP: 10.246.5.0
      Hostname: worker01
      nat_enabled: false # This should be false
      private_ip: 192.168.103.237
      public_ip:
      Subnets:
        10.98.0.0/16
        10.246.0.0/16
    Latency RTT:
      Average: 808.937µs
      Last: 819.635µs
      Max: 4.398945ms
      Min: 406.738µs
      Std Dev: 177.569µs
    Status: connected
    Status Message:
  Ha Status: active
  Local Endpoint:
    Backend: libreswan
    cable_name: submariner-cable-cluster-01-192-168-103-231
    cluster_id: cluster-01
    Health Check IP: 10.244.5.0
    Hostname: worker01
    nat_enabled: false # This should be false
    private_ip: 192.168.103.231
    public_ip:
    Subnets:
      10.96.0.0/16
      10.244.0.1/16
  Status Failure:
  Version: v0.8.0-22-gcf1490f
Events: <none>
[root@master01 ~]#

## View and confirm the multicluster service CRDs in the current cluster
[root@master01 ~]# kubectl get crds | grep -iE 'multicluster.x-k8s.io'
serviceexports.multicluster.x-k8s.io 2021-04-06T07:24:34Z
serviceimports.multicluster.x-k8s.io 2021-04-06T07:24:23Z
[root@master01 ~]#

## View and confirm the lighthouse DNS service in the current cluster
[root@master01 ~]# kubectl -n submariner-operator get service submariner-lighthouse-coredns
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
submariner-lighthouse-coredns ClusterIP 10.96.25.151 <none> 53/UDP 4m55s
[root@master01 ~]#

View and confirm the submariner-lighthouse-coredns entry in the CoreDNS ConfigMap
[root@master01 ~]# kubectl -n kube-system describe configmap coredns
Name: coredns
Namespace: kube-system
Labels: <none>
Annotations: <none>

Data
====
Corefile:
----
#lighthouse-start AUTO-GENERATED SECTION. DO NOT EDIT
clusterset.local:53 {
    forward . 10.96.25.151
}
#lighthouse-end
.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}

Events: <none>
[root@master01 ~]#
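If cross-cluster name resolution fails later, one way to isolate the problem is to query the lighthouse DNS server directly, bypassing the CoreDNS forward rule. This is a sketch, not part of the original report: it assumes nslookup is available in the nettest image, uses the ClusterIP shown above (each cluster has its own submariner-lighthouse-coredns IP), and uses the nginx service exported further below.

## Query lighthouse directly; an answer here but a failing curl points at the
## CoreDNS forward rule rather than lighthouse itself
kubectl run dns-check --rm -i --tty \
  --image quay.io/submariner/nettest -- \
  nslookup nginx.nginx-test.svc.clusterset.local 10.96.25.151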






Deploy a test application in cluster 03
## Switch to cluster 03
[root@master01 ~]# kubectl config use-context cluster-03
Switched to context "cluster-03".
[root@master01 ~]#

## Create a test program
kubectl create namespace nginx-test
kubectl -n nginx-test create deployment nginx --image=nginxinc/nginx-unprivileged:stable-alpine
kubectl -n nginx-test expose deployment nginx --port=8080

## View service/pod
[root@master01 ~]# kubectl -n nginx-test get svc,pods -l app=nginx -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/nginx ClusterIP 10.98.224.208 <none> 8080/TCP 3m13s app=nginx

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/nginx-6fdb7ffd5b-z2xx4 1/1 Running 0 3m13s 10.246.5.3 worker01 <none> <none>
[root@master01 ~]#

## Test program creation complete
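Optionally, before exporting, it can help to confirm the service works locally inside cluster-03, so a later failure can be attributed to the cross-cluster path. A minimal sketch using the same nettest image as in the tests below:

## Local sanity check inside cluster-03 (pod name is illustrative)
kubectl -n nginx-test run local-check --rm -i --tty \
  --image quay.io/submariner/nettest -- \
  curl nginx.nginx-test.svc.cluster.local:8080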

Export the service (currently still in "cluster-03")
## Create ServiceExport
[root@master01 ~]# subctl export service --namespace nginx-test nginx
Service exported successfully
[root@master01 ~]#
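For reference, subctl export service creates a ServiceExport resource; the equivalent YAML (fields as shown in the describe output below) would be:

apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: nginx
  namespace: nginx-test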

## After the ServiceExport is created successfully, the nginx service is exported to the other clusters via the broker.
## Once exported, the service can be discovered across the cluster set as nginx.nginx-test.svc.clusterset.local.
[root@master01 ~]# kubectl -n nginx-test describe serviceexports
Name: nginx
Namespace: nginx-test
Labels: <none>
Annotations: <none>
API Version: multicluster.x-k8s.io/v1alpha1
Kind: ServiceExport
Metadata:
  Creation Timestamp: 2021-04-06T09:19:28Z
  Generation: 1
  Resource Version: 7730
  UID: 96ee611c-db27-4702-bb51-a362daf6f023
Status:
  Conditions:
    Last Transition Time: 2021-04-06T09:19:28Z
    Message: Awaiting sync of the ServiceImport to the broker
    Reason: AwaitingSync
    Status: False
    Type: Valid
    Last Transition Time: 2021-04-06T09:19:28Z
    Message: Service was successfully synced to the broker
    Reason:
    Status: True
    Type: Valid
Events: <none>
[root@master01 ~]#

## View the serviceimport. Note: if it is not visible in the other clusters, the network plugin is still misconfigured.
[root@master01 ~]# kubectl get -n submariner-operator serviceimport
NAME TYPE IP AGE
nginx-nginx-test-cluster-03 ClusterSetIP ["10.98.224.208"] 33s
[root@master01 ~]#

Switch to cluster 02 to test
## Switch to cluster 02
[root@master01 ~]# kubectl config use-context cluster-02
Switched to context "cluster-02".
[root@master01 ~]#

## View serviceimport
[root@master01 ~]# kubectl get -n submariner-operator serviceimport
NAME TYPE IP AGE
nginx-nginx-test-cluster-03 ClusterSetIP ["10.98.224.208"] 53s
[root@master01 ~]#

[root@master01 ~]# kubectl create namespace nginx-test
[root@master01 ~]# kubectl -n nginx-test  run --generator=run-pod/v1 \
                     tmp-shell --rm -i --tty --image quay.io/submariner/nettest -- /bin/bash

bash-5.0# curl nginx.nginx-test.svc.clusterset.local:8080
curl: (6) Could not resolve host: nginx.nginx-test.svc.clusterset.local
bash-5.0#
bash-5.0#
bash-5.0# traceroute 10.98.224.208
traceroute to 10.98.224.208 (10.98.224.208), 30 hops max, 46 byte packets
 1  192.168.103.234 (192.168.103.234)  0.023 ms  0.013 ms  0.008 ms
 2  192.168.103.237 (192.168.103.237)  0.369 ms  0.498 ms  0.314 ms
 3  192.168.100.1 (192.168.100.1)  1.923 ms  2.244 ms  1.794 ms

I don't know where I went wrong. Please advise.

sridhargaddam commented 3 years ago

Thanks for reporting the issue and for the detailed logs @lemonit-eric-mao

I see that you validated that the CoreDNS ConfigMap includes the lighthouse DNS in Cluster1. I hope it looks good in Cluster2 and Cluster3 as well?

There are two aspects to look at

  1. Why DNS resolution failed for nginx.nginx-test.svc.clusterset.local in spite of having a Service Import entry in the local cluster. For this, we would need some additional info from Lighthouse pods. I'm looping in @vthapar @aswinsuryan who can guide you on this.

  2. Looking at the logs you shared, all the tunnels seem to be successfully established. However, it is not clear if you created the Calico IPPools. If not, please create the pools as shown here for your setup - https://gist.github.com/sridhargaddam/ff4578b613901f93c62b105565cd690f
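For readers following along: a minimal sketch of the kind of disabled IPPools the linked gist creates, shown here on cluster-01 for cluster-02's CIDRs (pool names are illustrative, CIDRs are from this setup; repeat for every remote cluster's Pod and Service CIDR on every cluster, and apply with calicoctl apply -f):

# Remote Pod CIDR of cluster-02; disabled so Calico neither allocates from
# nor NATs traffic destined to this range
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: cluster-02-pod-cidr
spec:
  cidr: 10.245.0.0/16
  natOutgoing: false
  disabled: true
---
# Remote Service CIDR of cluster-02
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: cluster-02-svc-cidr
spec:
  cidr: 10.97.0.0/16
  natOutgoing: false
  disabled: true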

Also, do let us know if curl to 10.98.224.208:8080 is working in Cluster2. This will help us to know if the issue is with DNS resolution alone or if there is some issue even with datapath connectivity.
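For example, from the tmp-shell pod session shown earlier in cluster-02:

## Datapath check against the exported service's ClusterIP
curl 10.98.224.208:8080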

aswinsuryan commented 3 years ago

@lemonit-eric-mao It seems the nginx-test namespace did not exist in cluster-02 initially and was created just before running curl. Submariner expects the namespace of an exported service to be present in all the joined clusters; it does not create the namespace itself. The EndpointSlices that are created are synced into the service's namespace in all the joined clusters, so here that step would have failed. Without active endpoints in the EndpointSlices, we will not return the ClusterIP from that cluster.

Could you please retry by creating the service namespace in all the clusters where the service will be used?
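A minimal sketch of that retry, using the context names from this issue (the namespace may already exist in some clusters, hence the || true):

## Create the service namespace in every joined cluster
for ctx in cluster-01 cluster-02 cluster-03; do
  kubectl --context "$ctx" create namespace nginx-test 2>/dev/null || true
done

## Re-test service discovery from cluster-02
kubectl --context cluster-02 -n nginx-test run tmp-shell --rm -i --tty \
  --image quay.io/submariner/nettest -- \
  curl nginx.nginx-test.svc.clusterset.local:8080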

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.