submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

subctl show should warn if a sub-connection has a failure #1080

Closed manosnoam closed 3 years ago

manosnoam commented 3 years ago

In a non-Globalnet environment, Libreswan had a connection failure, so the nginx service on one cluster could not be reached from the other cluster: https://qe-jenkins-csb-skynet.cloud.paas.psi.redhat.com/job/debug_job/940/Test-Report/

The problem: the subctl show command did not display the connection failure. It should have been reported, at least as a warning.

What happened:

$ export KUBECONFIG=kubconf_pkomarov-cluster-b
[nmanos@nmanos temp]$ oc  get svc -l app=nginx-cl-b -n test-submariner
NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
nginx-cl-b   ClusterIP   172.32.233.221   <none>        8080/TCP   108m

export KUBECONFIG=kubconf_pkomarov-cluster-a
[nmanos@nmanos temp]$ oc  exec netshoot-cl-a -n test-submariner -- /bin/bash -c "curl --max-time 30 --verbose 172.32.233.221:8080"
*   Trying 172.32.233.221:8080...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:29 --:--:--     0* Connection timed out after 30001 milliseconds
  0     0    0     0    0     0      0      0 --:--:--  0:00:30 --:--:--     0
* Closing connection 0
curl: (28) Connection timed out after 30001 milliseconds
command terminated with exit code 28

Subctl shows no error:

subctl show all
Showing information for cluster "pkomarov-cluster-a":
    Discovered network details:
        Network plugin:  OpenShiftSDN
        Service CIDRs:   [172.31.0.0/16]
        Cluster CIDRs:   [10.132.0.0/14]
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
pkomarov-cluster-a            10.1.52.20      18.223.235.201  libreswan           local           
default-cl2                   10.2.1.169      66.187.232.129  libreswan           remote          
GATEWAY                         CLUSTER                 REMOTE IP       CABLE DRIVER        SUBNETS                                 STATUS          
default-cl2-srrk7-worker-ct748  default-cl2             10.2.1.169      libreswan           172.32.0.0/16, 10.136.0.0/14            connected       
NODE                            HA STATUS       SUMMARY                         
ip-10-1-52-20                   active          All connections (1) are established
COMPONENT                       REPOSITORY                                            VERSION         
submariner                      registry.redhat.io/rhacm2-tech-preview                v0.8.0          
submariner-operator             registry.redhat.io/rhacm2-tech-preview/submariner-rhe v0.8.0          
service-discovery               registry.redhat.io/rhacm2-tech-preview                v0.8.0          
Showing information for cluster "default-cl2":
    Discovered network details:
        Network plugin:  OpenShiftSDN
        Service CIDRs:   [172.32.0.0/16]
        Cluster CIDRs:   [10.136.0.0/14]
CLUSTER ID                    ENDPOINT IP     PUBLIC IP       CABLE DRIVER        TYPE            
default-cl2                   10.2.1.169      66.187.232.129  libreswan           local           
pkomarov-cluster-a            10.1.52.20      18.223.235.201  libreswan           remote          
GATEWAY                         CLUSTER                 REMOTE IP       CABLE DRIVER        SUBNETS                                 STATUS          
ip-10-1-52-20                   pkomarov-cluster-a      10.1.52.20      libreswan           172.31.0.0/16, 10.132.0.0/14            connected       
NODE                            HA STATUS       SUMMARY                         
default-cl2-srrk7-worker-ct748  active          All connections (1) are established
COMPONENT                       REPOSITORY                                            VERSION         
submariner                      registry.redhat.io/rhacm2-tech-preview                v0.8.0          
submariner-operator             registry.redhat.io/rhacm2-tech-preview/submariner-rhe v0.8.0          
service-discovery               registry.redhat.io/rhacm2-tech-preview                v0.8.0  

However, the Submariner Gateway pod log does show a connection problem: the cable connections are not found in the active connections obtained from whack:

### Pod submariner-gateway-wcj2x in Namespace submariner-operator ###

Name:               submariner-gateway-wcj2x
Namespace:          submariner-operator
Priority:           0
PriorityClassName:  <none>
Node:               ip-10-1-52-20.us-east-2.compute.internal/10.1.52.20
Start Time:         Tue, 12 Jan 2021 14:53:31 +0200
Labels:             app=submariner-engine
                    controller-revision-hash=5d46647c45
                    pod-template-generation=1
Annotations:        openshift.io/scc: privileged
Status:             Running
IP:                 10.1.52.20
Controlled By:      DaemonSet/submariner-gateway
Containers:
  submariner:
    Container ID:  cri-o://1c3bbf9e354dd4ae79794e64c1d4d87a5699c1d11c778939fbdbd33dfab54dd1
    Image:         registry.redhat.io/rhacm2-tech-preview/submariner-gateway-rhel8:v0.8.0
    Image ID:      registry.redhat.io/rhacm2-tech-preview/submariner-gateway-rhel8@sha256:f0866441b026f9a3fee49775925d2f1f0b2453f820362d6a80c82cc183b2ed1d
    Port:          <none>
    Host Port:     <none>
    Command:
      submariner.sh
    State:          Running
      Started:      Tue, 12 Jan 2021 14:55:39 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      SUBMARINER_NAMESPACE:                      submariner-operator
      SUBMARINER_CLUSTERCIDR:                    10.132.0.0/14
      SUBMARINER_SERVICECIDR:                    172.31.0.0/16
      SUBMARINER_GLOBALCIDR:                     
      SUBMARINER_CLUSTERID:                      pkomarov-cluster-a
      SUBMARINER_COLORCODES:                     blue
      SUBMARINER_DEBUG:                          false
      SUBMARINER_NATENABLED:                     true
      SUBMARINER_BROKER:                         k8s
      SUBMARINER_CABLEDRIVER:                    libreswan
      BROKER_K8S_APISERVER:                      api.pkomarov-cluster-a.devcluster.openshift.com:6443
      BROKER_K8S_APISERVERTOKEN:                 
2021-01-12T12:56:12.319693557Z 002 added IKEv2 connection "submariner-cable-default-cl2-10-2-1-169-2-2"
2021-01-12T12:56:12.321322402Z 000 "submariner-cable-default-cl2-10-2-1-169-2-2": queuing pending IPsec SA negotiating with 66.187.232.129 IKE SA #1 "submariner-cable-default-cl2-10-2-1-169-0-0"
2021-01-12T12:56:12.321464613Z I0112 12:56:12.321427       1 cableengine.go:155] Successfully installed Endpoint cable "submariner-cable-default-cl2-10-2-1-169" with remote IP 66.187.232.129
2021-01-12T12:56:12.321464613Z I0112 12:56:12.321451       1 tunnel.go:63] Tunnel controller successfully installed Endpoint cable submariner-cable-default-cl2-10-2-1-169 in the engine
2021-01-12T12:56:12.321706489Z I0112 12:56:12.321643       1 tunnel.go:51] Tunnel controller processing added or updated submariner Endpoint object: &v1.Endpoint{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"pkomarov-cluster-a-submariner-cable-pkomarov-cluster-a-10-1-52-20", GenerateName:"", Namespace:"submariner-operator", SelfLink:"/apis/submariner.io/v1/namespaces/submariner-operator/endpoints/pkomarov-cluster-a-submariner-cable-pkomarov-cluster-a-10-1-52-20", UID:"23cb0235-4d56-4714-b413-60ecab43023b", ResourceVersion:"162905", Generation:1, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63746052972, loc:(*time.Location)(0x231d360)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"submariner-engine", Operation:"Update", APIVersion:"submariner.io/v1", Time:(*v1.Time)(0xc0000aa7a0), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc0000aa7e0)}}}, Spec:v1.EndpointSpec{ClusterID:"pkomarov-cluster-a", CableName:"submariner-cable-pkomarov-cluster-a-10-1-52-20", HealthCheckIP:"10.134.2.1", Hostname:"ip-10-1-52-20", Subnets:[]string{"172.31.0.0/16", "10.132.0.0/14"}, PrivateIP:"10.1.52.20", PublicIP:"18.223.235.201", NATEnabled:true, Backend:"libreswan", BackendConfig:map[string]string(nil)}}
2021-01-12T12:56:12.321716825Z I0112 12:56:12.321702       1 cableengine.go:92] Not installing cable for local cluster
2021-01-12T12:56:12.321716825Z I0112 12:56:12.321708       1 tunnel.go:63] Tunnel controller successfully installed Endpoint cable submariner-cable-pkomarov-cluster-a-10-1-52-20 in the engine
2021-01-12T12:56:16.763554914Z I0112 12:56:16.763498       1 libreswan.go:181] Connection "submariner-cable-default-cl2-10-2-1-169-0-0" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:16.763554914Z I0112 12:56:16.763533       1 libreswan.go:181] Connection "submariner-cable-default-cl2-10-2-1-169-0-1" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:16.763554914Z I0112 12:56:16.763542       1 libreswan.go:181] Connection "submariner-cable-default-cl2-10-2-1-169-0-2" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:16.763601281Z I0112 12:56:16.763550       1 libreswan.go:181] Connection "submariner-cable-default-cl2-10-2-1-169-1-0" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:16.763601281Z I0112 12:56:16.763558       1 libreswan.go:181] Connection "submariner-cable-default-cl2-10-2-1-169-1-1" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:16.763601281Z I0112 12:56:16.763567       1 libreswan.go:181] Connection "submariner-cable-default-cl2-10-2-1-169-1-2" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:16.763601281Z I0112 12:56:16.763575       1 libreswan.go:181] Connection "submariner-cable-default-cl2-10-2-1-169-2-0" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:16.763601281Z I0112 12:56:16.763583       1 libreswan.go:181] Connection "submariner-cable-default-cl2-10-2-1-169-2-1" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:16.763601281Z I0112 12:56:16.763591       1 libreswan.go:181] Connection "submariner-cable-default-cl2-10-2-1-169-2-2" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:16.763654078Z I0112 12:56:16.763629       1 libreswan.go:195] Connection "submariner-cable-default-cl2-10-2-1-169" not found in active connections obtained from whack: map[], map[]
2021-01-12T12:56:18.179316978Z I0112 12:56:18.179268       1 pinger.go:142] Pinger for IP "10.139.0.1" stopped
2021-01-12T12:56:18.179316978Z I0112 12:56:18.179291       1 pinger.go:87] Starting pinger for IP "10.139.0.1"

Libreswan also shows that the connection is marked as "prospective erouted":

000 Total IPsec connections: loaded 9, active 2
000  
000 State Information: DDoS cookies not required, Accepting new IKE connections
000 IKE SAs: total(2), half-open(1), open(0), authenticated(1), anonymous(0)
000 IPsec SAs: total(2), authenticated(2), anonymous(0)
000  
000 #5: "submariner-cable-default-cl2-10-2-1-169-0-0":40406 STATE_V2_ESTABLISHED_CHILD_SA (IPsec SA established); EVENT_SA_REKEY in 19703s; newest IPSEC; eroute owner; isakmp#137; idle;
000 #5: "submariner-cable-default-cl2-10-2-1-169-0-0" esp.e92b4c68@66.187.232.129 esp.c305f5c0@10.1.52.20 tun.0@66.187.232.129 tun.0@10.1.52.20 Traffic: ESPin=0B ESPout=0B! ESPmax=0B 
000 #147: "submariner-cable-default-cl2-10-2-1-169-0-0":4501 STATE_PARENT_I1 (sent IKE_SA_INIT request); EVENT_RETRANSMIT in 12s; idle;
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #6: "submariner-cable-default-cl2-10-2-1-169-2-2":40406 STATE_V2_ESTABLISHED_CHILD_SA (IPsec SA established); EVENT_SA_REKEY in 18981s; newest IPSEC; eroute owner; isakmp#137; idle;
000 #6: "submariner-cable-default-cl2-10-2-1-169-2-2" esp.fed5fd55@66.187.232.129 esp.42fb8525@10.1.52.20 tun.0@66.187.232.129 tun.0@10.1.52.20 Traffic: ESPin=839KB ESPout=788KB! ESPmax=0B 
000 #137: "submariner-cable-default-cl2-10-2-1-169-2-2":40406 STATE_V2_ESTABLISHED_IKE_SA (established IKE SA); EVENT_SA_REKEY in 2694s; newest ISAKMP; idle;
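
The status output above already carries two signals that a check could surface: the "loaded 9, active 2" totals and the repeated "pending CHILD SA" lines. The following is a minimal Go sketch of how such text could be scanned. It is not Submariner's actual code; the function names `Totals` and `PendingChildSAs` are hypothetical, and the regular expressions assume the line format shown in this issue.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Patterns assume the status line format pasted in this issue.
var (
	totalsRe  = regexp.MustCompile(`Total IPsec connections: loaded (\d+), active (\d+)`)
	pendingRe = regexp.MustCompile(`pending CHILD SA for "([^"]+)"`)
)

// Totals (hypothetical helper) extracts the loaded and active connection counts.
// ok is false when no totals line is present.
func Totals(status string) (loaded, active int, ok bool) {
	m := totalsRe.FindStringSubmatch(status)
	if m == nil {
		return 0, 0, false
	}
	// The regexp only matches digits, so Atoi fails only on overflow.
	loaded, _ = strconv.Atoi(m[1])
	active, _ = strconv.Atoi(m[2])
	return loaded, active, true
}

// PendingChildSAs (hypothetical helper) counts "pending CHILD SA" lines
// per connection name.
func PendingChildSAs(status string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(status, "\n") {
		if m := pendingRe.FindStringSubmatch(line); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// A short excerpt of the status output from this issue.
	status := `000 Total IPsec connections: loaded 9, active 2
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"
000 #147: pending CHILD SA for "submariner-cable-default-cl2-10-2-1-169-0-0"`

	if loaded, active, ok := Totals(status); ok && active < loaded {
		fmt.Printf("warning: only %d of %d IPsec connections are active\n", active, loaded)
	}
	for name, n := range PendingChildSAs(status) {
		fmt.Printf("warning: %q has %d pending CHILD SA request(s)\n", name, n)
	}
}
```

Whether "active < loaded" should always count as a failure is a judgment call; the sketch only shows that the information is available to report.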

Environment:

OCP cluster A (AWS): Client Version: 4.6.9 Server Version: 4.6.9 Kubernetes Version: v1.19.0+7070803

OCP cluster B (OSP): Client Version: 4.6.9 Server Version: 4.4.7 Kubernetes Version: v1.17.1+f5fb168

Submariner: subctl version: v0.8.0-25-g7efa84b

Showing information for cluster "default-cl2":
COMPONENT                       REPOSITORY                                            VERSION
submariner
submariner-operator             registry.redhat.io/rhacm2-tech-preview/submariner-rhe v0.8.0

manosnoam commented 3 years ago

@sridhargaddam opened issue #1081 with Libreswan-specific details and a tcpdump capture.

Regarding the subctl connection issue, the current behavior is:

  1. The Pinger/health check only pings the HealthCheckIP, which belongs to the Pod CIDR, and the connection for that CIDR is fine.

  2. The subctl connection code currently marks a connection as connected if at least one of its sub-connections is in the active state.
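
The "at least one active sub-connection" rule described above is what hides the failure. One possible alternative is sketched below in Go: report the connection as degraded and emit a warning for each sub-connection that is not active. This is an illustration under stated assumptions, not the subctl implementation. The types `SubConnState` and `SubConnection`, the function `Summarize`, the status strings, and the states assigned in `main` are all hypothetical.

```go
package main

import "fmt"

// SubConnState is a hypothetical per-sub-connection state.
type SubConnState int

const (
	SubConnActive SubConnState = iota
	SubConnPending
	SubConnDown
)

// SubConnection is a hypothetical view of one sub-connection of a cable.
type SubConnection struct {
	Name  string
	State SubConnState
}

// Summarize returns an overall status plus one warning per sub-connection
// that is not active. Unlike the "any active" rule, a partially established
// connection is reported as degraded rather than plainly connected.
func Summarize(subs []SubConnection) (status string, warnings []string) {
	active := 0
	for _, s := range subs {
		if s.State == SubConnActive {
			active++
			continue
		}
		warnings = append(warnings, fmt.Sprintf("sub-connection %q is not active", s.Name))
	}
	switch {
	case active == 0:
		status = "error"
	case active < len(subs):
		status = "connected (degraded)"
	default:
		status = "connected"
	}
	return status, warnings
}

func main() {
	// Names follow the cable naming seen in the logs; the states are
	// made up for illustration only.
	subs := []SubConnection{
		{Name: "submariner-cable-default-cl2-10-2-1-169-0-0", State: SubConnPending},
		{Name: "submariner-cable-default-cl2-10-2-1-169-2-2", State: SubConnActive},
	}
	status, warnings := Summarize(subs)
	fmt.Println("STATUS:", status)
	for _, w := range warnings {
		fmt.Println("WARNING:", w)
	}
}
```

Keeping the status as a form of "connected" while adding warnings preserves the current behavior for callers that only check for connectivity, which is one way to satisfy the "at least as a warning" request in this issue.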

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.