submariner-io / submariner

Networking component for interconnecting Pods and Services across Kubernetes clusters.
https://submariner.io
Apache License 2.0

HA Status is "active" before Strongswan associations came up #624

Closed: manosnoam closed this issue 4 years ago

manosnoam commented 4 years ago
$ oc describe Gateway -n submariner-operator # | grep 'Ha Status:\s*active'   

Name:         ip-10-166-5-197
Namespace:    submariner-operator
Labels:       <none>
Annotations:  update-timestamp: 1591304966
API Version:  submariner.io/v1
Kind:         Gateway
Metadata:
  Creation Timestamp:  2020-06-04T21:09:20Z
  Generation:          2
  Resource Version:    26737
  Self Link:           /apis/submariner.io/v1/namespaces/submariner-operator/gateways/ip-10-166-5-197
  UID:                 90e63ced-51c8-44c6-8315-7e52809ca598
Status:
  Connections:
    Endpoint:
      Backend:      strongswan
      cable_name:   submariner-cable-nmanos-cluster-b-10-1-0-13
      cluster_id:   nmanos-cluster-b
      Hostname:     default-cl1-8pt5x-worker-5fm49
      nat_enabled:  true
      private_ip:   10.1.0.13
      public_ip:    66.187.233.202
      Subnets:
        169.254.32.0/19
    Status:          connecting
    Status Message:  Connecting to 66.187.233.202:4501
  Ha Status:         active
  Local Endpoint:
    Backend:      strongswan
    cable_name:   submariner-cable-nmanos-cluster-a-10-166-5-197
    cluster_id:   nmanos-cluster-a
    Hostname:     ip-10-166-5-197
    nat_enabled:  true
    private_ip:   10.166.5.197
    public_ip:    18.191.75.213
    Subnets:
      169.254.0.0/19
  Status Failure:  
  Version:         v0.4.0-rc1
Events:            <none>

$ subctl info

    Discovered network details:
        Network plugin:  OpenShift
        Service CIDRs: [100.96.0.0/16]
        Cluster CIDRs: [10.252.0.0/14]

$ oc describe cm -n openshift-dns

Name:         dns-default
Namespace:    openshift-dns
Labels:       dns.operator.openshift.io/owning-dns=default
Annotations:  <none>

Data
====
Corefile:
----
# lighthouse
supercluster.local:5353 {
    forward . 100.96.145.64
}
.:5353 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        policy sequential
    }
    cache 30
    reload
}

Events:  <none>

$ oc get pods -n submariner-operator --show-labels

NAME                                             READY   STATUS    RESTARTS   AGE   LABELS
submariner-gateway-frtvd                         1/1     Running   0          32s   app=submariner-engine,controller-revision-hash=5fc48cc9f8,pod-template-generation=1
submariner-globalnet-94vx8                       1/1     Running   1          30s   app=submariner-globalnet,component=globalnet,controller-revision-hash=7cb9d85dc,pod-template-generation=1
submariner-lighthouse-agent-69f7b6b9f8-2dn7v     1/1     Running   0          29s   app=submariner-lighthouse-agent,component=submariner-lighthouse,pod-template-hash=69f7b6b9f8
submariner-lighthouse-coredns-69b5cc5746-5rl8w   1/1     Running   0          27s   app=submariner-lighthouse-coredns,component=submariner-lighthouse,pod-template-hash=69b5cc5746
submariner-lighthouse-coredns-69b5cc5746-777wx   1/1     Running   0          27s   app=submariner-lighthouse-coredns,component=submariner-lighthouse,pod-template-hash=69b5cc5746
submariner-operator-7b554d7c98-7bd28             1/1     Running   0          57s   name=submariner-operator,pod-template-hash=7b554d7c98
submariner-routeagent-28t2s                      1/1     Running   0          30s   app=submariner-routeagent,component=routeagent,controller-revision-hash=849768458d,pod-template-generation=1
submariner-routeagent-hl97d                      1/1     Running   0          30s   app=submariner-routeagent,component=routeagent,controller-revision-hash=849768458d,pod-template-generation=1
submariner-routeagent-kplwb                      1/1     Running   0          30s   app=submariner-routeagent,component=routeagent,controller-revision-hash=849768458d,pod-template-generation=1
submariner-routeagent-rcntb                      1/1     Running   0          30s   app=submariner-routeagent,component=routeagent,controller-revision-hash=849768458d,pod-template-generation=1

$ oc get clusters -n submariner-operator -o wide
NAME               AGE
nmanos-cluster-a   8s
nmanos-cluster-b   8s

$ oc describe cluster "nmanos-cluster-a" -n submariner-operator

Name:         nmanos-cluster-a
Namespace:    submariner-operator
Labels:       <none>
Annotations:  <none>
API Version:  submariner.io/v1
Kind:         Cluster
Metadata:
  Creation Timestamp:  2020-06-04T21:09:21Z
  Generation:          1
  Resource Version:    26686
  Self Link:           /apis/submariner.io/v1/namespaces/submariner-operator/clusters/nmanos-cluster-a
  UID:                 3cf75eb2-3425-470b-a7ef-ce4521a676eb
Spec:
  cluster_cidr:
    10.252.0.0/14
  cluster_id:  nmanos-cluster-a
  color_codes:
    blue
  global_cidr:
    169.254.0.0/19
  service_cidr:
    100.96.0.0/16
Events:  <none>

$ oc exec submariner-gateway-frtvd -n submariner-operator strongswan stroke statusall # | grep 'Security Associations \(1 up'

Status of IKE charon daemon (strongSwan 5.8.2, Linux 4.18.0-147.8.1.el8_1.x86_64, x86_64):
  uptime: 10 seconds, since Jun 04 21:09:21 2020
  malloc: sbrk 2830336, mmap 0, used 901296, free 1929040
  worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 1
  loaded plugins: charon pkcs11 tpm aesni aes des rc2 sha2 sha1 md4 md5 mgf1 random nonce x509 revocation constraints acert pubkey pkcs1 pkcs7 pkcs8 pkcs12 pgp dnskey sshkey pem openssl gcrypt fips-prf gmp curve25519 chapoly xcbc cmac hmac ctr ccm gcm drbg curl attr kernel-netlink resolve socket-default farp stroke vici updown eap-identity eap-sim eap-aka eap-aka-3gpp eap-aka-3gpp2 eap-md5 eap-gtc eap-mschapv2 eap-dynamic eap-radius eap-tls eap-ttls eap-peap xauth-generic xauth-eap xauth-pam xauth-noauth dhcp led duplicheck unity counters
Listening IP addresses:
  10.166.5.197
  10.254.2.1
  240.166.5.197
Connections:
submariner-cable-nmanos-cluster-b-10-1-0-13:  10.166.5.197...66.187.233.202  IKEv2
submariner-cable-nmanos-cluster-b-10-1-0-13:   local:  [18.191.75.213] uses pre-shared key authentication
submariner-cable-nmanos-cluster-b-10-1-0-13:   remote: [66.187.233.202] uses pre-shared key authentication
submariner-child-submariner-cable-nmanos-cluster-b-10-1-0-13:   child:  10.166.5.197/32 169.254.0.0/19 === 10.1.0.13/32 169.254.32.0/19 TUNNEL
Security Associations (0 up, 1 connecting):
submariner-cable-nmanos-cluster-b-10-1-0-13[2]: CONNECTING, 10.166.5.197[%any]...66.187.233.202[%any]
submariner-cable-nmanos-cluster-b-10-1-0-13[2]: IKEv2 SPIs: e5c4b4d50423c3c1_i* 0000000000000000_r
submariner-cable-nmanos-cluster-b-10-1-0-13[2]: Tasks active: IKE_VENDOR IKE_INIT IKE_NATD IKE_CERT_PRE IKE_AUTH IKE_CERT_POST IKE_CONFIG CHILD_CREATE IKE_AUTH_LIFETIME
mangelajo commented 4 years ago

That is correct; those are unrelated things.

"active" means that the gateway has gained "active" status via the HA election protocol, so it can start creating connections and publishing itself as an endpoint of the cluster.

At startup it's normal that the connections don't exist yet; they need a few seconds to be created (the other clusters must publish their endpoints on the broker, and this cluster must then receive and process those endpoints).

It would be a bug if this kept happening for a long time even after the remote clusters had published their endpoints, but at startup it is completely normal and expected. As I said, HA status and connection status are independent things.
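
As a rough illustration of treating the two fields independently, here is a minimal sketch that extracts each one from the text printed by `oc describe Gateway`. The function names `ha_status` and `connection_statuses` are hypothetical helpers, and the parsing assumes the field layout shown in the output earlier in this issue; it is not part of Submariner.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers read `oc describe Gateway` text on stdin.
# Function names are hypothetical; the field layout is assumed to match
# the output pasted in this issue.

# Print the HA status value (e.g. "active"), or nothing if the field is absent.
ha_status() {
  awk '$1 == "Ha" && $2 == "Status:" { print $3; exit }'
}

# Print one connection status per line (e.g. "connecting", "connected").
# Only lines of the form "Status: <value>" match, which skips the top-level
# "Status:" header (no value), "Status Message:", "Ha Status:" and
# "Status Failure:".
connection_statuses() {
  awk '$1 == "Status:" && NF == 2 { print $2 }'
}

# Example usage (requires cluster access, so it is left commented out):
#   out=$(oc describe Gateway -n submariner-operator)
#   printf '%s\n' "$out" | ha_status
#   printf '%s\n' "$out" | connection_statuses
```

With the first output in this issue, the helpers would report "active" and "connecting" respectively, which is the normal state shortly after startup.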

sridhargaddam commented 4 years ago

I agree with @mangelajo. Also, HA status does not map onto connection status when there are more than two clusters, since the connection to a single remote cluster could be lost while the connections to all other clusters remain intact.

manosnoam commented 4 years ago

Up until now I was watching the output of: $ oc exec $submariner_pod strongswan stroke statusall | grep 'Security Associations \(1 up'

This works great: it makes the automation wait until the strongSwan associations come up before continuing with the Submariner tests. It can take 3-10 minutes...

But now that strongSwan will no longer be the default cable driver, how should we watch for Submariner readiness?

manosnoam commented 4 years ago

@mangelajo, maybe in addition to "HA Status: active" you could show the number of connections established?

In another test I saw that the Gateway still shows "Status: connecting" even after strongSwan is up:

$ oc exec submariner-gateway-knc98 -n submariner-operator strongswan stroke statusall | grep 'Security Associations \(1 up'   

Status of IKE charon daemon (strongSwan 5.8.4, Linux 4.18.0-147.8.1.el8_1.x86_64, x86_64):
   uptime: 52 seconds, since Jun 11 09:17:26 2020
   malloc: sbrk 2830336, mmap 0, used 1000880, free 1829456
   worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 5
   loaded plugins: charon pkcs11 tpm aesni aes des rc2 sha2 sha1 md4 md5 mgf1 random nonce x509 revocation constraints acert pubkey pkcs1 pkcs7 pkcs8 pkcs12 pgp dnskey sshkey pem openssl gcrypt fips-prf gmp curve25519 chapoly xcbc cmac hmac ctr ccm gcm drbg newhope curl attr kernel-netlink resolve socket-default farp stroke vici updown eap-identity eap-sim eap-aka eap-aka-3gpp eap-aka-3gpp2 eap-md5 eap-gtc eap-mschapv2 eap-dynamic eap-radius eap-tls eap-ttls eap-peap xauth-generic xauth-eap xauth-pam xauth-noauth dhcp led duplicheck unity counters
 Listening IP addresses:
   10.166.16.21
   10.254.2.1
   240.166.16.21
 Connections:
 submariner-cable-nmanos-cluster-b-10-166-0-26:  10.166.16.21...66.187.233.202  IKEv2
 submariner-cable-nmanos-cluster-b-10-166-0-26:   local:  [3.21.232.238] uses pre-shared key authentication
 submariner-cable-nmanos-cluster-b-10-166-0-26:   remote: [66.187.233.202] uses pre-shared key authentication
 submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26:   child:  10.166.16.21/32 169.254.0.0/19 === 10.166.0.26/32 169.254.32.0/19 TUNNEL
 Security Associations (1 up, 1 connecting):
 submariner-cable-nmanos-cluster-b-10-166-0-26[3]: ESTABLISHED 1 second ago, 10.166.16.21[3.21.232.238]...66.187.233.202[66.187.233.202]
 submariner-cable-nmanos-cluster-b-10-166-0-26[3]: IKEv2 SPIs: e85d05d025f69511_i 80628238eb3932a7_r*, rekeying in 3 hours
 submariner-cable-nmanos-cluster-b-10-166-0-26[3]: IKE proposal: AES_GCM_16_128/PRF_HMAC_SHA2_256/MODP_2048
 submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26{1}:  INSTALLED, TUNNEL, reqid 1, ESP in UDP SPIs: cbcbcf94_i c5a99d8a_o
 submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26{1}:  AES_GCM_16_128, 0 bytes_i, 0 bytes_o, rekeying in 56 minutes
 submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26{1}:   10.166.16.21/32 169.254.0.0/19 === 10.166.0.26/32 169.254.32.0/19
 submariner-cable-nmanos-cluster-b-10-166-0-26[2]: CONNECTING, 10.166.16.21[%any]...66.187.233.202[%any]
 submariner-cable-nmanos-cluster-b-10-166-0-26[2]: IKEv2 SPIs: 99dc41f4c459ba83_i* 0000000000000000_r
 submariner-cable-nmanos-cluster-b-10-166-0-26[2]: Tasks active: IKE_VENDOR IKE_INIT IKE_NATD IKE_CERT_PRE IKE_AUTH IKE_CERT_POST IKE_CONFIG CHILD_CREATE IKE_AUTH_LIFETIME 

$ oc exec submariner-gateway-knc98 -n submariner-operator -- bash -c "swanctl --list-sas --uri unix:///var/run/charon.vici"

  submariner-cable-nmanos-cluster-b-10-166-0-26: #3, ESTABLISHED, IKEv2, e85d05d025f69511_i 80628238eb3932a7_r*
    local  '3.21.232.238' @ 10.166.16.21[4501]
    remote '66.187.233.202' @ 66.187.233.202[4501]
    AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
    established 2s ago, rekeying in 13540s
    submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26: #1, reqid 1, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
      installed 2s ago, rekeying in 3396s, expires in 3958s
      in  cbcbcf94,      0 bytes,     0 packets
      out c5a99d8a,      0 bytes,     0 packets
      local  10.166.16.21/32 169.254.0.0/19
      remote 10.166.0.26/32 169.254.32.0/19
  submariner-cable-nmanos-cluster-b-10-166-0-26: #2, CONNECTING, IKEv2, 99dc41f4c459ba83_i* 0000000000000000_r
    local  '%any' @ 10.166.16.21[501]
    remote '%any' @ 66.187.233.202[4501]
    active:  IKE_VENDOR IKE_INIT IKE_NATD IKE_CERT_PRE IKE_AUTH IKE_CERT_POST IKE_CONFIG CHILD_CREATE IKE_AUTH_LIFETIME

$ oc describe Gateway -n submariner-operator | grep "Ha Status:\s*active"

 Name:         ip-10-166-16-21
 Namespace:    submariner-operator
 Labels:       <none>
 Annotations:  update-timestamp: 1591867097
 API Version:  submariner.io/v1
 Kind:         Gateway
 Metadata:
   Creation Timestamp:  2020-06-11T09:17:27Z
   Generation:          2
   Resource Version:    24899
   Self Link:           /apis/submariner.io/v1/namespaces/submariner-operator/gateways/ip-10-166-16-21
   UID:                 887479e9-20ce-45cf-a086-7407d95b25cc
 Status:
   Connections:
     Endpoint:
       Backend:      strongswan
       cable_name:   submariner-cable-nmanos-cluster-b-10-166-0-26
       cluster_id:   nmanos-cluster-b
       Hostname:     default-cl1-9vgwj-worker-bq8rt
       nat_enabled:  true
       private_ip:   10.166.0.26
       public_ip:    66.187.233.202
       Subnets:
         169.254.32.0/19
     Status:          connecting
     Status Message:  Connecting to 66.187.233.202:4501
   Ha Status:         active
   Local Endpoint:
     Backend:      strongswan
     cable_name:   submariner-cable-nmanos-cluster-a-10-166-16-21
     cluster_id:   nmanos-cluster-a
     Hostname:     ip-10-166-16-21
     nat_enabled:  true
     private_ip:   10.166.16.21
     public_ip:    3.21.232.238
     Subnets:
       169.254.0.0/19
   Status Failure:  
   Version:         v0.4.0-rc2
 Events:            <none>

$ subctl info

     Discovered network details:
         Network plugin:  OpenShift
         Service CIDRs: [100.96.0.0/16]
         Cluster CIDRs: [10.252.0.0/14]
sridhargaddam commented 4 years ago

@manosnoam Recently, subctl was enhanced with additional commands that report the status of connections, endpoints, gateways, etc. Can you check whether "subctl show connections" suits your needs?

[sgaddam@localhost debug_job_Tests_10082020_0939]$ subctl show connections
GATEWAY                          CLUSTER            REMOTE IP      CABLE DRIVER   SUBNETS           STATUS
default-cl1-ff6px-worker-kjg2n   nmanos-cluster-b   10.166.0.205   libreswan      169.254.32.0/19   connected
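
For automation that should not depend on the cable driver, one possible approach is to poll this table until every row reports "connected". The sketch below assumes the output format shown above (a header row starting with GATEWAY and the status in the last column); `all_connected`, `wait_until` and `connections_ready` are hypothetical helper names, not subctl features.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, and the table layout is
# assumed to match the `subctl show connections` output shown above.

# Read the connections table on stdin. Succeed only if there is at least
# one data row and the last column of every data row is "connected".
all_connected() {
  awk '$1 == "GATEWAY" || NF == 0 { next }
       { rows++; if ($NF != "connected") bad++ }
       END { exit !(rows > 0 && bad == 0) }'
}

# wait_until TIMEOUT INTERVAL CMD...
# Re-run CMD every INTERVAL seconds until it succeeds, or fail once
# TIMEOUT seconds have elapsed.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    (( SECONDS >= deadline )) && return 1
    sleep "$interval"
  done
}

# Readiness probe combining the two helpers (needs subctl and cluster access).
connections_ready() {
  subctl show connections | all_connected
}

# Example: wait up to 10 minutes, checking every 10 seconds.
#   wait_until 600 10 connections_ready
```

The 600-second timeout is only an example value, chosen to cover the 3-10 minutes mentioned earlier in the thread.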

sridhargaddam commented 4 years ago

@manosnoam, regarding the Gateway still showing "Status: connecting" even after strongSwan is up: was this on the AWS cluster? If so, yes: on an AWS cluster with strongSwan we do see that the connection request from the on-prem cluster to AWS reaches the ESTABLISHED state, while the connection request initiated from AWS to the on-prem cluster stays in the CONNECTING state because it cannot be established. However, the datapath works fine, as it seems to use the ESTABLISHED connection.
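
For the strongSwan driver specifically, a check along these lines could treat the tunnel as usable once at least one security association is up, even while another attempt is still CONNECTING. This is a sketch that parses the "Security Associations (N up, M connecting)" summary line from the statusall output pasted earlier; `sa_up_count` and `tunnel_usable` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical. Both read the output of
# `strongswan stroke statusall` on stdin.

# Print the number of established SAs taken from the summary line
# "Security Associations (N up, M connecting):".
sa_up_count() {
  sed -n 's/^ *Security Associations (\([0-9][0-9]*\) up,.*/\1/p'
}

# Succeed when at least one SA is up, regardless of how many other
# attempts are still connecting.
tunnel_usable() {
  local up
  up=$(sa_up_count)
  [ "${up:-0}" -ge 1 ]
}

# Example usage (requires cluster access, so it is left commented out):
#   oc exec "$submariner_pod" -n submariner-operator -- \
#     strongswan stroke statusall | tunnel_usable
```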

CC: @skitt @nyechiel

manosnoam commented 4 years ago

Thanks @sridhargaddam. I've changed the order of the tests: first I check that the cable driver is ready, and then that the HA status is active. Works well; closing the issue.