Closed manosnoam closed 4 years ago
That is correct, those are unrelated things.
"active" means that the gateway has gained "active" status via the HA election protocol, so it can start creating connections and publishing itself as an endpoint of the cluster.
At the start it's normal that the connections don't exist yet; they need a few seconds to be created (until the other clusters publish their endpoints on the broker, and this cluster receives and processes those endpoints).
It would be a bug if this kept happening for a long time even after the remote clusters have published their endpoints, but at the start it is completely normal and expected behavior. As I said, HA status and connection status are independent things.
I agree with @mangelajo. Also, HA status does not fit into connection status when we have more than 2 clusters, as there could be a connection loss with a single remote cluster while the connections to all other clusters are still intact.
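To make the distinction concrete, here is a minimal sketch that reports the HA status and the per-connection status separately. It assumes the text layout of the `oc describe Gateway` output shown later in this thread ("Ha Status:" and one "Status:" line per connection), and it assumes "connected" is the value a healthy connection reports. The function name `summarize_gateway` is hypothetical.

```shell
# Hypothetical helper: reads `oc describe Gateway` text on stdin and prints
# the HA status plus how many connections report "connected".
# Assumption: field labels match the describe output pasted in this thread.
summarize_gateway() {
  awk '
    { sub(/^[ \t]+/, "") }                 # drop indentation
    /^Ha Status:/ { ha = $3 }              # e.g. "Ha Status: active"
    /^Status:[ \t]+[^ \t]/ {               # "Status: <value>", not the bare "Status:" header
      total++
      if ($2 == "connected") up++
    }
    END { printf "ha=%s connected=%d total=%d\n", ha, up, total }
  '
}
```

For example, `oc describe Gateway -n submariner-operator | summarize_gateway` would print one summary line. An "active" gateway with `connected=0` is then visibly a different condition from a gateway that is not active.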
Up until now I was watching the output of:
$ oc exec $submariner_pod strongswan stroke statusall | grep 'Security Associations (1 up'
This works great, and makes the automation wait until the strongSwan security associations come up before continuing with the Submariner tests. It could take 3-10 minutes...
But now that strongSwan will no longer be the default cable driver, how should we watch for Submariner readiness?
@mangelajo, maybe in addition to "HA Status active", you could show the number of connections established?
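The waiting step described above is independent of the cable driver and can be sketched as a generic polling helper. This is only an illustration: `wait_until` and its arguments are hypothetical names, and the readiness command itself is whatever check fits the driver in use.

```shell
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
  _wu_timeout=$1
  _wu_interval=$2
  shift 2
  _wu_elapsed=0
  while ! "$@"; do
    # Give up once the time spent sleeping reaches the timeout.
    if [ "$_wu_elapsed" -ge "$_wu_timeout" ]; then
      return 1
    fi
    sleep "$_wu_interval"
    _wu_elapsed=$((_wu_elapsed + _wu_interval))
  done
  return 0
}
```

With the 3-10 minute startup time mentioned above, a call such as `wait_until 600 10 check_cable_driver` would fit, where `check_cable_driver` is a placeholder for a function wrapping the actual readiness command.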
In another test I saw that the Gateway still shows "Status: connecting", even after StrongSwan is up:
$ oc exec submariner-gateway-knc98 -n submariner-operator strongswan stroke statusall | grep 'Security Associations (1 up'
Status of IKE charon daemon (strongSwan 5.8.4, Linux 4.18.0-147.8.1.el8_1.x86_64, x86_64):
uptime: 52 seconds, since Jun 11 09:17:26 2020
malloc: sbrk 2830336, mmap 0, used 1000880, free 1829456
worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 5
loaded plugins: charon pkcs11 tpm aesni aes des rc2 sha2 sha1 md4 md5 mgf1 random nonce x509 revocation constraints acert pubkey pkcs1 pkcs7 pkcs8 pkcs12 pgp dnskey sshkey pem openssl gcrypt fips-prf gmp curve25519 chapoly xcbc cmac hmac ctr ccm gcm drbg newhope curl attr kernel-netlink resolve socket-default farp stroke vici updown eap-identity eap-sim eap-aka eap-aka-3gpp eap-aka-3gpp2 eap-md5 eap-gtc eap-mschapv2 eap-dynamic eap-radius eap-tls eap-ttls eap-peap xauth-generic xauth-eap xauth-pam xauth-noauth dhcp led duplicheck unity counters
Listening IP addresses:
10.166.16.21
10.254.2.1
240.166.16.21
Connections:
submariner-cable-nmanos-cluster-b-10-166-0-26: 10.166.16.21...66.187.233.202 IKEv2
submariner-cable-nmanos-cluster-b-10-166-0-26: local: [3.21.232.238] uses pre-shared key authentication
submariner-cable-nmanos-cluster-b-10-166-0-26: remote: [66.187.233.202] uses pre-shared key authentication
submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26: child: 10.166.16.21/32 169.254.0.0/19 === 10.166.0.26/32 169.254.32.0/19 TUNNEL
Security Associations (1 up, 1 connecting):
submariner-cable-nmanos-cluster-b-10-166-0-26[3]: ESTABLISHED 1 second ago, 10.166.16.21[3.21.232.238]...66.187.233.202[66.187.233.202]
submariner-cable-nmanos-cluster-b-10-166-0-26[3]: IKEv2 SPIs: e85d05d025f69511_i 80628238eb3932a7_r*, rekeying in 3 hours
submariner-cable-nmanos-cluster-b-10-166-0-26[3]: IKE proposal: AES_GCM_16_128/PRF_HMAC_SHA2_256/MODP_2048
submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26{1}: INSTALLED, TUNNEL, reqid 1, ESP in UDP SPIs: cbcbcf94_i c5a99d8a_o
submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26{1}: AES_GCM_16_128, 0 bytes_i, 0 bytes_o, rekeying in 56 minutes
submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26{1}: 10.166.16.21/32 169.254.0.0/19 === 10.166.0.26/32 169.254.32.0/19
submariner-cable-nmanos-cluster-b-10-166-0-26[2]: CONNECTING, 10.166.16.21[%any]...66.187.233.202[%any]
submariner-cable-nmanos-cluster-b-10-166-0-26[2]: IKEv2 SPIs: 99dc41f4c459ba83_i* 0000000000000000_r
submariner-cable-nmanos-cluster-b-10-166-0-26[2]: Tasks active: IKE_VENDOR IKE_INIT IKE_NATD IKE_CERT_PRE IKE_AUTH IKE_CERT_POST IKE_CONFIG CHILD_CREATE IKE_AUTH_LIFETIME
$ oc exec submariner-gateway-knc98 -n submariner-operator -- bash -c "swanctl --list-sas --uri unix:///var/run/charon.vici"
submariner-cable-nmanos-cluster-b-10-166-0-26: #3, ESTABLISHED, IKEv2, e85d05d025f69511_i 80628238eb3932a7_r*
local '3.21.232.238' @ 10.166.16.21[4501]
remote '66.187.233.202' @ 66.187.233.202[4501]
AES_GCM_16-128/PRF_HMAC_SHA2_256/MODP_2048
established 2s ago, rekeying in 13540s
submariner-child-submariner-cable-nmanos-cluster-b-10-166-0-26: #1, reqid 1, INSTALLED, TUNNEL-in-UDP, ESP:AES_GCM_16-128
installed 2s ago, rekeying in 3396s, expires in 3958s
in cbcbcf94, 0 bytes, 0 packets
out c5a99d8a, 0 bytes, 0 packets
local 10.166.16.21/32 169.254.0.0/19
remote 10.166.0.26/32 169.254.32.0/19
submariner-cable-nmanos-cluster-b-10-166-0-26: #2, CONNECTING, IKEv2, 99dc41f4c459ba83_i* 0000000000000000_r
local '%any' @ 10.166.16.21[501]
remote '%any' @ 66.187.233.202[4501]
active: IKE_VENDOR IKE_INIT IKE_NATD IKE_CERT_PRE IKE_AUTH IKE_CERT_POST IKE_CONFIG CHILD_CREATE IKE_AUTH_LIFETIME
$ oc describe Gateway -n submariner-operator | grep "Ha Status:\s*active"
Name: ip-10-166-16-21
Namespace: submariner-operator
Labels: <none>
Annotations: update-timestamp: 1591867097
API Version: submariner.io/v1
Kind: Gateway
Metadata:
Creation Timestamp: 2020-06-11T09:17:27Z
Generation: 2
Resource Version: 24899
Self Link: /apis/submariner.io/v1/namespaces/submariner-operator/gateways/ip-10-166-16-21
UID: 887479e9-20ce-45cf-a086-7407d95b25cc
Status:
Connections:
Endpoint:
Backend: strongswan
cable_name: submariner-cable-nmanos-cluster-b-10-166-0-26
cluster_id: nmanos-cluster-b
Hostname: default-cl1-9vgwj-worker-bq8rt
nat_enabled: true
private_ip: 10.166.0.26
public_ip: 66.187.233.202
Subnets:
169.254.32.0/19
Status: connecting
Status Message: Connecting to 66.187.233.202:4501
Ha Status: active
Local Endpoint:
Backend: strongswan
cable_name: submariner-cable-nmanos-cluster-a-10-166-16-21
cluster_id: nmanos-cluster-a
Hostname: ip-10-166-16-21
nat_enabled: true
private_ip: 10.166.16.21
public_ip: 3.21.232.238
Subnets:
169.254.0.0/19
Status Failure:
Version: v0.4.0-rc2
Events: <none>
$ subctl info
Discovered network details:
Network plugin: OpenShift
Service CIDRs: [100.96.0.0/16]
Cluster CIDRs: [10.252.0.0/14]
@manosnoam Recently, subctl was enhanced to support some additional commands where we can get the status of connections/endpoints/gateways etc. Can you check if "subctl show connections" is suitable for your needs?
[sgaddam@localhost debug_job_Tests_10082020_0939]$ subctl show connections
GATEWAY                          CLUSTER            REMOTE IP      CABLE DRIVER   SUBNETS           STATUS
default-cl1-ff6px-worker-kjg2n   nmanos-cluster-b   10.166.0.205   libreswan      169.254.32.0/19   connected
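For automation, the table above can be reduced to a pass/fail check. The sketch below assumes the layout shown here: a header line starting with GATEWAY, and STATUS as the last column with the value "connected" for a healthy connection. The function name `all_connected` is hypothetical.

```shell
# Hypothetical helper: reads `subctl show connections` output on stdin and
# succeeds only if there is at least one row and every row ends in "connected".
# Assumption: STATUS is the last whitespace-separated column.
all_connected() {
  awk '
    $1 == "GATEWAY" { next }   # skip the header line
    NF == 0 { next }           # skip blank lines
    { rows++; if ($NF != "connected") bad++ }
    END {
      if (rows > 0 && bad == 0) exit 0
      exit 1
    }
  '
}
```

It would be used as `subctl show connections | all_connected`. Because it checks every row, it also covers the case of more than two clusters, where a single remote connection may be down while the others are intact.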
Regarding the Gateway still showing "Status: connecting" even after strongSwan is up: was this on the AWS cluster? If so, yes, on an AWS cluster with strongSwan we do see that the connection request from the on-prem cluster to AWS reaches the ESTABLISHED state, while the connection request initiated from AWS to the on-prem cluster stays in the CONNECTING state because it cannot be established. However, the datapath works fine, as it seems to use the ESTABLISHED connection.
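The situation described here (one ESTABLISHED and one CONNECTING IKE SA for the same cable) can be summarized per cable. This is a rough sketch that assumes the `swanctl --list-sas` line format pasted earlier in the thread, where each IKE SA line reads `<name>: #<id>, <STATE>, IKEv2, ...`. The function name `established_cables` is hypothetical.

```shell
# Hypothetical helper: reads `swanctl --list-sas` output on stdin and prints
# each cable name followed by "up" if at least one of its IKE SAs is
# ESTABLISHED, otherwise "down".
# Assumption: IKE SA lines look like "<name>: #<id>, <STATE>, IKEv2, ...".
established_cables() {
  awk -F', ' '
    $3 ~ /^IKEv[12]$/ {                    # only IKE SA lines, not child SAs
      name = $1
      sub(/^[ \t]+/, "", name)             # drop indentation
      sub(/: #[0-9]+$/, "", name)          # drop the ": #<id>" suffix
      if (!(name in seen)) { seen[name] = 0; order[++n] = name }
      if ($2 == "ESTABLISHED") seen[name] = 1
    }
    END {
      for (i = 1; i <= n; i++)
        print order[i], (seen[order[i]] ? "up" : "down")
    }
  '
}
```

Fed the `swanctl` output from the earlier comment, this reports the cable as "up", even though a second SA for the same cable is still CONNECTING.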
CC: @skitt @nyechiel
Thanks @sridhargaddam. I've changed the test order, so first I test that the cable driver is ready, and then I check that the HA status is active. It works well, closing the issue.
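The reordered test flow can be sketched as running named checks strictly in sequence and stopping at the first failure. The helper `check_in_order` and the check names in the usage note are hypothetical; the actual tests are not shown in this thread.

```shell
# Hypothetical helper: run each named check (a function or command) in the
# order given, and stop at the first one that fails.
check_in_order() {
  for _check in "$@"; do
    if ! "$_check"; then
      echo "FAILED: $_check"
      return 1
    fi
    echo "OK: $_check"
  done
  return 0
}
```

A call such as `check_in_order cable_driver_ready ha_status_active` would mirror the order described above, assuming both names are defined as shell functions wrapping the respective checks.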