Add 1 HA Failure Scenario to E2E testing (POC)

pinikomarov commented 4 years ago

Purpose of this issue : To automate - add to e2e (GO) a single HA failure test from the scenarios here : https://docs.google.com/spreadsheets/d/11MV7CZmt13D8D4WYJTaddzQxSrXFRrn3UG0LiQKFVwY/edit#gid=0

More specifically : scenario no. 1 of : setup : KIND Setup: Vanilla Submariner + Libreswan Driver + without Lighthouse

Sridhar's gave me the manual steps which are as follows :

Okay, I did the following manually.

Deploy clusters using KIND. In submariner repo, execute the command "make deploy using=libreswan"

The above command creates three clusters - cluster1 (broker), cluster2, and cluster3.

I used cluster2 as West cluster and cluster3 as East cluster (sample topology diagram in the [*])

The above steps are common for all the scenarios. Now let's take the scenario "Submariner Engine Pod restarts (i.e., same node and no failover) on the West cluster."

For this, I deploy a few fedora pods on cluster2 and cluster3. kubectl run fedora-pod -i --tty --image fedora -- /bin/bash

Before triggering any failure, I ensure that from the fedora pod on cluster2 I'm able to ping to the fedora pod on cluster3 (install ping utility if its missing).

I continue to run ping command as shown below to understand the impact of datapath loss. i.e., on fedora pod on Cluster2, run the following command ping -D -i 0.1

Now identify the submariner-gateway pod on cluster2 (say it's named as submariner-gateway-8pj66) and delete it (to simulate restart scenario). kubectl delete pod submariner-gateway-8pj66 -n submariner-operator

Wait for the submariner-gateway pod to restart and enter Running state. Look at the output of "ping -D ...." in your fedora pod and measure the amount of time where ping did not get any response. As you can see, this was a manual test.

pinikomarov commented 4 years ago

Hi @tpantelis can you help me on this effort ? first time writing e2e tests here so some things are new to me.. test : "Submariner Engine Pod restarts (i.e., same node and no failover) on the West cluster.)"

As I see it a code plan might be :

f.AwaitGatewayFullyConnected - implemented in shipyard
Removing submariner.io/gateway label from the Active Gateway node <- not implemented
Issuing kubectl delete pod -n submariner-operator" - implemented in shipyard/nodes.go :func (f *Framework) DeletePod(
ping between two pods on clusters to the gateway pod and check pass fail criteria <- not implemented

@tpantelis , can you take a look at this code plan and comment if it makes sense ? did I miss anything or did I misread existing test methods , etc...

Thanks

@manosnoam Hey man , initial poc test code plan here ^^

tpantelis commented 4 years ago

We already have HA tests in test/e2e/redundancy/gateway_failover.go. What you describe looks similar/same as to what the current tests already do - unless I'm missing something?

pinikomarov commented 4 years ago

@tpantelis , that's partially correct, the tests do restart the pod, but the health check is different: https://github.com/submariner-io/submariner/blob/1da2b9cf52c1c6527476dfe4ae9230cbe905bc33/test/e2e/redundancy/gateway_failover.go#L81 After the AwaitGatewayFullyConnected we do a RunConnectivityTest which does not give us pass fail test criteria about how much down time we had during the disruption. So what I would do differently is :

add is a ping and check for how long the ping was not hitting anything between pods, and most probably add a cut-off after which too much ping loss would fail the test.
the ping would be between a pod on the cluster and not from gateway node to gateway node @sridhargaddam correct me if I'm wrong here please, thanks

tpantelis commented 4 years ago

So what you're looking to do is to test latencies during fail-over. I'm not sure this is conducive as a single pass/fail test based on some arbitrary threshold. I think fail-over time largely depends on the environment and potential latencies beyond our control, eg how long it takes k8s to restart a pod, how long it takes the new Endpoint to propagate across the broker to the remote cluster, how long it takes for the underlying components to establish a new tunnel etc. It seems this is more conducive as a subctl command similar to the benchmark tests recently added where one can collect data over multiple runs.

pinikomarov commented 4 years ago

adding Sridhar to advice here .. Hi @sridhargaddam , what would be a good test to add from your HA failure scenarios doc ? : https://docs.google.com/spreadsheets/d/11MV7CZmt13D8D4WYJTaddzQxSrXFRrn3UG0LiQKFVwY/edit#gid=0 If we are going with what Tom is saying , (problematic to test cut-off value of ping), then we already have part scenario no. 1 covered ? meaning the disruption and connectivity check, we just have to add connectivity check between different pods in the clusters as your doc specifies ? WDYT ? Thanks

sridhargaddam commented 4 years ago

If we are going with what Tom is saying , (problematic to test cut-off value of ping), then we already have part scenario no. 1 covered ?

Yes, Scenario-1 and Scenario-7 are kind of covered in the current e2e redundancy tests where the connectivity is validated (not downtime). Also, I agree with what @tpantelis mentioned. The amount of downtime largely depends on the environment as there are many factors involved that can impact the behavior. So such tests IMHO are more suitable as subctl tests (not as e2e).

meaning the disruption and connectivity check, we just have to add connectivity check between different pods in the clusters as your doc specifies ?

Well, we don't have to try each and every combination of Pod location. The e2e redundancy tests try out the following combinations which are good enough to say if the datapath is working or not.

Verifies TCP connectivity from gateway node Pod on WestCluster to gateway node Pod on EastCluster
Verifies TCP connectivity from non-gateway node Pod on WestCluster to non-gateway node Pod on EastCluster

There is no real advantage if we validate the other two use-cases like

Verifies TCP connectivity from gateway node Pod on WestCluster to non-gateway node Pod on EastCluster
Verifies TCP connectivity from non-gateway node Pod on WestCluster to gateway node Pod on EastCluster

as 3 and 4 will be covered in 1 and 2 tests.

pinikomarov commented 4 years ago

@sridhargaddam , so a worthwhile test case would be : route-agent pod restart on gateway/non-gateway nodes-> then check connectivity between gateway->gatewaypods and non-GW->non-GW pods Sounds right ?

sridhargaddam commented 4 years ago

@sridhargaddam , so a worthwhile test case would be : route-agent pod restart on gateway/non-gateway nodes-> then check connectivity between gateway->gatewaypods and non-GW->non-GW pods Sounds right ?

Ideally, we expect that when route-agent pods are restarted the datapath is not disrupted. But we don't have such test-cases today in our e2e. So what we can do is that restart route-agent pods on gw/non-gw (as part of redundancy tests) and ensure that connectivity is not broken.