Hi @tpantelis, can you help me with this effort? This is my first time writing e2e tests here, so some things are new to me. Test: "Submariner Engine Pod restarts (i.e., same node and no failover) on the West cluster."
As I see it, a code plan might be:
@tpantelis, can you take a look at this code plan and comment on whether it makes sense? Did I miss anything, or misread the existing test methods, etc.?
Thanks
@manosnoam Hey man, initial PoC test code plan here ^^
We already have HA tests in test/e2e/redundancy/gateway_failover.go. What you describe looks similar, if not identical, to what the current tests already do - unless I'm missing something?
@tpantelis, that's partially correct. The tests do restart the pod, but the health check is different: https://github.com/submariner-io/submariner/blob/1da2b9cf52c1c6527476dfe4ae9230cbe905bc33/test/e2e/redundancy/gateway_failover.go#L81 After AwaitGatewayFullyConnected we do a RunConnectivityTest, which does not give us pass/fail criteria for how much downtime we had during the disruption. So what I would do differently is:
So what you're looking to do is test latencies during fail-over. I'm not sure this is conducive to a single pass/fail test based on some arbitrary threshold. I think fail-over time largely depends on the environment and on potential latencies beyond our control, e.g. how long it takes k8s to restart a pod, how long it takes the new Endpoint to propagate across the broker to the remote cluster, how long it takes the underlying components to establish a new tunnel, etc. It seems this is better suited to a subctl command, similar to the recently added benchmark tests, where one can collect data over multiple runs.
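For illustration of the kind of measurement being discussed, here is a minimal standalone Go sketch that probes a remote endpoint at a fixed interval during a disruption and totals the time it was unreachable. The target address and interval are made-up values, and this is not how the existing e2e framework does its connectivity checks.

```go
// Minimal downtime-measurement sketch (assumed target and interval, not the
// framework's actual connectivity check).
package main

import (
	"fmt"
	"net"
	"time"
)

// measureDowntime probes target every interval for the given duration and
// returns the total time the target was unreachable.
func measureDowntime(target string, duration, interval time.Duration) time.Duration {
	var downtime time.Duration
	var downSince time.Time

	deadline := time.Now().Add(duration)
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", target, interval)
		if err != nil {
			// Probe failed: open an outage window if one isn't already open.
			if downSince.IsZero() {
				downSince = time.Now()
			}
		} else {
			conn.Close()
			// Probe succeeded: close any open outage window.
			if !downSince.IsZero() {
				downtime += time.Since(downSince)
				downSince = time.Time{}
			}
		}
		time.Sleep(interval)
	}

	if !downSince.IsZero() {
		downtime += time.Since(downSince)
	}
	return downtime
}

func main() {
	// Hypothetical service IP/port of a test pod in the remote cluster.
	d := measureDowntime("10.96.72.10:8080", 2*time.Minute, time.Second)
	fmt.Printf("observed downtime: %v\n", d)
}
```

Collecting that number over multiple runs, as suggested above, would avoid hard-coding a single pass/fail threshold.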
Adding Sridhar to advise here. Hi @sridhargaddam, what would be a good test to add from your HA failure scenarios doc? https://docs.google.com/spreadsheets/d/11MV7CZmt13D8D4WYJTaddzQxSrXFRrn3UG0LiQKFVwY/edit#gid=0 If we go with what Tom is saying (that it's problematic to test against a cut-off value for ping), then we already have part of scenario no. 1 covered? Meaning the disruption and connectivity check, we just have to add a connectivity check between different pods in the clusters as your doc specifies? WDYT? Thanks
> If we go with what Tom is saying (that it's problematic to test against a cut-off value for ping), then we already have part of scenario no. 1 covered?
Yes, Scenario-1 and Scenario-7 are kind of covered in the current e2e redundancy tests, where connectivity is validated (but not downtime). Also, I agree with what @tpantelis mentioned: the amount of downtime largely depends on the environment, as there are many factors involved that can impact the behavior. So such tests IMHO are more suitable as subctl tests (not as e2e).
> Meaning the disruption and connectivity check, we just have to add a connectivity check between different pods in the clusters as your doc specifies?
Well, we don't have to try each and every combination of Pod locations. The e2e redundancy tests try out the following combinations, which are good enough to tell whether the datapath is working or not.
There is no real advantage in validating the other two use-cases, like
as 3 and 4 are already covered by tests 1 and 2.
@sridhargaddam, so a worthwhile test case would be: restart the route-agent pod on gateway/non-gateway nodes, then check connectivity between gateway->gateway pods and non-GW->non-GW pods. Sounds right?
Ideally, we expect that when the route-agent pods are restarted the datapath is not disrupted, but we don't have such test cases today in our e2e. So what we can do is restart the route-agent pods on the gw/non-gw nodes (as part of the redundancy tests) and ensure that connectivity is not broken.
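For what it's worth, here is a very rough outline of how such a test could be structured, following the Ginkgo style the e2e tests already use. The two helper functions and the node name are hypothetical placeholders, not existing shipyard/framework APIs; a real implementation would use the framework's client and connectivity helpers.

```go
// Rough Ginkgo outline only; helper functions below are hypothetical stubs.
package redundancy_test

import (
	. "github.com/onsi/ginkgo"
)

var _ = Describe("[redundancy] Route agent restart tests", func() {
	Context("when the route-agent pod on a gateway node is restarted", func() {
		It("should not disrupt cross-cluster connectivity", func() {
			By("Deleting the route-agent pod on the gateway node")
			restartRouteAgentPodOn("west-gateway-node") // hypothetical helper

			By("Verifying connectivity between pods across clusters")
			verifyCrossClusterConnectivity() // hypothetical helper
		})
	})
})

// Stubs so the outline compiles; the real implementations would drive the
// pod deletion and reuse the existing connectivity-check helpers.
func restartRouteAgentPodOn(nodeName string) {}

func verifyCrossClusterConnectivity() {}
```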
Moving forward with: 3, 6
Disruption method:
"Route-agent Pod restarts on Gateway node of the West cluster."
To restart the route-agent Pod, we can use the command:
`kubectl delete pod`
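As a side note, the pod deletion step could also be driven programmatically with client-go, roughly along these lines. The kubeconfig path, namespace, label selector, and gateway node name below are assumptions for illustration and would need to match the actual deployment.

```go
// Sketch of deleting the route-agent pod on the gateway node via client-go.
// Namespace, label selector, node name, and kubeconfig path are assumptions.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/west-cluster.kubeconfig")
	if err != nil {
		log.Fatal(err)
	}

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	ns := "submariner-operator"             // assumed namespace
	selector := "app=submariner-routeagent" // assumed route-agent label

	pods, err := client.CoreV1().Pods(ns).List(context.TODO(),
		metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		log.Fatal(err)
	}

	for _, pod := range pods.Items {
		// Restrict the restart to the pod running on the gateway node,
		// whose name is assumed to be known from the test setup.
		if pod.Spec.NodeName != "west-gateway-node" {
			continue
		}

		fmt.Printf("deleting route-agent pod %s on node %s\n", pod.Name, pod.Spec.NodeName)

		if err := client.CoreV1().Pods(ns).Delete(context.TODO(), pod.Name,
			metav1.DeleteOptions{}); err != nil {
			log.Fatal(err)
		}
	}
}
```

Deleting the pod relies on the route-agent DaemonSet to recreate it, which mirrors what the `kubectl delete pod` command above does.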
@nyechiel hey, does removing this from the backlog mean it's no longer relevant? I intend to work on this if no other important tasks come up.
I cleaned up the v0.8 board as we wrap up this release. We haven't done planning for our next sprint yet, but if you are planning to look at this, let's reflect it here: https://github.com/orgs/submariner-io/projects/12
@nyechiel that's good, thanks
@sridhargaddam, @tpantelis Hi guys, can you take a look at both WIP commits for this test (https://github.com/submariner-io/shipyard/pull/394 and https://github.com/submariner-io/submariner/pull/1061) and tell me if the direction is good? I would like to start functional testing for these commits and would appreciate your opinion, thanks
Updated the shipyard PR: https://github.com/submariner-io/shipyard/pull/405
Updated the submariner PR: https://github.com/submariner-io/submariner/pull/1084
@tpantelis thanks for the help here :)
Purpose of this issue: to automate, i.e. add to e2e (Go), a single HA failure test from the scenarios here: https://docs.google.com/spreadsheets/d/11MV7CZmt13D8D4WYJTaddzQxSrXFRrn3UG0LiQKFVwY/edit#gid=0
More specifically: scenario no. 1 of the setup: KIND Setup: Vanilla Submariner + Libreswan Driver + without Lighthouse
Sridhar gave me the manual steps, which are as follows: