Cleanup of resources causes errors

jetzlstorfer commented 4 years ago

Cleanup of the experiment is throwing errors.

The errors are thrown in the TestFinished event handler, where we delete some resources.

2020-10-01T12:20:45.264698332Z 2020/10/01 12:20:45 Deleting chaos experiment resources
2020-10-01T12:20:45.352877445Z 2020/10/01 12:20:45 Error execute kubectl delete command: Error executing command kubectl delete -f litmus/experiment.yaml: exit status 1
2020-10-01T12:20:45.352914192Z Error from server (NotFound): error when deleting "litmus/experiment.yaml": chaosengines.litmuschaos.io "carts-chaos" not found

We need to investigate why the error is thrown and fix it.

ksatchit commented 4 years ago

Refer: https://github.com/keptn-sandbox/litmus-service/issues/2#issuecomment-702098896

Right now, the testsFinishedEventHandler is invoked twice: first for the event sent after the chaos test, then for the one sent after the jmeter test.

The ChaosEngine is already removed by the time jmeter sends the testFinishedEvent, leading to the described error log. This can be seen in the litmus-service pod logs:

Completion of chaos experiment:

2020/10/07 09:43:10 Chaos experiment is completed
2020/10/07 09:43:10 ChaosExperiment Verdict: Pass
2020/10/07 09:43:10 Final Result: pass

Handle TestFinishedEvent sent by Litmus service

2020/10/07 09:43:10 gotEvent(sh.keptn.events.tests-finished): e204ed5e-e6fc-44e3-8825-622fac89294a - df64d289-c733-489b-be1f-3f970aaf1ecc
2020/10/07 09:43:10 Processing Test Finished Event
2020/10/07 09:43:10 Handling Tests Finished Event: df64d289-c733-489b-be1f-3f970aaf1ecc
2020/10/07 09:43:10 Deleting chaos experiment resources

Handle TestFinishedEvent sent by Jmeter service

Jmeter service log

{"timestamp":"2020-10-07T09:54:23.647352508Z","logLevel":"DEBUG","message":"Successfully executed JMeter test. Project: litmus, Service: carts, Stage: chaos, TestStrategy: performance"}
{"timestamp":"2020-10-07T09:54:23.647449482Z","logLevel":"INFO","message":"Tests for performance with status = true.Project: litmus, Service: carts, Stage: chaos, TestStrategy: performance"}

Litmus service log

2020/10/07 09:54:23 gotEvent(sh.keptn.events.tests-finished): e204ed5e-e6fc-44e3-8825-622fac89294a - 9cf4d689-4c7f-4172-8430-b49c9edddaab
2020/10/07 09:54:23 Processing Test Finished Event
2020/10/07 09:54:23 Handling Tests Finished Event: 9cf4d689-4c7f-4172-8430-b49c9edddaab
2020/10/07 09:54:23 Deleting chaos experiment resources
2020/10/07 09:54:24 Error execute kubectl delete command: Error executing command kubectl delete -f litmus/experiment.yaml: exit status 1
Error from server (NotFound): error when deleting "litmus/experiment.yaml": chaosengines.litmuschaos.io "carts-chaos" not found

ksatchit commented 4 years ago

Current thoughts and direction regarding handling of `testFinished` events

It is agreed that the chaosengine deletion should be performed as part of the testFinishedHandler rather than earlier (i.e., in the deploymentFinishedEventHandler, where we wait for chaos completion). The reason is that we want cleanup to occur after the perf/other tests are completed (this is the real end of the test), regardless of how long the chaos runs. The chaos may complete much earlier than the perf tests, or it may run up until that point (either via a combination of values for TOTAL_CHAOS_DURATION and CHAOS_INTERVAL, or via some new capability in the experiment that allows it to run indefinitely).

(Note: The evaluation automatically covers the period from the start of the perf test to its end, so it is not mandatory to have chaos running throughout the perf test; the evaluation will factor in the duration when chaos occurs. This is closer to real-world chaos, say a random kill, as against continuous chaos, which is more of a stress test.)
The removal of the chaosengine is not deemed critical at this point, as a re-application of the engine will work anyway (as long as the previous engine's status is set to completed or stopped). Even a failed deletion is an error we can ignore for now, i.e., it isn't a blocker for proceeding with further tasks in the integration, such as documentation.

Having agreed upon the above, the options we have include:

(a). Using an ENV var (for example, SEND_TEST_FINISHED_EVENT) in the litmus-service deployment that decides whether the litmus service generates the event. The default can be false, which means that only the testFinishedEvent from jmeter (or any other perf service) is handled by removing the chaosengine. true would indicate cleaning up only when the litmus-service sends the testFinishedEvent.
(b). Bringing up the litmus-service deployment with an ENV var that points to the event source we consider the primary test, so that only a testFinishedEvent from this source is handled.
(c). As part of the testFinishedEventHandler, checking whether the chaosengine resource is still present and attempting deletion only if it is.
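To make option (a) concrete, here is a minimal Go sketch of how such a flag might be read. Only the variable name SEND_TEST_FINISHED_EVENT and its default of false come from the discussion above; the function name, the injected lookup function, and the choice to treat unparsable values as false are assumptions made for illustration, not the litmus-service implementation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envSendTestFinished is the variable name proposed in option (a).
const envSendTestFinished = "SEND_TEST_FINISHED_EVENT"

// sendTestFinishedEnabled (hypothetical helper) reports whether the
// litmus-service should emit its own tests-finished event.
// Unset or unparsable values fall back to false, the default suggested
// in the discussion. The lookup function is injected so the logic can be
// exercised without touching the process environment.
func sendTestFinishedEnabled(lookup func(string) string) bool {
	enabled, err := strconv.ParseBool(lookup(envSendTestFinished))
	if err != nil {
		return false
	}
	return enabled
}

func main() {
	fmt.Println("send tests-finished event:", sendTestFinishedEnabled(os.Getenv))
}
```

Falling back to false on a malformed value keeps the jmeter-driven cleanup as the safe default; a stricter variant could refuse to start instead.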

Current Choice

(b) can be ruled out because tests can be really varied and the keptn control plane allows for various tests in the pipeline (added dynamically?). Also, though the litmus-service deployment is manual today, it might be automated or graduate to a core service, and hence cannot be brought up with reference to a "particular" event source or primary test.
(c) can be ruled out as we want to error out on unavailability of the chaos resource - to catch cases of unintended removal*. The engine resource also has enough data packed in - in the engineStatus.experimentStatuses section (this is also being enhanced every release) that may be useful (say, the testsFinishedEventHandler may be enhanced tomorrow to read off this and act).

(* - we have seen some cases in which the CRD is in a terminating state or has been removed on its own, which is undesirable. However, this seems to be an unusual corner case. The litmus team is still investigating this.)

But what happens when there are multiple "primary/perf" tests raising the same event? We might need to maintain state regarding this - i.e., record successful/legitimate removal of the engine, so that subsequent events can be handled accordingly. This is not a current priority though.
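The kind of state mentioned here could look roughly like the following Go sketch. It is purely illustrative: the type, the method names, and the use of the engine name as the key are all assumptions, and an in-memory map would not survive a pod restart.

```go
package main

import (
	"fmt"
	"sync"
)

// removalTracker (hypothetical) remembers which chaosengines have already
// been removed legitimately, so later tests-finished events can skip the
// delete instead of failing with NotFound.
type removalTracker struct {
	mu      sync.Mutex
	removed map[string]bool
}

func newRemovalTracker() *removalTracker {
	return &removalTracker{removed: make(map[string]bool)}
}

// alreadyRemoved reports whether a removal was recorded for the engine.
func (t *removalTracker) alreadyRemoved(engine string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.removed[engine]
}

// recordRemoval marks the engine as removed after a successful delete.
func (t *removalTracker) recordRemoval(engine string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.removed[engine] = true
}

func main() {
	tracker := newRemovalTracker()
	// Two tests-finished events arrive for the same engine.
	for _, event := range []string{"first", "second"} {
		if tracker.alreadyRemoved("carts-chaos") {
			fmt.Println(event, "event: engine already removed, skipping delete")
			continue
		}
		// The actual delete would happen here; it is assumed to succeed.
		tracker.recordRemoval("carts-chaos")
		fmt.Println(event, "event: engine removed and recorded")
	}
}
```

An engine that is missing without a recorded removal would still surface as an error, which preserves the ability to catch unintended removals.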
This leaves us with (a):

(a) makes sense under two conditions (mainly for the true case): (i) if we want chaos as the only test (no other perf tests), and (ii) if we want chaos to outlive the primary perf test and thereby handle the testFinished event only later.

(There was another option discussed which we haven't elaborated on above: ignoring the event if it is generated by the litmus-service, while handling the rest. However, this is more or less equivalent to not generating the testFinishedEvent at all after chaos, so we may not want to take this direction. We might go with (a) for the time being.)

ksatchit commented 3 years ago

We took the ENV approach initially ("SEND_TEST_FINISHED_EVENT" set to "false" to prevent removal of the chaosengine by the litmus-service and react to the test finished event of jmeter alone).

However, this flow has since been modified by the refactor carried out to support Keptn 0.8.0: we now ignore testFinishedEvents from the litmus-service, thereby retaining the chaosengine for a deferred removal upon jmeter/other test completion.

If, however, the testFinishedEvent arrives well before the chaos experiment ends, the results are shown as "Aborted" and we send out a Warning.
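The resulting decision logic can be sketched in Go as follows. This is not the actual litmus-service code: the source identifier "litmus-service", the action names, and the chaosCompleted flag are assumptions made for illustration, and the sketch takes no position on what else happens alongside the warning.

```go
package main

import (
	"fmt"
	"strings"
)

// action is what the handler does with an incoming tests-finished event.
type action int

const (
	actionIgnore  action = iota // event came from the litmus-service itself
	actionCleanup               // chaos is done: remove the chaosengine
	actionWarn                  // chaos still running: result shows "Aborted", send a warning
)

// ownSource is an assumed identifier for events emitted by the litmus-service.
const ownSource = "litmus-service"

// decide (hypothetical helper) maps an event's source and the state of the
// chaos experiment to one of the three actions described above.
func decide(eventSource string, chaosCompleted bool) action {
	if strings.Contains(eventSource, ownSource) {
		return actionIgnore
	}
	if !chaosCompleted {
		return actionWarn
	}
	return actionCleanup
}

func main() {
	names := map[action]string{
		actionIgnore:  "ignore",
		actionCleanup: "cleanup",
		actionWarn:    "warn",
	}
	fmt.Println(names[decide("litmus-service", true)])
	fmt.Println(names[decide("jmeter-service", true)])
	fmt.Println(names[decide("jmeter-service", false)])
}
```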
