grafana / k6-operator

An operator for running distributed k6 tests.
Apache License 2.0

Make cloud output test runs resilient to operator's restarts #108

Closed yorugac closed 11 months ago

yorugac commented 2 years ago

A test run with cloud output is not resilient to an external restart of the operator's pod. This happens mainly because the controller does not store its full state during cloud output execution. When the operator is restarted by an external actor, the controller's flow may break for any test run; for a test run with cloud output specifically, the test run may be started but never finalized.

More precisely, since https://github.com/grafana/k6-operator/pull/86/commits/f08da61c27776c2fe89b325566751be5026ff059, FinishJobs always finalizes by timeout, regardless of the state of the runner pods. But if the operator's pod is restarted, the test run ID is lost and it is no longer possible to finalize the test. The full solution for such cases is to store the test run ID independently of the pod lifecycle, i.e. externally. Additionally, FinishJobs relies on the cloud.InspectOutput.TotalDuration field, which would also be lost on restart.

yorugac commented 11 months ago

This was resolved as part of #138