grafana / k6-operator

An operator for running distributed k6 tests.

Remove starter job in favor of built-in implementation #87

Open yorugac opened 2 years ago

yorugac commented 2 years ago

Problem

The starter currently relies on the REST API of k6 to start the runners. It is a workable solution that makes it possible to synchronize the start of the test, but it requires scheduling a separate pod, which can be time-consuming and/or inefficient depending on the setup and test run. Additionally, it is a limiting approach for very large tests.
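For illustration, the request that each starter `curl` ends up sending to a runner has roughly this shape. This is a sketch only: the package and function names are hypothetical, the host value is a placeholder, and 6565 is k6's default REST API port.

```go
// Package k6start illustrates the HTTP call behind starting paused runners.
package k6start

import (
	"bytes"
	"fmt"
	"net/http"
)

// resumeRunner un-pauses a single k6 runner over its REST API.
// host is a placeholder such as "testrun-1.default.svc:6565".
func resumeRunner(host string) error {
	// `k6 resume` performs the same request under the hood:
	// PATCH /v1/status with "paused" set to false.
	body := []byte(`{"data":{"type":"status","id":"default","attributes":{"paused":false}}}`)
	req, err := http.NewRequest(http.MethodPatch,
		fmt.Sprintf("http://%s/v1/status", host), bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("runner %s: unexpected status %s", host, resp.Status)
	}
	return nil
}
```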

Suggested solution

The starting functionality is built into k6 itself, even if just for a one-instance test: it is what the k6 resume CLI command of the externally-controlled executor uses. So k6-operator could reuse the existing k6 code base and implement a synchronized start of the test directly in the TestRun controller. To achieve synchronization, the implementation should rely on goroutines.
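A minimal sketch of such a built-in start, fanning out one goroutine per runner. It assumes the hypothetical `resumeRunner` helper from the sketch above; the names and the error handling are illustrative, not the actual controller code.

```go
package k6start

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// startAllRunners resumes every runner concurrently so that all instances
// leave the paused state at (nearly) the same moment, replacing the
// separate starter pod. hosts would come from the per-runner Services.
func startAllRunners(ctx context.Context, hosts []string) error {
	g, _ := errgroup.WithContext(ctx)
	for _, host := range hosts {
		host := host // capture the loop variable (pre-Go 1.22 semantics)
		g.Go(func() error {
			return resumeRunner(host)
		})
	}
	// Wait blocks until all goroutines finish and returns the first error.
	return g.Wait()
}
```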

This solution resolves all the issues described above. But there are cons as well:

Additional context

Additional pros:

k6 issue about REST API to keep an eye on: https://github.com/grafana/k6/issues/3497

simskij commented 2 years ago

Just thought I'd offer some background, as we (read: I) deliberately opted for the HTTP REST API over direct pod invocation.

While using k6 resume would work, it would require higher privileges for the starter pod, which I'm not really sure is a good idea. Allowing the starter pod to execute HTTP requests is far less likely to cause issues than giving a pod the privileges to execute arbitrary shell commands in the load generators.

yorugac commented 2 years ago

Hi @simskij, thank you for providing additional background and input :slightly_smiling_face:

I don't think I'm following here: what do you mean by higher privileges? Internally, k6 resume is nothing but a wrapper around a REST call.

simskij commented 2 years ago

My bad! I was interpreting your intention as calling k6 resume inside the runners using the programmatic equivalent to kubectl exec, which would then have required the starter to have execution privileges in the runner pods.

yorugac commented 2 years ago

Yes, that'd be quite a different scenario :smile: Thanks for clarifying :+1: I'll update the description a bit to be more precise.

yorugac commented 2 years ago

This might not be fully viable after all :disappointed: given cases like https://github.com/grafana/k6-operator/pull/89

frittentheke commented 3 months ago

@yorugac @simskij

While I totally understand and appreciate the synchronized start of all k6 instances (parallelism > 1), I am wondering why the operator does not simply issue those HTTP calls (https://github.com/grafana/k6-operator/blob/40dce3ffc13c55fb665c41b891c1459d3ad66068/controllers/k6_start.go#L93 -> https://github.com/grafana/k6-operator/blob/40dce3ffc13c55fb665c41b891c1459d3ad66068/pkg/resources/containers/curl_start.go#L29) to the REST API endpoints natively? It just seems rather complex to spin up another job / pod only to issue a few HTTP calls via curl.

Especially if you compare this to the setup phase (https://github.com/grafana/k6-operator/blob/40dce3ffc13c55fb665c41b891c1459d3ad66068/controllers/k6_start.go#L96 -> https://github.com/grafana/k6-operator/blob/main/controllers/common.go#L231 -> https://github.com/grafana/k6-operator/blob/40dce3ffc13c55fb665c41b891c1459d3ad66068/pkg/testrun/k6client.go#L17), which is already using the Go k6 client. Aligning the two would also follow the idea of this issue to make the communication with k6 a little less freestyle.

Closer tracking of the health and test progress of the k6 pods could likely also be added once the k6 REST API is used natively.
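A hedged sketch of issuing the resume call natively through the same Go client the setup phase uses in pkg/testrun/k6client.go. The helper name is hypothetical and the client's exact signatures may differ between k6 versions.

```go
package k6start

import (
	"context"
	"fmt"

	v1 "go.k6.io/k6/api/v1"
	"go.k6.io/k6/api/v1/client"
	"gopkg.in/guregu/null.v3"
)

// resumeViaClient un-pauses one runner through the k6 Go client instead
// of a curl container, mirroring how the setup phase already talks to
// the runners.
func resumeViaClient(ctx context.Context, hostname string) error {
	c, err := client.New(fmt.Sprintf("%s:6565", hostname))
	if err != nil {
		return err
	}
	// Patch the runner's status so it is no longer paused.
	_, err = c.SetStatus(ctx, v1.Status{Paused: null.BoolFrom(false)})
	return err
}
```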

yorugac commented 2 months ago

> I am wondering why the operator does not simply issue those HTTP calls to the REST API endpoints natively? It just seems rather complex to spin up another job / pod only to issue a few HTTP calls via curl.

Given the discussion above, I believe this is basically what this issue is about at this point :smile: I should update the description to reflect that.

This issue wasn't prioritized so far simply because nobody seemed to care about it, until recently. So thank you for sharing your opinion, @frittentheke!

But also, do any of the existing users actually depend on the starter pod? One can specify a starter in TestRun, so if anyone does that, changing this behaviour now would mean breaking existing workflows. If anyone uses .spec.starter in TestRun, please comment here!

For completeness' sake, on this comment above: currently I don't think this will be an issue, as no other such cases have come up, AFAIR... And the part about "waiting for services to be up" does not need to change; only the logic after that will be impacted.

yorugac commented 2 months ago

Updated the description of the issue to bring it up to date.

cc @ivanape I believe this issue is related to what we discussed before :slightly_smiling_face: