Combine all initializer commands with && to catch any failing commands

frittentheke commented 3 months ago

By running two commands instead of one (the second being the cat | grep), any failures (non-zero exit code) of the first part (containing k6 inspect) will be lost and masked away.

By chaining them all with && the first non-zero RC will fail the whole command and return.

Fixes: #435

yorugac commented 3 months ago

@frittentheke, thank you for the PR! It seems like you have opened 2 PRs for the same issue? Could you please leave only one?

Quickly skimming the code, the change seems reasonable on the first glance, but I need to look up the details there again. Additionally, this change requires testing. I'll be able to fully review this in a couple of weeks approx. Hope that's alright!

frittentheke commented 3 months ago

@frittentheke, thank you for the PR! It seems like you have opened 2 PRs for the same issue? Could you please leave only one?

I changed one to draft and shall remove the "fixes" reference to this issue. It is simply not "just" a bugfix.

Thanks for looking into and testing #453

yorugac commented 2 months ago

@frittentheke, sorry for the delay! I've finally managed to test this PR a bit. So there are 2 parts: 1) correct error return in common.go: that's a great catch; thank you! 2) the k6 inspect command: I'm afraid this change makes debugging more complicated than before. The current flow, with ;, allows one to see error in logs of initializer. Example user flow: user sees an initializer job has failed in k6-operator's logs and then can go and see the exact error in initializer logs. But with &&, initializer fails and leaves no logs afterwards. AFAIS, it becomes harder to debug the problem. If you could describe a more specific case of initializer failure that you were looking into, please do! Otherwise, I'd request to omit this part as I don't see how it helps ATM...

yorugac commented 1 month ago

Hi @frittentheke, would you be able to modify the PR to leave only the first change, in common.go? We're going to have a release this week, and it'd be preferable to merge that fix before that. We can continue discussion about the second change at a later point / in another PR, of course.

frittentheke commented 1 month ago

@frittentheke, sorry for the delay! I've finally managed to test this PR a bit. So there are 2 parts:

1. correct error return in `common.go`: that's a great catch; thank you!

2. the `k6 inspect` command: I'm afraid this change makes debugging more complicated than before. The current flow, with `;`, allows one to see error in logs of initializer. Example user flow: user sees an `initializer job has failed` in k6-operator's logs and then can go and see the exact error in initializer logs. But with `&&`, initializer fails and leaves no logs afterwards. AFAIS, it becomes harder to debug the problem. If you could describe a more specific case of initializer failure that you were looking into, please do! Otherwise, I'd request to omit this part as I don't see how it helps ATM...

@yorugac sorry for the delay ...

Regarding 1) I created a new PR with just the var typo fixed: https://github.com/grafana/k6-operator/pull/465

As for 2) :

I'd like to argue that the current flow actually only handles certain types of errors - those that go into that very output file /tmp/k6logs and that are then caught by the grep filter level=error- but then misses others: Any issues with the initializer (causing non-zero exit codes) or cause none of those error lines to be emitted are then lost and the initializer actually falsely succeeds as the original return code of the inspect is masked or lost and only the one from the grep command counts for the pod termination. In my case, which made me write up the PR, the inspect command was not able to start (due to some imports being wrong) and the initializer still succeeded as there was not output with level=error and the k6-operator got stuck in the next phase (no JSON output, no jobs, ...).

To me any unexpected non-zero return code should be handled as an error (I know about https://github.com/grafana/k6/issues/2804, https://github.com/grafana/k6-operator/issues/75, ...). A while back I started a bigger PR (https://github.com/grafana/k6-operator/pull/450) completely reworking the way the initializer works through the various tasks and then sends off it's verdict via termination log, the kubernetes-native way of handling this phase (https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/). But that PR / discussion is for another day I suppose. If you even like the idea to rework the whole thing to make it a lot more robust.

frittentheke commented 1 month ago

@yorugac with 0.17.0 released would you consider discussing 2) some more? If you agree that my chain of thought is not totally off and would consider a rework the likes of https://github.com/grafana/k6-operator/pull/450 I'd gladly take in some feedback there to finish the PR.

yorugac commented 2 weeks ago

Hi @frittentheke, apologies for this delay! I've got swamped with internal work. And thank you for your patience! :joy:

You described this case:

In my case, which made me write up the PR, the inspect command was not able to start (due to some imports being wrong) and the initializer still succeeded as there was not output with level=error and the k6-operator got stuck in the next phase (no JSON output, no jobs, ...).

AFAIR, when imports are off, there should be logs left in initializer which then should cause an error in k6-operator on JSON unmarshal. It sounds like it wasn't so for you, so no logs despite the import error? If so, could you please share a script how to repeat that situation?

grafana / k6-operator

Combine all initializer commands with && to catch any failing commands #453