Refactor the `Output` interface

na-- commented 2 years ago

I've recently been going back to basics and making some substantial changes and refactors in how base components of k6 work. One of these has been the current core.Engine. Through https://github.com/grafana/k6/issues/1889, https://github.com/grafana/k6/pull/2426 and another WIP PR I've been slowly trying to split it apart and obviate it, since that will unlock a lot of flexibility and remove tons of cruft.

During that effort, I've noticed another suboptimal component we have in k6, our outputs. This is the current output.Output API: https://github.com/grafana/k6/blob/55c7ccd53d6c0775740e0880538bd4c8b231bc7a/output/types.go#L58-L76

And even though I was the one who kind of wrote it, I still think it's quite bad :sweat_smile: I think the biggest reason it's so bad is because it kind of had to be, due to the dysfunctional nature of our current Rube-Goldberg core.Engine :sweat_smile: But I'm really close to getting rid of the Engine completely, and it has been making obvious how bad the Outputs are because of it...

Here's what I actually propose the new Output API should look like:

type Output interface {
    // The output stops when `samples` is closed, and k6 will wait() for it to finish.
    Start(samples chan stats.SampleContainer) (wait func() error, err error)
}

type WithDescription interface {
    Output
    Description() string
}
// ... same single-method optional interfaces: WithThresholds, WithTestRunStop, WithRunStatusUpdates, WithBuiltinMetrics

Besides the fact that then new API is much simpler and more idiomatic Go code, the old API actually has a problem for the use case of k6. Specifically, buffering is important when we have network IO, but counter-productive when we're passing metrics internally in k6. It's probably somewhat jittery to have these buffers from AddMetricSamples() internally, since their batched processing will cause memory churn, but also small CPU usage spikes at intermittent intervals when they are processed. If there are CPU bursts every 50ms, that might affect the actual test execution and metric measurements... :disappointed:

I also think there is a way to do this transition gracefully and support both old and new Output extensions for a few releases (the old ones with an adapter to the new API that emits a warning). We can move the outputs to the new /metrics/ top-level package and make the current /output package a adapter proxy for the new API for a few k6 version. That proxy can emit a deprecation warning, so users can switch their extensions to the new API.

mstoykov commented 2 years ago

Without looking into the proposition too much:

Given that #1831 will likely have breaking changes for outputs as well I would prefer if we don't break the interface once to then break it again. Especially as (from a quick look at this) one of the breaking changes will not beneficial to current outputs all that much, even if it keeps supporting them for some time.

Given that I would prefer if we don't work on this before we work on #1831 which likely will need changes to the output interface as well.

na-- commented 2 years ago

:thinking: Good points :+1: Though I am not sure if we'll be able to gracefully deprecate the current outputs without a ton of effort for https://github.com/grafana/k6/issues/1831, so maybe don't bother with that and do both breaking changes at once, to rip the band-aid off quickly? :sweat_smile:

na-- commented 2 years ago

We should probably also add a context to either the current Stop() function, or, ideally, to my proposed wait() error function, like this:

type Output interface {
    // The output stops when `samples` is closed, and k6 will wait() for it to finish.
    Start(samples chan stats.SampleContainer) (wait func(context.Context) error, err error)
}

This will allow us to set a soft limit for how long outputs are allowed to stop.

na-- commented 1 year ago

In the course of removing the Engine (https://github.com/grafana/k6/issues/1889) by rebasing and polishing this old PoC commit from https://github.com/grafana/k6/pull/2438, I started hitting some import cycles due to https://github.com/grafana/k6/pull/2536 :disappointed: :face_exhaling: But then, in a classic "Hal fixing a light bulb" moment, I decided that it is the perfect reason to finally refactor the lib.RunStatus mess... :sweat_smile:

The big problem there is that RunStatus is something quite specific to our cloud execution, but it is unfortunately part of the lib/ packages and used in a bunch of core places... :disappointed:

My current thinking is that since these run statuses are pretty much specific for our cloud API, so they should be moved to the cloudapi/ package and basically not used anywhere else in the code. However, then this output interface becomes a minor problem: https://github.com/grafana/k6/blob/5fa71b761185200d29c87eff16faa5e4a6c372b7/output/types.go#L78-L82

The logical solution would be to have some way to notify the output that the test has finished executing with some error. If the err value is nil, then it finished normally. If it isn't, the cloud output (or any other output that cares) can determine the "run status" in a similar way to how we determine exit codes and such things elsewhere in the code :thinking:

I'm in the process of making a PR with this refactoring, but I am also commenting in this issue because this will affect how we refactor the Output interface in the future. We either need to have some way to pass outputs this information (e.g. have a Stop(err error) method, not just a Start() one), or we need to provide outputs some way of querying this information themselves.

codebien commented 1 year ago

or we need to provide outputs some way of querying this information themselves.

In that case, it could be something like: s/SetRunStatus(latestStatus lib.RunStatus)/SetTestRunError(err error)/. The advantage of the setter is that it decouples the notification from the Stop operation. I don't know if it is helpful in some way, however, if the Stop method or the setter will require a deeper analysis of the usage.

In the case of Stop the context would still be required like Stop(ctx context.Context, err error) because as we said before it would be helpful to have a:

soft limit for how long outputs are allowed to stop.

For example, it would be beneficial for the Prometheus Remote Write Output that makes an HTTP call during the Stop operation. https://github.com/grafana/xk6-output-prometheus-remote/pull/73#discussion_r1039458897

grafana / k6

Refactor the `Output` interface #2430