grafana / xk6-output-prometheus-remote

k6 extension to output real-time test metrics using Prometheus Remote Write.
GNU Affero General Public License v3.0

Investigate potential for concurrency processing #3

Open yorugac opened 2 years ago

yorugac commented 2 years ago

There are two main ways to add concurrency to the extension:

1) concurrency at the level of processing metrics pre-request

Details: Output receives batches of metrics that must be iterated over and converted into remote write TimeSeries. This may seem like a natural point to add concurrency, like this:

   samplesContainers := o.GetBufferedSamples()
   // ceiling integer division, so the remainder is not dropped
   step := (len(samplesContainers) + concurrencyLimit - 1) / concurrencyLimit

   for i := 0; i < concurrencyLimit; i++ {
      wg.Add(1)
      // get chunk of samplesContainers from i * step to (i+1) * step,
      // clamped to len(samplesContainers)
      go func(i int, ...) {
         defer wg.Done()
         ...
         gatherpoint[i] = convertToTimeSeries(chunk)
         ...
      }(i, ...)
   }
   wg.Wait()

   for i := 0; i < concurrencyLimit; i++ {
      allTS = append(allTS, gatherpoint[i]...)
   }

   // encode and send remote write request

But this processing must be done within the 1-second flush period. Basic experiments so far showed next to no improvement from spawning goroutines within that time limit. This result will likely be affected by the changes from Metric Refactoring in k6 and might need more investigation.
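For reference, the chunked fan-out described above could look roughly like the following self-contained sketch. The `sample` and `timeSeries` types and the body of `convertToTimeSeries` are hypothetical stand-ins, not the extension's actual types; only the chunking and gathering pattern is the point here, and nothing in it implies a speed-up.

```go
package main

import (
	"fmt"
	"sync"
)

// sample and timeSeries are hypothetical stand-ins for k6 samples and
// remote write TimeSeries; they are assumptions made for this sketch.
type sample struct {
	Metric string
	Value  float64
}

type timeSeries struct {
	Name  string
	Value float64
}

// convertToTimeSeries is a placeholder for the real per-chunk conversion.
func convertToTimeSeries(chunk []sample) []timeSeries {
	out := make([]timeSeries, 0, len(chunk))
	for _, s := range chunk {
		out = append(out, timeSeries{Name: s.Metric, Value: s.Value})
	}
	return out
}

// convertConcurrently splits samples into at most `workers` contiguous
// chunks, converts each chunk in its own goroutine, and concatenates the
// results in chunk order.
func convertConcurrently(samples []sample, workers int) []timeSeries {
	if workers < 1 {
		workers = 1
	}
	// Ceiling division so the remainder is not dropped.
	step := (len(samples) + workers - 1) / workers

	gather := make([][]timeSeries, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		start := i * step
		if start >= len(samples) {
			break // fewer samples than workers
		}
		end := start + step
		if end > len(samples) {
			end = len(samples)
		}
		wg.Add(1)
		go func(i int, chunk []sample) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			gather[i] = convertToTimeSeries(chunk)
		}(i, samples[start:end])
	}
	wg.Wait()

	all := make([]timeSeries, 0, len(samples))
	for _, ts := range gather {
		all = append(all, ts...)
	}
	return all
}

func main() {
	samples := []sample{{"http_reqs", 1}, {"http_reqs", 2}, {"vus", 10}, {"vus", 11}, {"iterations", 5}}
	fmt.Println(len(convertConcurrently(samples, 2)))
}
```

Passing `i` and the chunk as goroutine arguments avoids sharing the loop variable between goroutines, and the per-index slots keep the original sample order after the gather step.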

2) concurrency at remote write requests

Details: this is blocked by the inability to compile TimeSeries (i.e. to group samples). Attempting to send disjoint samples concurrently would only result in out-of-order errors.
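To illustrate what "grouping samples" would have to provide before requests could be sent concurrently, here is a minimal sketch that buckets samples by series and orders each bucket by timestamp, so that any one series is owned by a single request. The `sample` type and its `Series` key are assumptions made for illustration; this is not how the extension currently works.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical flattened sample; Series stands in for whatever
// uniquely identifies a time series (for example, metric name plus tags).
type sample struct {
	Series    string
	Timestamp int64 // milliseconds
	Value     float64
}

// groupBySeries buckets samples per series and sorts each bucket by
// timestamp, so a series' samples can be sent together and in order.
func groupBySeries(samples []sample) map[string][]sample {
	grouped := make(map[string][]sample)
	for _, s := range samples {
		grouped[s.Series] = append(grouped[s.Series], s)
	}
	for _, g := range grouped {
		// Sorting in place works because g shares its backing array
		// with the slice stored in the map.
		sort.SliceStable(g, func(i, j int) bool { return g[i].Timestamp < g[j].Timestamp })
	}
	return grouped
}

func main() {
	samples := []sample{
		{"http_reqs", 2000, 1},
		{"vus", 1000, 10},
		{"http_reqs", 1000, 1},
	}
	for series, g := range groupBySeries(samples) {
		fmt.Println(series, len(g), g[0].Timestamp)
	}
}
```

With samples grouped this way, different series could in principle go into different concurrent requests, because no two requests would carry samples of the same series.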

mhaddon commented 2 years ago

Does this mean that, with the k6-operator, parallelism has to be 1? Or does it just mean that requests can't take longer than 1 second?

yorugac commented 2 years ago

Hi @mhaddon, this issue has no relation to k6-operator. It's just an open question about possible ways to optimize the performance of this extension, xk6-output-prometheus-remote. 1 second is the default value of the flush period: https://github.com/grafana/xk6-output-prometheus-remote/blob/d50ae155d36ec7acbf015932a232077d3e9743e3/pkg/remotewrite/config.go#L20

So yes, if the flush period is left at its default, it's preferable that requests don't take more than 1 second; otherwise, there would be degraded performance and loss of data.

As mentioned in the description, concurrency experiments don't seem to bring much of an improvement without solving "metrics refactoring" first, which is to be addressed in #2.