After the first iteration, the memory usage is higher than required. With Trend metrics in particular, it is very easy to saturate the bandwidth, with sizes ranging from many kilobytes up to the remote limit (1 MB).
We also decided to denormalize some fields to reduce the workload and keep the implementation simple on the remote server. However, this generates a high load on the client, so we should revisit this decision.
Fault tolerance
The current flush process could be more fault tolerant: it doesn't retry on failures.
Validation
`__name__` and `test_run_id` are reserved labels for the remote service. If a test also sets them, the conflict causes unexpected behavior for the user. A more developer-friendly UX should be implemented.
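One possible shape for that UX is to reject the reserved names up front with a clear error. The sketch below assumes tags are available as a plain `map[string]string`; the `reservedLabels` set and the `validateTags` helper are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// reservedLabels holds the label names reserved by the remote service.
var reservedLabels = map[string]struct{}{
	"__name__":    {},
	"test_run_id": {},
}

// validateTags returns an error naming every reserved label found in tags,
// or nil when none of the tag names is reserved.
func validateTags(tags map[string]string) error {
	var conflicts []string
	for name := range tags {
		if _, ok := reservedLabels[name]; ok {
			conflicts = append(conflicts, name)
		}
	}
	if len(conflicts) == 0 {
		return nil
	}
	sort.Strings(conflicts) // map iteration order is random, keep the message stable
	return fmt.Errorf("tag names %v are reserved by the remote service", conflicts)
}
```

Failing early with the offending names listed tells the user exactly which tags to rename, rather than leaving them to debug inconsistent results on the remote side.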
Proposal
We identified some actions that should drive us to the goal:
- A more compact Protobuf representation for Histogram.
- Split the flush into multiple requests when the number of time series exceeds the `MaxMetricSamplesPerPackage` variable.
- Normalize the fields that are common across time series by moving them into `MetricSet`'s fields.
- A fault-tolerant flush operation.
- Exclude `__name__` and `test_run_id` from the allowed tag names.
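The request-splitting action could be sketched roughly as follows. The `timeSeries` type and the `splitIntoChunks` helper are hypothetical placeholders. In the real output, the limit would come from `MaxMetricSamplesPerPackage` and each chunk would be sent as its own request.

```go
package main

// timeSeries is a hypothetical placeholder for the output's time series type.
type timeSeries struct {
	Name string
}

// splitIntoChunks partitions series into consecutive slices holding at most
// maxPerRequest items each. The chunks share the input's backing array.
func splitIntoChunks(series []timeSeries, maxPerRequest int) [][]timeSeries {
	if maxPerRequest <= 0 || len(series) == 0 {
		return nil
	}
	count := (len(series) + maxPerRequest - 1) / maxPerRequest // ceiling division
	chunks := make([][]timeSeries, 0, count)
	for start := 0; start < len(series); start += maxPerRequest {
		end := start + maxPerRequest
		if end > len(series) {
			end = len(series)
		}
		// Cap the capacity so appending to one chunk cannot overwrite the next.
		chunks = append(chunks, series[start:end:end])
	}
	return chunks
}
```

For example, five time series with a limit of two would be sent as three requests holding two, two, and one series.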
### Nice to have (in case we need to reduce the scope)
- [ ] https://github.com/grafana/k6/pull/3120
- [ ] https://github.com/grafana/k6/pull/3125
- [ ] https://github.com/grafana/k6/pull/3146
- [ ] https://github.com/grafana/k6/pull/3137
- [ ] https://github.com/grafana/k6/issues/3122
- [ ] Re-evaluate the current periodic and abort signal architecture/interaction (https://github.com/grafana/k6/pull/3082#discussion_r1207875810, https://github.com/grafana/k6/pull/3104#discussion_r1224212064)
- [ ] Unexport all the structs/methods/fields that don't need to be exported
Context
https://github.com/grafana/k6/issues/2954 introduces the new experimental Cloud output with a Protobuf-based protocol.
Acceptance criteria
Change the Cloud output default version to `2`.
Worklog