grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io

cloud: Binary-based ingestion #2954

Closed codebien closed 1 year ago

codebien commented 1 year ago

What

The current Cloud ingestion service receives the metrics on the CLOUD_URL/v1/metrics/<TEST_REF_ID> endpoint. Each HTTP request contains a JSON payload with an array of Samples at the root; each Sample contains a data field whose type is one of the types shown below.

We want to replace it by implementing a new HTTP body format that uses a binary encoding.

Current JSON payload

```js
[
  {
    "type": "<TYPE>",
    "metric": "<NAME>",
    "data": {
      ...
    }
  },
  {
    ...
  }
]
```

Single point

```js
{
  "type": "Point",
  "metric": "vus",
  "data": {
    "time": "%d",
    "type": "gauge",
    "tags": { "aaa": "bbb", "ccc": "123" },
    "value": 999
  }
}
```

Multi points

```js
{
  "type": "Points",
  "metric": "iter_li_all",
  "data": {
    "time": "%d",
    "type": "counter",
    "tags": { "test": "mest" },
    "values": {
      "data_received": 6789.1,
      "data_sent": 1234.5,
      "iteration_duration": 10000
    }
  }
}
```

Aggregated points

```js
{
  "type": "AggregatedPoints",
  "metric": "http_req_li_all",
  "data": {
    "time": "%d",
    "type": "aggregated_trend",
    "count": 2,
    "tags": { "test": "mest" },
    "values": {
      "http_req_duration": { "min": 0.013, "max": 0.123, "avg": 0.068 },
      "http_req_blocked": { "min": 0.001, "max": 0.003, "avg": 0.002 },
      "http_req_connecting": { "min": 0.001, "max": 0.002, "avg": 0.0015 },
      "http_req_tls_handshaking": { "min": 0.003, "max": 0.004, "avg": 0.0035 },
      "http_req_sending": { "min": 0.004, "max": 0.005, "avg": 0.0045 },
      "http_req_waiting": { "min": 0.005, "max": 0.008, "avg": 0.0065 },
      "http_req_receiving": { "min": 0.006, "max": 0.008, "avg": 0.007 }
    }
  }
}
```

Why

It is required for better efficiency at scale. A binary encoding format would reduce the payload size and the hardware requirements for encoding/decoding operations, both on the cloud and on clients.

Non-Goals

How / Proposals

Create a new Cloud output (v2) that flushes metrics by issuing HTTP requests whose body is serialized using Protobuf.

In summary, an example of the HTTP request:

```
POST CLOUD_URL/v2/metrics/<TEST_REF_ID> HTTP/1.1
Host: www.example.com
User-Agent: k6
Content-Type: application/x-protobuf
Content-Encoding: snappy
K6-Metrics-Protocol-Version: 2.0
```

To stay closer to the Prometheus implementation, the output has to compress the body using the Snappy algorithm.
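
A minimal sketch of how the v2 output could build such a request. It assumes the payload is a generated Protobuf message from the proposed schema; the function name and parameters below are illustrative, not the actual k6 implementation:

```go
package cloudv2

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/golang/snappy"
	"google.golang.org/protobuf/proto"
)

// newMetricsRequest marshals the Protobuf payload (e.g. the proposed MetricSet
// message), compresses it with Snappy, and sets the v2 protocol headers.
func newMetricsRequest(cloudURL, testRunID string, payload proto.Message) (*http.Request, error) {
	raw, err := proto.Marshal(payload) // binary Protobuf encoding of the metrics
	if err != nil {
		return nil, err
	}
	body := snappy.Encode(nil, raw) // Snappy block compression, as Prometheus does

	url := fmt.Sprintf("%s/v2/metrics/%s", cloudURL, testRunID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "k6")
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("K6-Metrics-Protocol-Version", "2.0")
	return req, nil
}
```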

The code below contains a Protobuf proposal, inspired by OpenMetrics, to use for encoding the request body:

EDIT: The Protobuf definition, after several iterations: https://github.com/grafana/k6/blob/0cddc417243fd152f0a2e532b1870fa6d8635d03/output/cloud/expv2/pbcloud/metric.proto

TODO: Use an HDR histogram implementation for mapping the Trend type.
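
As an illustration of what the Trend mapping could look like, here is a small sketch using the github.com/HdrHistogram/hdrhistogram-go library; the final k6 implementation may use a different histogram, so the bounds and names here are assumptions:

```go
package cloudv2

import (
	"fmt"

	hdrhistogram "github.com/HdrHistogram/hdrhistogram-go"
)

// exampleTrendAggregation aggregates http_req_duration samples (in microseconds)
// into an HDR histogram instead of shipping every raw Trend sample.
func exampleTrendAggregation(durationsUS []int64) error {
	// Track values between 1us and 60s with 3 significant digits (assumed bounds).
	h := hdrhistogram.New(1, 60_000_000, 3)
	for _, v := range durationsUS {
		if err := h.RecordValue(v); err != nil {
			return err
		}
	}
	// Only the compact histogram summary would be encoded into the Protobuf payload.
	fmt.Printf("count=%d min=%d max=%d p95=%d mean=%.2f\n",
		h.TotalCount(), h.Min(), h.Max(), h.ValueAtQuantile(95), h.Mean())
	return nil
}
```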

It is a requirement to add the metric name and the test run id as part of the tag set. The output has to add:

```
metrics.<metric>.tags["__name__"] = "<metric-name>"
metrics.<metric>.tags["test_run_id"] = "<test-ref-id>"
```
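
In Go terms, this amounts to something like the following sketch; the label map and function are illustrative only, and the actual Protobuf field layout may differ:

```go
package cloudv2

// addRequiredTags attaches the mandatory tags to a generic label map before encoding.
func addRequiredTags(labels map[string]string, metricName, testRunID string) {
	labels["__name__"] = metricName   // metric name tag
	labels["test_run_id"] = testRunID // test run id tag
}
```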

Additional implementation details

Gate the startup of the new Cloud output behind a config option in the current Cloud output. This way, we can switch the used output at runtime and fall back on the previous logic if required.
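
A minimal sketch of that switch, with hypothetical names (Output, Params, and the apiVersion option are placeholders, not k6's actual types or config keys):

```go
package cloud

// Output and Params stand in for k6's actual output interfaces;
// everything below is illustrative only.
type (
	Params struct{ ConfigArgument string }
	Output interface{ Start() error }
)

func newOutputV1(p Params) (Output, error) { return nil, nil } // current JSON-based output
func newOutputV2(p Params) (Output, error) { return nil, nil } // new Protobuf-based output

// New picks the implementation at runtime based on a hypothetical apiVersion
// option, falling back to the previous logic when v2 is not requested.
func New(apiVersion int, p Params) (Output, error) {
	if apiVersion >= 2 {
		return newOutputV2(p)
	}
	return newOutputV1(p)
}
```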

Action Plan

  1. Quick and dirty implementation of a basic Cloud output v2
    • Config option for enabling v2 and the related fallback logic in v1
    • Ability to flush metric samples encoded as defined by the new protocol
    • No Trend implementation
  2. Trend implementation as an HDR histogram
  3. Iterate for polish and stability

Future

Open Questions

Work log

This work log collects all the tasks required for the new cloud output. The new cloud output includes a significant refactor, a new binary format for the metrics requests' payload, and samples aggregation with HDR histogram generation on the client.

It depends on the following PRs as prerequisites:

The following PRs are expected to be merged to have the final working output:

codebien commented 1 year ago

Most of the work planned here has been merged. I will close this issue and continue with the remaining performance optimizations in a new dedicated issue: https://github.com/grafana/k6/issues/3117.