grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io

cloud: Binary-based ingestion #2954

Closed codebien closed 1 year ago

codebien commented 1 year ago

What

The current Cloud ingestion service receives the metrics on the CLOUD_URL/v1/metrics/<TEST_REF_ID> endpoint. Each HTTP request contains a JSON payload with an array of Samples at the root; each Sample contains a data field whose type is one of the types shown below.

We want to replace it by implementing a new HTTP body format that uses a binary encoding.

Current JSON payload

```js
[
  {
    "type": "<TYPE>",
    "metric": "<NAME>",
    "data": {
      ...
    }
  },
  {
    ...
  }
]
```

Single point

```js
{
  "type": "Point",
  "metric": "vus",
  "data": {
    "time": "%d",
    "type": "gauge",
    "tags": { "aaa": "bbb", "ccc": "123" },
    "value": 999
  }
}
```

Multi points

```js
{
  "type": "Points",
  "metric": "iter_li_all",
  "data": {
    "time": "%d",
    "type": "counter",
    "tags": { "test": "mest" },
    "values": {
      "data_received": 6789.1,
      "data_sent": 1234.5,
      "iteration_duration": 10000
    }
  }
}
```

Aggregated points

```js
{
  "type": "AggregatedPoints",
  "metric": "http_req_li_all",
  "data": {
    "time": "%d",
    "type": "aggregated_trend",
    "count": 2,
    "tags": { "test": "mest" },
    "values": {
      "http_req_duration": { "min": 0.013, "max": 0.123, "avg": 0.068 },
      "http_req_blocked": { "min": 0.001, "max": 0.003, "avg": 0.002 },
      "http_req_connecting": { "min": 0.001, "max": 0.002, "avg": 0.0015 },
      "http_req_tls_handshaking": { "min": 0.003, "max": 0.004, "avg": 0.0035 },
      "http_req_sending": { "min": 0.004, "max": 0.005, "avg": 0.0045 },
      "http_req_waiting": { "min": 0.005, "max": 0.008, "avg": 0.0065 },
      "http_req_receiving": { "min": 0.006, "max": 0.008, "avg": 0.007 }
    }
  }
}
```

Why

It is required for better efficiency at scale. A binary encoding format would reduce the payload size and the hardware requirements for encoding/decoding operations, both on the cloud and on clients.

Non-Goals

How / Proposals

Create a new Cloud output (v2) that flushes metrics by issuing HTTP requests whose body is serialized using Protobuf.

In summary, an example of the HTTP request:

```
POST CLOUD_URL/v2/metrics/<TEST_REF_ID> HTTP/1.1
Host: www.example.com
User-Agent: k6
Content-Type: application/x-protobuf
Content-Encoding: snappy
K6-Metrics-Protocol-Version: 2.0
```

To stay closer to the Prometheus implementation, the output has to compress the body using the Snappy algorithm.
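
A minimal sketch of how the v2 output could build such a request. It assumes the payload is a generated Protobuf message from the proposed schema; the function name and parameters below are illustrative, not the actual k6 implementation:

```go
package cloudv2

import (
	"bytes"
	"fmt"
	"net/http"

	"github.com/golang/snappy"
	"google.golang.org/protobuf/proto"
)

// newMetricsRequest marshals the Protobuf payload (e.g. the proposed MetricSet
// message), compresses it with Snappy, and sets the v2 protocol headers.
func newMetricsRequest(cloudURL, testRunID string, payload proto.Message) (*http.Request, error) {
	raw, err := proto.Marshal(payload) // binary Protobuf encoding of the metrics
	if err != nil {
		return nil, err
	}
	body := snappy.Encode(nil, raw) // Snappy block compression, as Prometheus does

	url := fmt.Sprintf("%s/v2/metrics/%s", cloudURL, testRunID)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "k6")
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("K6-Metrics-Protocol-Version", "2.0")
	return req, nil
}
```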

The code below contains a Protobuf proposal, inspired by OpenMetrics, to use for encoding the request body:

EDIT: The Protobuf definition, after several iterations: https://github.com/grafana/k6/blob/0cddc417243fd152f0a2e532b1870fa6d8635d03/output/cloud/expv2/pbcloud/metric.proto

TODO: Use an HDR histogram implementation for mapping the Trend type.
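
As an illustration of what the Trend mapping could look like, here is a small sketch using the github.com/HdrHistogram/hdrhistogram-go library; the final k6 implementation may use a different histogram, so the bounds and names here are assumptions:

```go
package cloudv2

import (
	"fmt"

	hdrhistogram "github.com/HdrHistogram/hdrhistogram-go"
)

// exampleTrendAggregation aggregates http_req_duration samples (in microseconds)
// into an HDR histogram instead of shipping every raw Trend sample.
func exampleTrendAggregation(durationsUS []int64) error {
	// Track values between 1us and 60s with 3 significant digits (assumed bounds).
	h := hdrhistogram.New(1, 60_000_000, 3)
	for _, v := range durationsUS {
		if err := h.RecordValue(v); err != nil {
			return err
		}
	}
	// Only the compact histogram summary would be encoded into the Protobuf payload.
	fmt.Printf("count=%d min=%d max=%d p95=%d mean=%.2f\n",
		h.TotalCount(), h.Min(), h.Max(), h.ValueAtQuantile(95), h.Mean())
	return nil
}
```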

It is a requirement to add the metric name and the test run id as part of the tag set. The output has to add:

```
metrics.<metric>.tags["__name__"] = "<metric-name>"
metrics.<metric>.tags["test_run_id"] = "<test-ref-id>"
```
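
In Go terms, this amounts to something like the following sketch; the label map and function are illustrative only, and the actual Protobuf field layout may differ:

```go
package cloudv2

// addRequiredTags attaches the mandatory tags to a generic label map before encoding.
func addRequiredTags(labels map[string]string, metricName, testRunID string) {
	labels["__name__"] = metricName   // metric name tag
	labels["test_run_id"] = testRunID // test run id tag
}
```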

Additional implementation details

Gate the startup of the new Cloud output behind a config option in the current Cloud output. This way, we can switch the used output at runtime and fall back on the previous logic if required.
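
A minimal sketch of that switch, with hypothetical names (Output, Params, and the apiVersion option are placeholders, not k6's actual types or config keys):

```go
package cloud

// Output and Params stand in for k6's actual output interfaces;
// everything below is illustrative only.
type (
	Params struct{ ConfigArgument string }
	Output interface{ Start() error }
)

func newOutputV1(p Params) (Output, error) { return nil, nil } // current JSON-based output
func newOutputV2(p Params) (Output, error) { return nil, nil } // new Protobuf-based output

// New picks the implementation at runtime based on a hypothetical apiVersion
// option, falling back to the previous logic when v2 is not requested.
func New(apiVersion int, p Params) (Output, error) {
	if apiVersion >= 2 {
		return newOutputV2(p)
	}
	return newOutputV1(p)
}
```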

Action Plan

  1. Quick and dirty implementation of a basic Cloud output v2
    • Config option for enabling v2 and the related fallback logic in v1
    • Ability to flush metric samples encoded as defined by the new protocol
    • No Trend implementation
  2. Trend implementation as an HDR histogram
  3. Iterate for polish and stability

Future

Open Questions

Work log

This work log collects all the tasks required for the new cloud output. The new cloud output includes a significant refactor, a new binary format for the metrics requests' payload, and samples aggregation with HDR histogram generation on the client.

It depends on the following PRs as prerequisites:

The following PRs are expected to be merged to have the final working output:

codebien commented 1 year ago

Most of the work planned here has been merged. I will close this issue and continue with the remaining performance optimizations in a new dedicated issue: https://github.com/grafana/k6/issues/3117.