vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector aggregated metrics reporting as zero in Datadog #21833

Open decimalst opened 2 days ago

decimalst commented 2 days ago

Problem

Hi, I have a gauge metric that is reported multiple times per second over HTTP in InfluxDB format, and I am trying to aggregate a count of those reports.

In my console logs sink, the aggregation with count appears to work, and shows a value of 1000, equaling my test suite. However, when I view the metric in Datadog, it shows a reported value of 0. (Screenshot attached, taken 2024-11-19 at 1:52:29 PM.)

Configuration

vector.yaml: |
    data_dir: /vector-data-dir
    sinks:
      console:
        encoding:
          codec: json
        inputs:
        - unfiltered_metrics
        type: console
      datadog-metrics:
        default_api_key: myapikey
        inputs:
        - internal_metrics
        - unfiltered_metrics
        tls:
          enabled: true
        type: datadog_metrics
    sources:
      influx_http:
        address: 0.0.0.0:8086
        decoding:
          codec: influxdb
        method: POST
        path: /write
        response_code: 204
        type: http_server
      influx_http_query:
        address: 0.0.0.0:8087
        encoding: text
        method: GET
        path: /query
        response_code: 200
        type: http_server
      internal_metrics:
        type: internal_metrics
    transforms:
      aggregate_gauge_counters:
        inputs:
        - route_metrics.aggregate_gauge_counters
        interval_ms: 10000
        mode: Count
        type: aggregate
      unfiltered_metrics:
        inputs:
        - aggregate_gauge_counters
        source: |
          true == true
        type: remap
      filter_telegraf_metrics:
        inputs:
        - influx_http
        source: |
          true == true
        type: remap
      route_metrics:
        inputs:
        - filter_telegraf_metrics
        route:
          aggregate_gauge_counters:
            source: |
              .kind == "absolute"
            type: vrl
        type: route

Version

0.42.0

Debug Output

2024-11-19T18:46:28.353367Z INFO vector::app: Log level is enabled. level="info"
2024-11-19T18:46:28.357859Z INFO vector::app: Loading configs. paths=["/etc/vector"]
2024-11-19T18:46:28.360326Z WARN vector::config::loading: Transform "route_metrics._unmatched" has no consumers
2024-11-19T18:46:28.360335Z WARN vector::config::loading: Source "influx_http_query" has no consumers
2024-11-19T18:46:28.385001Z INFO vector::topology::running: Running healthchecks.
2024-11-19T18:46:28.385125Z INFO vector: Vector has started. debug="false" version="0.42.0" arch="x86_64" revision="3d16e34 2024-10-21 14:10:14.375255220"
2024-11-19T18:46:28.385135Z INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
2024-11-19T18:46:28.385180Z INFO source{component_kind="source" component_id=influx_http component_type=http_server}: vector::sources::util::http::prelude: Building HTTP server. address=0.0.0.0:8086
2024-11-19T18:46:28.385206Z INFO source{component_kind="source" component_id=influx_http_query component_type=http_server}: vector::sources::util::http::prelude: Building HTTP server. address=0.0.0.0:8087
2024-11-19T18:46:28.385335Z INFO vector::topology::builder: Healthcheck passed.
2024-11-19T18:46:28.513995Z INFO vector::topology::builder: Healthcheck passed.
{"name":"test.ops1.metric2_value","tags":{"host":"myhostname","region":"us-west"},"kind":"absolute","counter":{"value":1000.0}}

Example Data

I tested with a couple of scripts: here is a bash example:

for i in {1..1000}; do
    (
        curl -i -XPOST 'http://influxdb-vector.apps.cluster-url.com/write' \
             --data-binary 'test.ops1.metric2,host=myhostname,region=us-west value=1' \
             > /dev/null 2>&1
    ) &
done
wait  # block until all background requests have finished

Same behavior happens with a python script using aiohttp.
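For anyone reproducing this without bash, here is a minimal Python sketch of the same load test using only the standard library. It is not the aiohttp script mentioned above; the helper names (`build_line`, `post_line`, `send_many`) and the worker count are assumptions made for illustration, and the URL must point at your own Vector `http_server` source.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_line(measurement, tags, field, value):
    """Build one InfluxDB line-protocol record: measurement,tag=val field=value."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    head = f"{measurement},{tag_part}" if tag_part else measurement
    return f"{head} {field}={value}"


def post_line(url, line, timeout=5.0):
    """POST a single line to the write endpoint and return the HTTP status."""
    req = urllib.request.Request(url, data=line.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def send_many(url, line, n=1000, workers=50):
    """Send the same line n times concurrently (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: post_line(url, line), range(n)))


line = build_line(
    "test.ops1.metric2",
    {"host": "myhostname", "region": "us-west"},
    "value",
    1,
)
print(line)
# To run the load test against your own endpoint:
# send_many("http://influxdb-vector.apps.cluster-url.com/write", line)
```

Only the payload is built at import time; the network call is left commented out so the snippet can be run safely before pointing it at a real endpoint.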

Additional Context

Vector is running on Kubernetes.

decimalst commented 2 days ago

I can't tell, but I think this is actually intended behavior, based on this footnote in the docs.

Aggregation Behavior
Metrics are aggregated based on their kind. During an interval, incremental metrics are “added” and newer absolute metrics replace older ones in the same series. This results in a reduction of volume and less granularity, while maintaining numerical correctness. As an example, two incremental counter metrics with values 10 and 13 processed by the transform during a period would be aggregated into a single incremental counter with a value of 23. Two absolute gauge metrics with values 93 and 95 would result in a single absolute gauge with the value of 95. More complex types like distribution, histogram, set, and summary behave similarly with incremental values being combined in a manner that makes sense based on their type.
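To make the quoted paragraph concrete, here is a small Python sketch of the behavior it describes: incremental values are summed and absolute values are replaced by the newest one. This is only an illustration of the documented semantics, not Vector's implementation; the `Metric` class and `aggregate` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    kind: str  # "incremental" or "absolute"
    value: float


def aggregate(metrics):
    """Collapse one interval's metrics into a single metric per (name, kind) series."""
    out = {}
    for m in metrics:
        key = (m.name, m.kind)
        prev = out.get(key)
        if prev is not None and m.kind == "incremental":
            # Incremental values are additive within the interval.
            out[key] = Metric(m.name, m.kind, prev.value + m.value)
        else:
            # First value seen, or an absolute value: the newest one wins.
            out[key] = m
    return list(out.values())


counters = aggregate([
    Metric("requests", "incremental", 10),
    Metric("requests", "incremental", 13),
])
gauges = aggregate([
    Metric("cpu", "absolute", 93),
    Metric("cpu", "absolute", 95),
])
print(counters[0].value)  # 23
print(gauges[0].value)    # 95
```

Under these semantics, 1000 absolute gauge samples with value 1 collapse to a single gauge with value 1, while 1000 incremental samples with value 1 sum to 1000.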

If I convert the gauge metric I am trying to aggregate to 'incremental' rather than 'absolute', this outputs the 1000 we were expecting.

pront commented 1 day ago

Hi @decimalst,

Setting aside the incremental vs. absolute question for a second, the following sounds like a bug:

"In my console logs sink, the aggregation with count appears to work, and shows a value of 1000, equaling my test suite."

For a given timestamp, do you see different output on the Vector console vs DD metrics?


Generally, metrics can be either absolute or incremental. Absolute metrics represent a "last write overwrites" scenario, where the latest absolute value seen becomes the actual metric value. Incremental metrics, on the other hand, are additive: each new value adjusts the metric's current total.

Also, we provide the https://vector.dev/docs/reference/configuration/global-options/#expire_metrics_secs global option as a way to remove all metrics that have not been updated in the given number of seconds.

decimalst commented 1 day ago

Hey @pront, thanks for your response. I can't tell for sure, but it seems like the count aggregation isn't counting the way I'd expect for absolute gauges. I put together a test script and a configuration that demonstrate the problem. Here's the output from the JSON console sink:

[999 more entries of this, trimmed for brevity]
{"name":"test_metrics_count","tags":{"host":"host2","pod_name":"vector-5695898575-999xb","region":"us-west","type":"timeout_count"},"kind":"absolute","gauge":{"value":1.0}}
{"name":"test_metrics_count_renamed","tags":{"host":"host2","pod_name":"vector-5695898575-999xb","region":"us-west","type":"timeout_count"},"kind":"absolute","counter":{"value":1000.0}}

And the configuration:

customConfig:
  data_dir: /vector-data-dir
  sources:
    influx_http:
      path: "/write"
      response_code: 204
      type: http_server
      address: 0.0.0.0:8086
      method: POST
      decoding:
        codec: "influxdb"
    internal_metrics:
      type: internal_metrics
    influx_http_query:
      path: "/query"
      response_code: 200
      type: http_server
      address: 0.0.0.0:8087
      method: POST
      encoding: text
  transforms:
    add_pod_metadata:
      type: remap
      inputs: ["filter_some_metrics"]
      source: |
        # Add pod name from env variable
        .tags.pod_name = get_env_var!("POD_NAME")
    route_metrics:
      type: route
      inputs: ["add_pod_metadata"]
      route:
        aggregate_incremental_gauges:
          type: vrl
          source: '.name == "test_metrics_count" && .tags.type == "timeout_count"'
    transformed_gauges:
      type: remap
      inputs:
        - route_metrics.aggregate_incremental_gauges
      source: |
        if .name == "test_metrics_count" {
          if .tags.type == "timeout_count"{
            .name = "test_metrics_count_renamed"
          }
        }
    aggregate_incremental_gauges:
      type: aggregate
      inputs:
      - transformed_gauges
      mode: Count
      interval_ms: 10000
    filter_some_metrics:
      type: remap
      inputs:
        - influx_http
      source: |
          #this filters some stuff, but not relevant here
          true == true
  sinks:
    datadog-metrics:
      tls:
        enabled: true
      type: datadog_metrics
      default_api_key: apikeyhere
      inputs: ["route_metrics._unmatched", "aggregate_incremental_gauges"]
    console:
      encoding:
        codec: json
      inputs:
      - route_metrics.aggregate_incremental_gauges
      - influx_http_query
      - aggregate_incremental_gauges
      type: console

Then I just curl 1000 times:

$ for i in {1..1000}; do
    (
        curl -i -XPOST 'http://influxdb-vector.k8surl.com/write' \
             --data-binary 'test_metrics,host=testhost,region=us-west count=1' \
             > /dev/null 2>&1
    ) &
done

In Datadog, I don't see any value reported for the test_metrics_count_renamed metric. This is the config change that works and lets me aggregate a count (I also had to add a pod tag because we have multiple replicas of Vector listening):

    transformed_gauges:
      type: remap
      inputs:
        - route_metrics.aggregate_incremental_gauges
      source: |
        if .name == "test_metrics_count" {
          if .tags.type == "timeout_count"{
            .kind = "incremental"
            .tags.pod_name = get_env_var!("POD_NAME")
          }
        }
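
For reference, the counting seen in the console output above can be sketched in Python as follows. This only mirrors what the console sink printed in this thread (1000 gauge events of value 1.0 in, one counter of value 1000.0 out, with kind "absolute"); it is not Vector's implementation, and `count_events` is a hypothetical name.

```python
from collections import Counter


def count_events(events):
    """Count events per series (name plus tags), ignoring each event's own value."""
    counts = Counter()
    for event in events:
        series = (event["name"], tuple(sorted(event["tags"].items())))
        counts[series] += 1
    return [
        {
            "name": name,
            "tags": dict(tags),
            # "absolute" is copied from the console output in this thread.
            "kind": "absolute",
            "counter": {"value": float(n)},
        }
        for (name, tags), n in counts.items()
    ]


sample = {
    "name": "test_metrics_count_renamed",
    "tags": {"host": "host2", "region": "us-west", "type": "timeout_count"},
    "kind": "absolute",
    "gauge": {"value": 1.0},
}
result = count_events([sample] * 1000)
print(result[0]["counter"]["value"])  # 1000.0
```

The point of the sketch is that the count depends only on how many events arrive in the interval, not on their values, which is why the console shows 1000 regardless of the gauge value.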