vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Metric tags remap and sink aggregation data diff #20807

Closed kaarolch closed 4 months ago

kaarolch commented 4 months ago


Problem

We are trying to use Vector as a Datadog metric proxy to optimize tags for some of the metrics we produce on our side before sending them to Datadog. We have multiple types of metrics, and we discovered that after reducing the number of tags for certain metrics, Datadog reports different metric values.

Let's take the metric telemetry.metric.a, which is an incremental count and was originally shipped with 41 tag keys. We tried to limit it to only the 7-8 tags that are in use. However, when we sent the metric with the limited set of tags (the host tag is included), the overall sum of counts is different:

[screenshot]

Most of the time, the sum of counts on the Datadog side differs by around 1,000 (3,000 vs. 2,000) when we compare a 10-second interval.

In the next step, we tried to remove the unneeded tags on the application side instead (this looks better), but there are still differences in the metrics:

[screenshot]

We know that the datadog_metrics sink in Vector aggregates metrics when batching them, so just for a few tests we disabled batching for the tag-filtered metric (changed around 13:36 in the next screenshot):

batch:
  max_events: 1
  timeout_secs: 1

but the original metric still has some spikes that do not appear in the tag-filtered metric. Most of the filtered tags are very static (set during app deployment); only two of the 34 removed tags could increase cardinality (in this test we included them as well, changed around 14:02 in the next screenshot), but in the end the Datadog UI query uses only the env and pod tags to generate the sum:

[screenshot]

sum:metric.a_org{env:staging, pod:podA}.as_count()
sum:metric.a_filtered{env:staging, pod:podA}.as_count()

Do you know why we see this kind of difference when we perform the filtering on the Vector side? From Datadog's perspective, the two query lines should be almost identical. Over longer time ranges, such as a day or a week, the difference is around 20-30%.

Our flow

|app| -> dogstatsd -> |dd-agent| -> dd-proxy -> | vector aggr tier| -> datadog_metrics -> |datadog backend| 

Configuration

source_a.yml:

type: "datadog_agent"
address: "0.0.0.0:24869"
disable_logs: true
disable_metrics: false
disable_traces: true
multiple_outputs: true
store_api_key: false

route_a.yml

type: route
inputs:
  - source_a.metrics
route:
  test: .tags.tags_filter == "true" && .namespace == "telemetry" && includes(["metrics.a"], .name)

remap_a.yml

type: remap
inputs:
  - route_a.test
source: |
  allowlist_tags = []
  if .name == "metric.a" {
    allowlist_tags = [
      "a",
      "b",
      "c",
      "d",
      "client_tls_version",
      "e",
      "f",
      "g",
      "h",
      "env",
      "environment",
      "host",
      "i",
      "j",
      "k",
      "pod",
      "l",
      "status_class",
      "m",
      "kube_replica_set",
    ]
  }

  # keep only the allowlisted tags that are actually present on the event
  new_tags = {}
  for_each(allowlist_tags) -> |_, key| {
    tag_value = get!(.tags, [key])
    if tag_value != null {
      new_tags = set!(new_tags, [key], tag_value)
    }
  }

  .tags = new_tags
  .name = "metrics.a_filtered"
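
As an aside, the same allowlist could probably be expressed more compactly with VRL's filter enumeration function; this is only a sketch (the tag list is trimmed here for brevity, and it assumes a Vector release whose VRL ships filter), not what we run:

type: remap
inputs:
  - route_a.test
source: |
  if .name == "metric.a" {
    allowlist_tags = ["env", "environment", "host", "pod", "status_class"]  # trimmed for brevity
    # keep only the tags whose key appears on the allowlist
    .tags = filter(object!(.tags)) -> |key, _value| {
      includes(allowlist_tags, key)
    }
    .name = "metrics.a_filtered"
  }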

dd_agent.yml

type: datadog_metrics
inputs:
  - remap_a
default_api_key: xxx
endpoint: xxx
buffer:
  - type: memory
    max_events: 20000
    when_full: drop_newest
batch:
  max_events: 1
  timeout_secs: 1
request:
  concurrency: "adaptive"
  rate_limit_duration_secs: 1
  rate_limit_num: 50
  retry_attempts: 15
  retry_max_duration_secs: 1800
  retry_initial_backoff_secs: 1
  timeout_secs: 5
  adaptive_concurrency:
    decrease_ratio: 0.7
    ewma_alpha: 0.4
    initial_concurrency: 2
    rtt_deviation_scale: 2.5

The original metric is transferred via a separate sink and flow that does not filter tags.

Version

0.37.0

Debug Output

No response

Example Data

No response

Additional Context

We understand why the value could be different for a gauge:

absolute metrics are replaced by the later value

but why did we observe a significant difference for incremental metrics?

References

#7938

jszwedko commented 4 months ago

Hi @kaarolch ,

I think one possible thing you are running into is that Datadog requires each incoming metric point (name, timestamp, and tags) to be unique. If a duplicate point comes in later with the same name, timestamp, and tags, it will overwrite the original point. Are you removing any tags that could cause duplicates? I see you kept host, which, if removed, would often cause duplication, but maybe there are other removed tags that were unique?
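
To illustrate with a made-up example (the replica tag name, timestamp, and values below are invented for this sketch): two counter points that were distinct only because of a removed tag collapse into one series after filtering, and the later submission wins:

# two incoming counter points at the same timestamp, distinct only by a
# tag that the remap removes (replica is hypothetical):
- name: metric.a
  timestamp: "2024-05-20T13:36:00Z"
  tags: { env: staging, pod: podA, replica: r1 }
  counter: { value: 600 }
- name: metric.a
  timestamp: "2024-05-20T13:36:00Z"
  tags: { env: staging, pod: podA, replica: r2 }
  counter: { value: 400 }

# after filtering, both points become metric.a_filtered{env:staging, pod:podA}
# at the same timestamp, so the second point overwrites the first and a
# sum(...).as_count() over that interval reports 400 instead of 1000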

kaarolch commented 4 months ago

Hi @jszwedko, the host tag was there:

      "env",
      "environment",
      "host",
      "i",

but as you mentioned, there could be cases where our egress pods send identical metadata + timestamp; I need to check that.

kaarolch commented 4 months ago

We added an extra egress_host tag, and the filtered metric, even with batching enabled, looks quite good; I'm still trying to investigate the spikes.

[screenshot]
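
For reference, this is roughly the change (a sketch appended at the end of the remap source; it assumes the egress pod name is exposed to Vector via the HOSTNAME environment variable, which may differ in your deployment):

  # keep points from different egress pods as distinct series after
  # tag filtering (HOSTNAME is an assumption about our deployment)
  .tags.egress_host = get_env_var("HOSTNAME") ?? "unknown"
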
jszwedko commented 4 months ago

Ah great, that does look better.

jszwedko commented 4 months ago

@kaarolch do you think it'd be reasonable to close this out? I think the slight differences are likely just due to reaggregation in Vector.

kaarolch commented 4 months ago

Yes, we are clear now, thank you!