Hi @kaarolch,
I think one possible thing you are running into is that Datadog requires each incoming metric point (name, timestamp, and tags) to be unique. If a duplicate point comes in later with the same name, timestamp, and tags, it will overwrite the original point. Are you removing any tags that could cause duplicates? I do see you have the host tag, whose removal would often cause duplication, but maybe there are other tags that keep the points unique?
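For example (hypothetical points, not taken from your data), two counter points that differ only in the host tag collapse into a single point once that tag is removed:

```yaml
# Before filtering: two distinct points for the same name and timestamp.
- name: telemetry.metric.a
  timestamp: "2024-05-01T12:00:00Z"
  tags: { env: prod, host: node-a }
  value: 3
- name: telemetry.metric.a
  timestamp: "2024-05-01T12:00:00Z"
  tags: { env: prod, host: node-b }
  value: 2
# After removing host, both points share the key
# (telemetry.metric.a, 12:00:00, {env: prod}); the later point overwrites
# the earlier one, so the sum reported by Datadog drops from 5 to 2.
```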
Hi @jszwedko,
The host tag was there:
"env",
"environment",
"host",
"i",
but as you mentioned, there could be some cases where our egress pods send identical metadata + timestamp; I need to check that.
We added an extra egress_host tag, and the filtered metrics, even with batching enabled, look quite good; I'm still trying to investigate the spikes.
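Roughly along these lines (a sketch, not our exact config; the transform name and the HOSTNAME env var are assumptions):

```yaml
transforms:
  add_egress_host:        # hypothetical name
    type: remap
    inputs: [remap_a]
    source: |
      # Tag every point with the Vector (egress) pod it passed through so that
      # points from different egress pods never collide on (name, timestamp, tags).
      .tags.egress_host = get_env_var!("HOSTNAME")
```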
Ah great, that does look better.
@kaarolch do you think it'd be reasonable to close this out? I think the slight differences are likely just due to reaggregation in Vector.
Yes, we are clear now. Thank you!
Problem
We are trying to use Vector as a Datadog metrics proxy to optimize tags for some of the metrics we produce on our side before sending them to Datadog. We have multiple types of metrics, and we discovered that after reducing the number of tags for certain metrics, Datadog reports different metric values.
Let's take the metric telemetry.metric.a, which is an incremental count and was shipped with 41 tag keys. We tried to limit it to only the 7-8 tags that are actually in use. However, when we sent the metric with the limited set of tags (the host tag is included), the overall sum of the count is different. The screenshot compares three series:
- original metric: delivered without filtering
- tags filtered: metric filtered by a remap transform
- tags filtered aggr: metric with filtered tags (remap) and aggregated in 10s intervals

Most of the time, the sum of counts on the Datadog side differs by around 1,000 (3,000 vs. 2,000) when we compare a 10-second interval.
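The 10-second aggregation in the tags filtered aggr series is done with Vector's aggregate transform, along these lines (a sketch; the transform name and inputs are placeholders):

```yaml
transforms:
  aggregate_10s:          # hypothetical name
    type: aggregate
    inputs: [remap_a]
    interval_ms: 10000    # flush aggregated counters every 10 seconds
```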
In the next step, we tried to remove the unneeded tags on the application side (this looks better), but there are still differences in the metrics.
We knew that the datadog_metrics sink in Vector aggregates metrics when batching, so just for a few tests we disabled batching for the tags filtered series (changed around 13:36 in the next screenshot). Still, the original metric has some spikes that do not appear in the filtered-tags metric. Most of the filtered tags are very static (set during app deployment); only two of the 34 removed tags could increase cardinality (in this test we included them as well, changed around 14:02 in the next screenshot), but in the end the query in the Datadog UI only uses the env and pod tags to generate the sum. Do you know why we see this kind of difference when we perform the filtering on the Vector side? From the perspective of Datadog, the query lines should be almost identical. Over longer terms, such as a day or a week, the difference is around 20-30%.
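For context, the tag filtering in the remap transform is essentially an allowlist, along these lines (a simplified sketch, not the exact remap_a.yml; the input name and the tag names in the allowlist are placeholders):

```yaml
transforms:
  remap_a:
    type: remap
    inputs: ["route_a.metrics"]   # placeholder route output
    source: |
      # Keep only the handful of tags that are actually used in Datadog queries.
      # The allowlist below is illustrative; the real one has 7-8 keys.
      allowed = ["env", "environment", "host", "pod", "egress_host"]
      if exists(.tags) {
        .tags = filter(object!(.tags)) -> |key, _value| { includes(allowed, key) }
      }
```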
Our flow
Configuration
source_a.yml:
route_a.yml:
remap_a.yml:
dd_agent.yml:
The original metric is transferred via a separate sink and flow that do not filter tags.
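For illustration, the Datadog side of dd_agent.yml looks roughly like the sketch below; only the batch options relevant to this issue are shown, and the names and values are placeholders rather than our exact settings:

```yaml
sinks:
  dd_metrics:             # hypothetical name
    type: datadog_metrics
    inputs: [remap_a]
    default_api_key: "${DATADOG_API_KEY}"
    batch:
      # With small values the sink effectively stops re-aggregating counters
      # before sending; larger values let it merge points within a batch.
      max_events: 1
      timeout_secs: 1
```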
Version
0.37.0
Debug Output
No response
Example Data
No response
Additional Context
We understand why the value could differ for a gauge metric, but why do we observe a significant difference for an incremental count?
References
#7938