open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[statsdreceiver] incorrect aggregation when several producers report the same metrics #23809

Closed sigilioso closed 6 months ago

sigilioso commented 1 year ago

Component(s)

receiver/statsd

What happened?

Description

The metrics reported through the statsd receiver are not aggregated correctly when they come from different client addresses and are not decorated with any attribute identifying the client. Instead of aggregating metrics with the same type and the same set of attribute keys and values, as described in the docs, the receiver aggregates metrics with the same type, the same set of attribute keys and values, and also the same origin address, yet it does not add any attribute identifying that origin address.

As I see it, this is a common scenario: metrics coming from different producers (or even from the same producer over different connections) do not include specific attributes identifying the producer.

If the k8sattributesprocessor is used to decorate the metrics, this is not an issue, since the metrics will have attributes identifying their origin. However, using that processor is not an option in every environment, nor is it mandatory when using the receiver.

It is also possible to avoid this problem by decorating the metrics in the client, but as stated in #15290, that can be cumbersome or just not possible.
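To make the described grouping concrete, here is a minimal Go sketch of aggregation keyed by client address in addition to the metric identity. This is only an illustration of the behaviour, not the receiver's actual code; the aggregationKey type and the add helper are made up for this sketch.

package main

import "fmt"

// aggregationKey is a hypothetical key that includes the client address, so
// identical metrics arriving from different source ports land in separate
// buckets.
type aggregationKey struct {
    clientAddr string // e.g. "127.0.0.1:50001", changes with each short-lived connection
    metricName string
    attributes string // flattened attribute set, e.g. "myKey:myVal"
}

func add(sums map[aggregationKey]int64, addr, name, attrs string, value int64) {
    sums[aggregationKey{clientAddr: addr, metricName: name, attributes: attrs}] += value
}

func main() {
    sums := map[aggregationKey]int64{}

    // Same metric and attributes, but sent from two different ephemeral ports.
    add(sums, "127.0.0.1:50001", "test.metric", "myKey:myVal", 10)
    add(sums, "127.0.0.1:50002", "test.metric", "myKey:myVal", 20)

    // Prints 2 buckets (10 and 20) instead of a single aggregated value of 30.
    fmt.Println(len(sums), sums)
}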

Steps to Reproduce

  1. Run the collector with the statsd receiver and the file exporter:
receivers:
  statsd:
    endpoint: "0.0.0.0:8125"
    aggregation_interval: 10s
    is_monotonic_counter: false

processors:
  batch:

exporters:
  file/tmp:
    path: /tmp/collector-output.json

service:
  telemetry:
    logs:
      level: debug
  pipelines:
    metrics:
      receivers: [statsd]
      processors: [batch]
      exporters: [file/tmp]
  2. Report two values for the same metric with the same set of attributes within the same aggregation interval:
❯ echo "test.metric:10|c|#myKey:myVal" | nc -w 1 -u localhost 8125
❯ echo "test.metric:20|c|#myKey:myVal" | nc -w 1 -u localhost 8125

Expected Result

The metric is expected to be aggregated: I'd expect a single value of 30 with the attribute myKey:myVal for test.metric.

Actual Result

I get two values of test.metric with the same set of attributes (myKey:myVal) and the same timestamp.

Sample output from /tmp/collector-output.json:

{
    "resourceMetrics": [
        {
            "resource": {},
            "scopeMetrics": [
                {
                    "scope": {
                        "name": "otelcol/statsdreceiver",
                        "version": "0.80.0"
                    },
                    "metrics": [
                        {
                            "name": "test.metric",
                            "sum": {
                                "dataPoints": [
                                    {
                                        "attributes": [
                                            {
                                                "key": "myKey",
                                                "value": {
                                                    "stringValue": "myVal"
                                                }
                                            }
                                        ],
                                        "startTimeUnixNano": "1687877475484608000",
                                        "timeUnixNano": "1687877485483610000",
                                        "asInt": "20"
                                    }
                                ],
                                "aggregationTemporality": 1
                            }
                        }
                    ]
                }
            ]
        },
        {
            "resource": {},
            "scopeMetrics": [
                {
                    "scope": {
                        "name": "otelcol/statsdreceiver",
                        "version": "0.80.0"
                    },
                    "metrics": [
                        {
                            "name": "test.metric",
                            "sum": {
                                "dataPoints": [
                                    {
                                        "attributes": [
                                            {
                                                "key": "myKey",
                                                "value": {
                                                    "stringValue": "myVal"
                                                }
                                            }
                                        ],
                                        "startTimeUnixNano": "1687877475484608000",
                                        "timeUnixNano": "1687877485483610000",
                                        "asInt": "10"
                                    }
                                ],
                                "aggregationTemporality": 1
                            }
                        }
                    ]
                }
            ]
        }
    ]
}

Collector version

v0.80.0

Environment information

No response

OpenTelemetry Collector configuration

receivers:
  statsd:
    endpoint: "0.0.0.0:8125"
    aggregation_interval: 10s
    is_monotonic_counter: false

processors:
  batch:

exporters:
  file/tmp:
    path: /tmp/collector-output.json

service:
  telemetry:
    logs:
      level: debug
  pipelines:
    metrics:
      receivers: [statsd]
      processors: [batch]
      exporters: [file/tmp]

Log output

No response

Additional context

Possible cause

This can be reproduced using netcat because it closes the UDP connection right after writing the message. Therefore, a different origin port is used for each metric unless the port is specified explicitly. As a result, metrics are aggregated separately here and different batches are obtained here.

If we use the same port to write the metrics:

❯ echo "test.metric:20|c|#myKey:myVal" | nc -w 1 -p 5000 -u localhost 8125
❯ echo "test.metric:10|c|#myKey:myVal" | nc -w 1 -p 5000 -u localhost 8125

We get the expected results:

{
    "resourceMetrics": [
        {
            "resource": {},
            "scopeMetrics": [
                {
                    "scope": {
                        "name": "otelcol/statsdreceiver",
                        "version": "0.80.0"
                    },
                    "metrics": [
                        {
                            "name": "test.metric",
                            "sum": {
                                "dataPoints": [
                                    {
                                        "attributes": [
                                            {
                                                "key": "myKey",
                                                "value": {
                                                    "stringValue": "myVal"
                                                }
                                            }
                                        ],
                                        "startTimeUnixNano": "1687878125480554000",
                                        "timeUnixNano": "1687878135480173000",
                                        "asInt": "30"
                                    }
                                ],
                                "aggregationTemporality": 1
                            }
                        }
                    ]
                }
            ]
        }
    ]
}
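
For completeness, the same workaround expressed as a minimal Go client that keeps a single UDP socket open, so both datagrams arrive from one source address (the endpoint and values match the reproduction above; this is a sketch, not taken from any existing client):

package main

import (
    "log"
    "net"
    "time"
)

func main() {
    // One persistent UDP socket: both payloads share the same source address,
    // so the receiver aggregates them into a single data point.
    conn, err := net.Dial("udp", "localhost:8125")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    for _, payload := range []string{
        "test.metric:10|c|#myKey:myVal",
        "test.metric:20|c|#myKey:myVal",
    } {
        if _, err := conn.Write([]byte(payload)); err != nil {
            log.Fatal(err)
        }
        time.Sleep(100 * time.Millisecond) // stay well inside one aggregation interval
    }
}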

Considered solutions

  1. Always decorate metrics with the producer's address. This way, there would be an attribute identifying each metric's origin, so we would not report several values for the same metric and the same set of attributes at a given timestamp. This approach would be very simple to implement, but it would add weight to every reported metric (even when the origin address is not an interesting attribute for the user's use case). As I see it, it would also introduce a breaking change, so we should avoid it if possible.

  2. Take a different approach to identifying the producer's address, so the metrics can still be decorated by processors that rely on it, such as the k8sattributesprocessor. Metrics could optionally be decorated with the producer's address (controlled by a configuration flag), and we could avoid grouping by that address while performing the aggregation. This approach would require a different processor configuration to make statsdreceiver and k8sattributesprocessor work together; a rough sketch of what such a flag could look like follows below.
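
As a purely hypothetical sketch of option 2 (the attach_client_address field name and its semantics are assumptions for illustration, not an existing setting of the receiver), the configuration could expose something like:

package statsdreceiver

import "time"

// Config sketches what an opt-in flag for option 2 above could look like.
type Config struct {
    Endpoint            string        `mapstructure:"endpoint"`
    AggregationInterval time.Duration `mapstructure:"aggregation_interval"`

    // Hypothetical flag: when true, decorate data points with the producer's
    // address so processors such as k8sattributesprocessor can rely on it,
    // without using that address as part of the aggregation key.
    AttachClientAddress bool `mapstructure:"attach_client_address"`
}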

If the bug is confirmed, we would be willing to help solve it.

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

Pinging code owners for receiver/statsd: @jmacd @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

sigilioso commented 1 year ago

Hi @jmacd @dmitryax,

As I see it, the described scenario may be pretty common: several producers report the same metrics without a specific attribute to identify them. Could you take a look and confirm whether the aggregation is incorrect? Please let me know if you need additional details.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

akayami commented 1 year ago

I've run into the same problem.

statsdreceiver does not aggregate messages correctly and basically causes the metric to be exported as a single unaggregated value. For instance, if I send 10 messages like "test.metricA:5|c|#gate:otel" within 10 seconds, I expect the 10-second block to have a value of 50; the otel statsd receiver reports 5.

The larger the aggregation time span, the larger the error gets. As is, this receiver is unusable, because even with an aggregation interval of 1s it still loses values.
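
To make that failure mode concrete, here is a minimal Go sketch (assuming a collector listening on localhost:8125 as configured above) that mimics a short-lived client: each increment is written over a fresh UDP socket and therefore arrives from a different ephemeral source port.

package main

import (
    "log"
    "net"
)

func main() {
    // Ten increments of 5 for the same counter, each over a new socket.
    // With per-address aggregation they end up in ten separate buckets
    // instead of a single sum of 50.
    for i := 0; i < 10; i++ {
        conn, err := net.Dial("udp", "localhost:8125")
        if err != nil {
            log.Fatal(err)
        }
        if _, err := conn.Write([]byte("test.metricA:5|c|#gate:otel")); err != nil {
            log.Fatal(err)
        }
        conn.Close()
    }
}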

github-actions[bot] commented 11 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

emersonclaireiaquinto commented 10 months ago

My team has the same problem. The un-aggregated metrics flow directly into the awsemf exporter, which causes a conflict because the exporter doesn't allow duplicate metrics in its batch, so it drops all but one of the metrics before export. This is a confirmed issue in both Docker (using docker compose and netcat) and Kubernetes (Kong statsd exporter -> ADOT).

charz commented 10 months ago

I think this has the same root cause as https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/29508, where I can reproduce this problem. Please check https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/30228 to see if that helps.

github-actions[bot] commented 8 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 6 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

Manuelraa commented 6 months ago

Is it correct that this should be closed instead of getting fixed?

Manuelraa commented 6 months ago

I'm not a statsd expert myself, though I have mixed feelings about the fix proposed here and in #30228. If I read the code correctly, metrics would be aggregated correctly as long as a persistent UDP connection is used. By implementing #30228, the aggregation by address becomes pointless and could just be removed, since all metrics would then be aggregated under the statsdreceiver's endpoint address.

Let's take Telegraf's implementation as an example. For reference, they store and buffer the packet input here: link. What we see during processing is that the connection address is completely ignored unless the metric uses the Datadog statsd format; when that format is used, the connection address is set as the source tag.

I would argue that applications should expose tags which identify themselves. Still, aggregating by connection source can be useful.

Long story short: I propose making the address aggregation optional via a configuration flag, and also adding a resource attribute for the source to make the resulting aggregation clearer. A sketch of how such a source attribute might surface is below.
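
To illustrate how such a source attribute could surface in the exported data, here is a sketch using the collector's pdata API; the source.address key and the address value are just examples, the actual naming would be decided in the PR.

package main

import (
    "fmt"

    "go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
    md := pmetric.NewMetrics()

    // Hypothetical: record the producer's address as a resource attribute
    // instead of (or in addition to) using it as an aggregation key.
    rm := md.ResourceMetrics().AppendEmpty()
    rm.Resource().Attributes().PutStr("source.address", "10.0.0.12")

    m := rm.ScopeMetrics().AppendEmpty().Metrics().AppendEmpty()
    m.SetName("test.metric")
    dp := m.SetEmptySum().DataPoints().AppendEmpty()
    dp.Attributes().PutStr("myKey", "myVal")
    dp.SetIntValue(30) // aggregated total for this source

    fmt.Println(md.MetricCount(), md.DataPointCount())
}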

Manuelraa commented 6 months ago

@jmacd , @dmitryax I have created PR #33010 implementing the suggestion. Could you please reopen this issue and have a look at the PR?