open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

prometheusreceiver and statsdreceiver behave differently in terms of setting "OTelLib" when awsemfexporter is used #24298

Closed mkielar closed 1 year ago

mkielar commented 1 year ago

Component(s)

exporter/awsemf, receiver/prometheus, receiver/statsd

What happened?

Description

We have aws-otel-collector 0.30.0 running alongside a Java app (which exposes Prometheus metrics) and an AWS/Envoy sidecar (which exposes StatsD metrics). aws-otel-collector is configured to process both sources in separate pipelines and to push the metrics to AWS CloudWatch using awsemfexporter. We previously used version 0.16.1 of the aws-otel-collector and are only now upgrading.

Previously, metrics from both sources were stored in CloudWatch "as-is". After the upgrade, however, we noticed that the Prometheus metrics gained a new dimension: OTelLib, with the value otelcol/prometheusreceiver. This obviously broke a few things on our end (such as CloudWatch Alarms).

After digging a bit, I found these two tickets, which were supposed to bring both of these receivers to the same place in terms of populating otel.library.name:

Unfortunately, I was not able to work out how that translates into the OTelLib metric dimension set by awsemfexporter, but the two seem related at this point.

My understanding is that it is a de-facto standard for receivers to add the name and version of the instrumentation library to processed metrics, but I do not understand how, or why, that information is being added as a dimension. I also do not know whether that is an expected outcome, so it is hard for me to tell whether this is a bug in prometheusreceiver (that it adds the dimension), in statsdreceiver (that it does not), or in awsemfexporter. I'd be grateful for any guidance on this matter.
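
For anyone trying to follow along, here is a minimal sketch (based on the collector's pdata API, not the actual receiver or exporter code) of where that scope name lives in the metric data and how an exporter could read it back, which is presumably what ends up as the OTelLib dimension:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	// A receiver builds pmetric.Metrics and records its own name/version
	// on the instrumentation scope (this is what otel.library.name maps to).
	md := pmetric.NewMetrics()
	sm := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty()
	sm.Scope().SetName("otelcol/prometheusreceiver")
	sm.Scope().SetVersion("0.78.0")

	m := sm.Metrics().AppendEmpty()
	m.SetName("kafka_consumer_consumer_fetch_manager_metrics_records_lag")
	m.SetEmptyGauge().DataPoints().AppendEmpty().SetDoubleValue(42)

	// An exporter iterating the same batch sees the scope next to every metric,
	// so it can (if it chooses to) emit the scope name as an extra dimension.
	rms := md.ResourceMetrics()
	for i := 0; i < rms.Len(); i++ {
		sms := rms.At(i).ScopeMetrics()
		for j := 0; j < sms.Len(); j++ {
			fmt.Println("scope:", sms.At(j).Scope().Name(), sms.At(j).Scope().Version())
		}
	}
}
```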

Steps to Reproduce

  1. Use the collector configuration below with two separate sources of metrics (StatsD and Prometheus).
  2. You can adjust (or disable) the metric filtering if your sources differ from mine.

Expected Result

I would expect the following:

  1. Make the receivers produce metrics the same way, so that awsemfexporter either adds the new OTelLib dimension regardless of where the metrics come from, or does not add it at all. I'm not sure which is considered the "correct" behaviour here, but I would expect it to be consistent across receivers.
  2. I'm not very proficient in Go, but from what I can make of the awsemfexporter code, it has dedicated logic for handling the OTelLib dimension. I think it would be a good idea to add a switch that controls whether the OTelLib dimension is added at all (see the sketch after this list). In our case, forcefully adding this new dimension to all collected metrics will break a lot of things in our observability solution.
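
To make the request concrete, below is a rough sketch of the kind of switch I mean. The `add_otel_lib_dimension` option name and the `buildDimensions` helper are purely hypothetical; awsemfexporter exposes no such setting today, which is exactly the point of this request:

```go
package main

import "fmt"

// Config sketches a hypothetical exporter option; awsemfexporter does not
// currently expose such a switch.
type Config struct {
	AddOTelLibDimension bool `mapstructure:"add_otel_lib_dimension"` // hypothetical flag name
}

// buildDimensions is an illustrative helper, not awsemfexporter code: it adds
// the OTelLib dimension only when the (hypothetical) switch is enabled.
func buildDimensions(cfg Config, scopeName string, labels map[string]string) map[string]string {
	dims := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		dims[k] = v
	}
	if cfg.AddOTelLibDimension && scopeName != "" {
		dims["OTelLib"] = scopeName
	}
	return dims
}

func main() {
	labels := map[string]string{"ClusterName": "staging"}
	// Same metric labels, with and without the OTelLib dimension.
	fmt.Println(buildDimensions(Config{AddOTelLibDimension: false}, "otelcol/prometheusreceiver", labels))
	fmt.Println(buildDimensions(Config{AddOTelLibDimension: true}, "otelcol/prometheusreceiver", labels))
}
```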

Actual Result

  1. Metrics collected by prometheusreceiver are stored by awsemfexporter with an additional OTelLib dimension set to otelcol/prometheusreceiver.
  2. Metrics collected by statsdreceiver are stored by an identically configured awsemfexporter without the OTelLib dimension.
  3. There is no way to configure awsemfexporter so that it does not add the OTelLib dimension.

Collector version

v0.78.0 (according to: https://github.com/aws-observability/aws-otel-collector/releases/tag/v0.30.0)

Environment information

Environment

OS: AWS ECS / Fargate. We're running a custom-built Docker image based on amazonlinux:2, with a Dockerfile looking like the one below:

# Build the collector image on top of Amazon Linux 2.
FROM amazonlinux:2 as appmesh-otel-collector
ARG OTEL_VERSION=0.30.0
# Install the aws-otel-collector RPM published by AWS, plus basic tooling.
RUN yum install -y \
        procps \
        shadow-utils \
        https://aws-otel-collector.s3.amazonaws.com/amazon_linux/amd64/v${OTEL_VERSION}/aws-otel-collector.rpm \
    && yum clean all
# Run the collector as a non-root "sidecar" user.
RUN useradd -m --uid 1337 sidecar && \
    echo "sidecar ALL=NOPASSWD: ALL" >> /etc/sudoers && \
    chown -R sidecar /opt/aws/aws-otel-collector
USER sidecar
ENV RUN_IN_CONTAINER="True"
ENV HOME="/home/sidecar"
ENTRYPOINT ["/opt/aws/aws-otel-collector/bin/aws-otel-collector"]

OpenTelemetry Collector configuration

"exporters":
  "awsemf/prometheus/custom_metrics":
    "dimension_rollup_option": "NoDimensionRollup"
    "log_group_name": "/aws/ecs/staging/kafka-snowflake-connector"
    "log_stream_name": "emf/otel/prometheus/custom_metrics/{TaskId}"
    "namespace": "staging/KafkaSnowflakeConnector"
  "awsemf/statsd/envoy_metrics":
    "dimension_rollup_option": "NoDimensionRollup"
    "log_group_name": "/aws/ecs/staging/kafka-snowflake-connector"
    "log_stream_name": "emf/otel/statsd/envoy_metrics/{TaskId}"
    "namespace": "staging/AppMeshEnvoy"
"processors":
  "batch/prometheus/custom_metrics":
    "timeout": "60s"
  "batch/statsd/envoy_metrics":
    "timeout": "60s"
  "filter/prometheus/custom_metrics":
    "metrics":
      "include":
        "match_type": "regexp"
        "metric_names":
        - "^kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate$"
        - "^kafka_consumer_consumer_fetch_manager_metrics_records_consumed_rate$"
        - "^kafka_connect_connect_worker_metrics_connector_running_task_count$"
        - "^kafka_connect_connect_worker_metrics_connector_failed_task_count$"
        - "^kafka_consumer_consumer_fetch_manager_metrics_records_lag_max$"
        - "^kafka_consumer_consumer_fetch_manager_metrics_records_lag$"
        - "^snowflake_kafka_connector_.*_OneMinuteRate$"
  "filter/statsd/envoy_metrics":
    "metrics":
      "include":
        "match_type": "regexp"
        "metric_names":
        - "^envoy\\.http\\.rq_total$"
        - "^envoy\\.http\\.downstream_rq_xx$"
        - "^envoy\\.http\\.downstream_rq_total$"
        - "^envoy\\.http\\.downstream_rq_time$"
        - "^envoy\\.cluster\\.upstream_cx_connect_timeout$"
        - "^envoy\\.cluster\\.upstream_rq_timeout$"
        - "^envoy\\.appmesh\\.RequestCountPerTarget$"
        - "^envoy\\.appmesh\\.TargetResponseTime$"
        - "^envoy\\.appmesh\\.HTTPCode_.+$"
  "resource":
    "attributes":
    - "action": "extract"
      "key": "aws.ecs.task.arn"
      "pattern": "^arn:aws:ecs:(?P<Region>.*):(?P<AccountId>.*):task/(?P<ClusterName>.*)/(?P<TaskId>.*)$"
  "resourcedetection":
    "detectors":
    - "env"
    - "ecs"
"receivers":
  "prometheus/custom_metrics":
    "config":
      "global":
        "scrape_interval": "1m"
        "scrape_timeout": "10s"
      "scrape_configs":
      - "job_name": "staging/KafkaSnowflakeConnector"
        "metrics_path": ""
        "sample_limit": 10000
        "static_configs":
        - "targets":
          - "localhost:9404"
  "statsd/envoy_metrics":
    "aggregation_interval": "60s"
    "endpoint": "0.0.0.0:8125"
"service":
  "pipelines":
    "metrics/prometheus/custom_metrics":
      "exporters":
      - "awsemf/prometheus/custom_metrics"
      "processors":
      - "resourcedetection"
      - "resource"
      - "filter/prometheus/custom_metrics"
      - "batch/prometheus/custom_metrics"
      "receivers":
      - "prometheus/custom_metrics"
    "metrics/statsd/envoy_metrics":
      "exporters":
      - "awsemf/statsd/envoy_metrics"
      "processors":
      - "resourcedetection"
      - "resource"
      - "filter/statsd/envoy_metrics"
      - "batch/statsd/envoy_metrics"
      "receivers":
      - "statsd/envoy_metrics"

Log output

N/A

Additional context

N/A

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

mkielar commented 1 year ago

I think the reason for this may be a difference in implementation. See this fragment of the prometheusreceiver implementation: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/internal/transaction.go#L201-L202

vs. this implementation in statsdreceiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/statsdreceiver/protocol/statsd_parser.go#L249-L252

You can see that the objects/types on which the Name and Version attributes are set differ (pcommon.InstrumentationScope for statsdreceiver vs. pmetric.NewMetrics -> ResourceMetrics -> ScopeMetrics -> Scope for prometheusreceiver). It seems the latter makes awsemfexporter use the receiver name as a metric dimension, and the former does not.

@paologallinaharbur, you seem to be the author of both of those implementations, can you please take a look and/or comment on the issue?

Also: we are testing the behaviour of https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver with awsemfexporter as well; I should have some results later this week.

mkielar commented 1 year ago

I've just realized that aws-otel-collector 0.30.0 uses the 0.78.0 release of opentelemetry-collector-contrib, and the changes introduced by #23563 were only merged in 0.81.0. I'm going to close this ticket, wait for aws-otel-collector to catch up with the latest changes, and then test again. Apologies for the noise...

paologallinaharbur commented 1 year ago

@mkielar

I did some investigation that I'll dump here in case you need it (otherwise, ignore it).

You can see that the objects/types on which the Name and Version attributes are set differ (pcommon.InstrumentationScope for statsdreceiver vs. pmetric.NewMetrics -> ResourceMetrics -> ScopeMetrics -> Scope for prometheusreceiver). It seems the latter makes awsemfexporter use the receiver name as a metric dimension, and the former does not.

prometheusreceiver's SetName also acts on the pcommon.InstrumentationScope returned by scope() in the line you mentioned.
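
For illustration (this is just a sketch of the pdata API, not the receiver code itself), both call sites end up writing to the same pcommon.InstrumentationScope type, whether the scope is held in a variable or reached through ScopeMetrics().Scope():

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

func main() {
	// prometheusreceiver-style: set Name/Version by walking the metrics tree.
	promMD := pmetric.NewMetrics()
	promScope := promMD.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty().Scope()
	promScope.SetName("otelcol/prometheusreceiver")
	promScope.SetVersion("0.78.0")

	// statsdreceiver-style: hold the scope in a variable and set Name/Version on it.
	statsdMD := pmetric.NewMetrics()
	var statsdScope pcommon.InstrumentationScope = statsdMD.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty().Scope()
	statsdScope.SetName("otelcol/statsdreceiver")
	statsdScope.SetVersion("0.78.0")

	// Both variables have the same type, so exporters see the scope identically.
	fmt.Printf("%T %T\n", promScope, statsdScope)
}
```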

I ran the tests, and how the scope is added seems exactly the same (I would say the SetName and SetVersion calls are good safety nets). Moreover, you mentioned that

Statsd

[screenshot]

Prometheus

[screenshot]

Redis

[screenshot]

mkielar commented 1 year ago

@paologallinaharbur, I managed to set up a local workspace and debug the tests, and I saw exactly what you're showing in the screenshots. That led me to realize that it's not the implementation, but simply an older version of the dependency in aws-otel-collector. As I said, I'll wait for AWS to catch up and try upgrading again in a month or two.

Anyway, thanks a lot for looking into that (and apologies for wasting your time).