open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

AWS Cloudwatch Receiver stops/errors after a log group gets removed #35361

Open elburnetto-intapp opened 1 month ago

elburnetto-intapp commented 1 month ago

Component(s)

receiver/awscloudwatch

What happened?

Description

We have the AWS CloudWatch Receiver set up to auto-discover and poll log groups in our AWS account and export the logs to Kafka. We chose auto-discovery so that the receiver would pick up added and removed log groups automatically, without manual intervention.

However, we've noticed that when a log group is removed from AWS, the receiver panics and stops completely because it cannot find the log group, instead of ignoring it and continuing to poll the other log groups. It's as if the logic that updates the log group list isn't removing deleted ones.

Steps to Reproduce

Set up the OpenTelemetry Collector to use the receiver with auto-discovery for log groups, let it run for 5 to 10 minutes, then remove a log group from the AWS console.

Expected Result

The receiver should stop polling for logs in a group that no longer exists and continue polling the groups that are still active.

Actual Result

The receiver stops and continuously errors. The only way to stop this is to delete the pod and wait for the receiver to restart.

Collector version

0.101.0

Environment information

Environment

Kubernetes (EKS)

OpenTelemetry Collector configuration

receivers:
  awscloudwatch/rds:
    logs:
      groups:
        autodiscover:
          limit: 500
          prefix: /aws/rds/instance/
      max_events_per_request: 300
      poll_interval: 5m
    region: us-east-1
exporters:
  kafka/logs:
    auth:
      tls:
        ca_file: <path-to-ca-crt>
    brokers: <kafka-broker-url>
    encoding: otlp_json
    producer:
      max_message_bytes: 2000000
    protocol_version: 2.8.0
    retry_on_failure:
      enabled: true
      max_elapsed_time: 600s
      max_interval: 60s
    topic: processed-logs
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - kafka/logs
      receivers:
      - awscloudwatch/rds

Log output

2024-09-23T13:35:52.472Z    error   awscloudwatchreceiver@v0.101.0/logs.go:213  unable to retrieve logs from cloudwatch {"kind": "receiver", "name": "awscloudwatch/rds", "data_type": "logs", "log group": "/aws/rds/instance/test-log-group-instance/sql", "error": "ResourceNotFoundException: The specified log group does not exist."}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver.(*logsReceiver).pollForLogs
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver@v0.101.0/logs.go:213
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver.(*logsReceiver).poll
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver@v0.101.0/logs.go:187
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver.(*logsReceiver).startPolling
    github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver@v0.101.0/logs.go:174

Additional context

No response

schmikei commented 1 month ago

Hmm, it was my understanding that we rediscover log groups on each poll interval, so I imagined it would fit your use case...
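To illustrate what "rediscover on each poll interval" would mean in practice, here is a minimal Go sketch. It is not the receiver's actual code: `discoverFunc`, `pollOnce`, and the in-memory group list are hypothetical names made up for this example. The point is only that if the group list is rebuilt from discovery at the start of every cycle, a deleted group should drop out on the next cycle without special handling.

```go
package main

import (
	"fmt"
	"strings"
)

// discoverFunc is a hypothetical stand-in for a call that lists the log
// group names currently present in the account. It is not the receiver's API.
type discoverFunc func(prefix string, limit int) []string

// pollOnce rediscovers groups and then polls each one, so a group deleted
// before this cycle is simply absent from the list. It returns the groups
// that were polled.
func pollOnce(discover discoverFunc, prefix string, limit int, poll func(group string)) []string {
	groups := discover(prefix, limit)
	for _, g := range groups {
		poll(g)
	}
	return groups
}

func main() {
	// In-memory stand-in for the account's log groups (illustrative names).
	existing := []string{"/aws/rds/instance/a/sql", "/aws/rds/instance/b/sql", "/aws/lambda/x"}
	discover := func(prefix string, limit int) []string {
		var out []string
		for _, g := range existing {
			if strings.HasPrefix(g, prefix) && len(out) < limit {
				out = append(out, g)
			}
		}
		return out
	}
	poll := func(g string) { fmt.Println("polling", g) }

	pollOnce(discover, "/aws/rds/instance/", 500, poll)
	existing = existing[1:] // simulate deleting the first group between cycles
	pollOnce(discover, "/aws/rds/instance/", 500, poll)
}
```

If the real API keeps listing a group for a while after deletion, this approach alone would not help, which is the case the debug logs are meant to confirm or rule out.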

Based on the code, it's not panicking on that error; it's behaving correctly for that request. Just from looking at the code, any other groups should still be getting collected.

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/5ec6872d5bffeddcb708437e9be98ab06b668d1a/receiver/awscloudwatchreceiver/logs.go#L213

The only reason I can think of is that the AWS CloudWatch Logs API is still returning the deleted log group on subsequent poll intervals. Would you be up for enabling debug logs by adding this service snippet to your config?

service:
  telemetry:
    logs:
      level: debug

I would like to see if it's still getting rediscovered after deletion and after 2 polls.
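One rough way to check the captured debug output is to filter it for lines that contain both the "discovered log group" message and the deleted group's name. The Go sketch below does plain substring matching and assumes nothing else about the log line layout; `rediscoveries` is a hypothetical helper, and the sample lines in `main` are made up for illustration rather than real collector output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// rediscoveries returns the log lines that contain both the debug message
// "discovered log group" and the given group name. It uses substring
// matching only and does not parse the log format.
func rediscoveries(logs string, group string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "discovered log group") && strings.Contains(line, group) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Made-up sample lines; real collector output will look different.
	sample := `debug discovered log group {"log group": "/aws/rds/instance/kept/sql"}
debug discovered log group {"log group": "/aws/rds/instance/test-log-group-instance/sql"}
error unable to retrieve logs from cloudwatch {"log group": "/aws/rds/instance/test-log-group-instance/sql"}`

	deleted := "/aws/rds/instance/test-log-group-instance/sql"
	for _, line := range rediscoveries(sample, deleted) {
		fmt.Println(line)
	}
}
```

Any matching line timestamped after the deletion would suggest the group is still being rediscovered.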

I'd expect a debug-level log message reading "discovered log group" that names the deleted log group. We may need to add special handling for that error, but I'd rather avoid that if possible.