influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

processors.aws_ec2: operation error ec2imds: GetInstanceIdentityDocument, canceled #14064

Closed danielmotaleite closed 1 year ago

danielmotaleite commented 1 year ago

Relevant telegraf.conf

telegraf.conf: |
    [global_tags]
      cluster = "cluster00-ew1-staging-cr"
      environment = "staging"
    [agent]
      collection_jitter = "0s"
      debug = true
      flush_interval = "10s"
      flush_jitter = "0s"
      hostname = "$HOSTNAME"
      interval = "10s"
      logfile = ""
      metric_batch_size = 1000
      metric_buffer_limit = 10000
      omit_hostname = false
      precision = ""
      quiet = false
      round_interval = true
    [[processors.aws_ec2]]
      imds_tags = [
        "availabilityZone"
      ]
      ordered = false
      timeout = "60s"

Logs from Telegraf

2023-10-06T21:22:20Z I! Loading config: /etc/telegraf/telegraf.conf
2023-10-06T21:22:20Z I! Starting Telegraf 1.28.2 brought to you by InfluxData the makers of InfluxDB
2023-10-06T21:22:20Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-10-06T21:22:20Z I! Loaded inputs: conntrack cpu disk diskio exec kernel kubernetes linux_sysctl_fs mem net nstat processes swap system
2023-10-06T21:22:20Z I! Loaded aggregators: 
2023-10-06T21:22:20Z I! Loaded processors: aws_ec2
2023-10-06T21:22:20Z I! Loaded secretstores: 
2023-10-06T21:22:20Z I! Loaded outputs: prometheus_client
2023-10-06T21:22:20Z I! Tags enabled: cluster=cluster00-ew1-staging-cr environment=staging host=ip-10-109-105-69.eu-west-1.compute.internal 
2023-10-06T21:22:20Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"ip-10-109-105-69.eu-west-1.compute.internal", Flush Interval:10s
2023-10-06T21:22:20Z D! [agent] Initializing plugins
2023-10-06T21:22:20Z D! [processors.aws_ec2] Initializing AWS EC2 Processor
2023-10-06T21:22:20Z D! [processors.aws_ec2] Initializing AWS EC2 Processor
2023-10-06T21:22:20Z D! [agent] Connecting outputs
2023-10-06T21:22:20Z D! [agent] Attempting connection to [outputs.prometheus_client]
2023-10-06T21:22:20Z I! [outputs.prometheus_client] Listening on http://[::]:9009/metrics
2023-10-06T21:22:20Z D! [agent] Successfully connected to outputs.prometheus_client
2023-10-06T21:22:20Z D! [processors.aws_ec2] cache: size=1000
2023-10-06T21:22:25Z E! [telegraf] Error running agent: starting processor processors.aws_ec2: failed getting instance identity document: operation error ec2imds: GetInstanceIdentityDocument, canceled, context deadline exceeded

System info

telegraf 1.28.2

Docker

using the telegraf-ds helm v1.1.17

Steps to reproduce

  1. Create an AWS EKS cluster.
  2. Deploy telegraf-ds with the aws_ec2 processor.
  3. Check the logs: Telegraf fails to start.

Expected behavior

Get the metadata, just like the command line can do:

TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") && curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/dynamic/instance-identity/document
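
For comparison, the same two-step IMDSv2 exchange as the curl command above can be sketched in Go with only the standard library. This is an illustration, not the code the aws_ec2 processor runs: the helper names (tokenRequest, documentRequest, do, fetchIdentityDocument) and the 5-second overall deadline are assumptions made for this example. The endpoint, paths, and headers are taken from the curl command.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"strconv"
	"time"
)

// imdsBase is the link-local metadata endpoint used in the curl command.
const imdsBase = "http://169.254.169.254"

// tokenRequest builds the IMDSv2 session-token request (the PUT in the curl command).
// Hypothetical helper for this sketch.
func tokenRequest(base string, ttlSeconds int) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPut, base+"/latest/api/token", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-aws-ec2-metadata-token-ttl-seconds", strconv.Itoa(ttlSeconds))
	return req, nil
}

// documentRequest builds the identity-document request that carries the token.
// Hypothetical helper for this sketch.
func documentRequest(base, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, base+"/latest/dynamic/instance-identity/document", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-aws-ec2-metadata-token", token)
	return req, nil
}

// do sends req bound to ctx and returns the body of a 200 response.
func do(ctx context.Context, c *http.Client, req *http.Request) (string, error) {
	resp, err := c.Do(req.WithContext(ctx))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("%s %s: unexpected status %s", req.Method, req.URL, resp.Status)
	}
	return string(body), nil
}

// fetchIdentityDocument runs both steps under the single deadline carried by ctx.
func fetchIdentityDocument(ctx context.Context, c *http.Client, base string) (string, error) {
	treq, err := tokenRequest(base, 21600)
	if err != nil {
		return "", err
	}
	token, err := do(ctx, c, treq)
	if err != nil {
		return "", fmt.Errorf("getting token: %w", err)
	}
	dreq, err := documentRequest(base, token)
	if err != nil {
		return "", err
	}
	return do(ctx, c, dreq)
}

func main() {
	// The 5s deadline is an assumption chosen for this example.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	doc, err := fetchIdentityDocument(ctx, http.DefaultClient, imdsBase)
	if err != nil {
		fmt.Println("IMDS request failed:", err)
		return
	}
	fmt.Println(doc)
}
```

Running this from inside the same pod as Telegraf, rather than from the node, would show whether the metadata endpoint is reachable from the container's network namespace.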

Actual behavior

Telegraf logs the error above and fails to start.

Additional info

Maybe Telegraf is still using IMDSv1 access (without any token) and needs to be updated to IMDSv2?

powersj commented 1 year ago

Hi,

This processor has active users and recent contributors, so if the problem were widespread I would have expected more reports.

Is this happening 100% of the time? When you use curl it works?

2023-10-06T21:22:20Z D! [processors.aws_ec2] Initializing AWS EC2 Processor
2023-10-06T21:22:20Z D! [processors.aws_ec2] Initializing AWS EC2 Processor

Why did this print twice? Do you have multiple instances of the processor?

GetInstanceIdentityDocument, canceled, context deadline exceeded

This error comes right after the attempt to get the identity document, which is a direct call into the AWS Go library. We pass a background context, so we set no specific timeout. An operation error like this effectively means the library itself could not complete the request. Context deadline messages of this kind can indicate an issue with networking, DNS, etc. There is not even an HTTP status code at this point.
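
To illustrate why there is no HTTP status code in this situation, here is a minimal Go sketch using only the standard library. It does not use the AWS library or Telegraf code. The names (probe, isDeadline, probeSilentServer) and the 200 ms timeout are made up for the example, and a local listener that never answers stands in for an unreachable metadata endpoint. The request ends with a context deadline error before any response exists, so there is no status to report.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// demoTimeout is an arbitrary short deadline chosen for this example.
const demoTimeout = 200 * time.Millisecond

// probe issues a GET bounded by timeout. It returns the HTTP status code,
// or 0 together with an error when no response was received.
func probe(url string, timeout time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// isDeadline reports whether err was caused by an expired context deadline.
func isDeadline(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// probeSilentServer probes a local listener that never calls Accept, so the
// TCP connection can be established but no HTTP response is ever sent.
func probeSilentServer() (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()
	return probe("http://"+ln.Addr().String(), demoTimeout)
}

func main() {
	status, err := probeSilentServer()
	fmt.Println("status:", status)
	fmt.Println("error:", err)
	fmt.Println("deadline exceeded:", isDeadline(err))
}
```

Note that in the log above the failure arrives about five seconds after startup (21:22:20 to 21:22:25), which suggests some deadline is being applied somewhere below Telegraf's background context. Where that deadline comes from is not established in this thread.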

telegraf-tiger[bot] commented 1 year ago

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!