influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

TLS handshake timeout to AWS timestream with IAM Role #15874

Closed. choseh closed this issue 1 month ago

choseh commented 1 month ago

Relevant telegraf.conf

[global_tags]
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = true
[[inputs.cpu]]
  percpu = false
  totalcpu = true
  collect_cpu_time = false
  report_active = true
[[inputs.kernel]]
  interval = "5m"
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs", "vfat"]
[[inputs.mem]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.swap]]
  fielddrop = ["in","out"]
[[inputs.system]]
  fielddrop = ["uptime","uptime_format"]
[[processors.aws_ec2]]
  imds_tags = ["instanceId", "imageId", "instanceType"]
  ec2_tags = ["Name"]
  timeout = "10s"
  ordered = false
  max_parallel_calls = 10
  tag_cache_size = 1000
[[outputs.timestream]]
  region = "eu-central-1"
  database_name = "telegraf"
  describe_database_on_start = true
  mapping_mode = "multi-table"
  create_table_if_not_exists = true
  create_table_magnetic_store_retention_period_in_days = 365
  create_table_memory_store_retention_period_in_hours = 24
  use_multi_measure_records = true
  measure_name_for_multi_measure_records = "t"

Logs from Telegraf

Sep 12 05:36:06 hostname systemd[1]: Starting telegraf.service - Telegraf...
Sep 12 05:36:07 hostname telegraf[6909]: time="2024-09-12T05:36:07Z" level=warning msg="DBUS_SESSION_BUS_ADDRESS envvar looks to be not set, this can lead to runaway dbus-daemon processes. To avoid this, set envvar DBUS_SESSION_BUS_ADDRESS=$XDG_RUNTIME_DIR/bus (if it exists) or DBUS_SESSION_BUS_ADDRESS=/dev/null." func="gosnowflake.(*defaultLogger).Warn" file="log.go:244"
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Loading config: /etc/telegraf/telegraf.conf
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z W! DeprecationWarning: Option "fielddrop" of plugin "inputs.swap" deprecated since version 1.29.0 and will be removed in 1.40.0: use 'fieldexclude' instead
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z W! DeprecationWarning: Option "fielddrop" of plugin "inputs.system" deprecated since version 1.29.0 and will be removed in 1.40.0: use 'fieldexclude' instead
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Loading config: /etc/telegraf/telegraf.d/config.conf
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Starting Telegraf 1.32.0 brought to you by InfluxData the makers of InfluxDB
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Available plugins: 235 inputs, 9 aggregators, 32 processors, 26 parsers, 62 outputs, 6 secret-stores
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Loaded inputs: cpu disk exec (6x) kernel mem netstat processes swap system
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Loaded aggregators:
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Loaded processors: aws_ec2
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Loaded secretstores:
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Loaded outputs: timestream
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! Tags enabled:
Sep 12 05:36:07 hostname systemd[1]: Started telegraf.service - Telegraf.
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"", Flush Interval:10s
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! [outputs.timestream] Constructing Timestream client for "multi-table" mode
Sep 12 05:36:07 hostname telegraf[6909]: 2024-09-12T05:36:07Z I! [outputs.timestream] Describing database "telegraf" in region "eu-central-1"
Sep 12 05:36:39 hostname telegraf[6909]: 2024-09-12T05:36:39Z E! [outputs.timestream] Couldn't describe database "telegraf". Check error, fix permissions, connectivity, create database.
Sep 12 05:36:39 hostname telegraf[6909]: 2024-09-12T05:36:39Z E! [agent] Failed to connect to [outputs.timestream], retrying in 15s, error was "operation error Timestream Write: DescribeDatabase, operation error Timestream Write: DescribeEndpoints, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post \"https://ingest.timestream.eu-central-1.amazonaws.com/\": net/http: TLS handshake timeout"
Sep 12 05:36:54 hostname telegraf[6909]: 2024-09-12T05:36:54Z I! [outputs.timestream] Constructing Timestream client for "multi-table" mode
Sep 12 05:36:54 hostname telegraf[6909]: 2024-09-12T05:36:54Z I! [outputs.timestream] Describing database "telegraf" in region "eu-central-1"
Sep 12 05:37:28 hostname telegraf[6909]: 2024-09-12T05:37:28Z E! [outputs.timestream] Couldn't describe database "telegraf". Check error, fix permissions, connectivity, create database.
Sep 12 05:37:28 hostname telegraf[6909]: 2024-09-12T05:37:28Z E! [telegraf] Error running agent: connecting output outputs.timestream: error connecting to output "outputs.timestream": operation error Timestream Write: DescribeDatabase, operation error Timestream Write: DescribeEndpoints, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://ingest.timestream.eu-central-1.amazonaws.com/": net/http: TLS handshake timeout
Sep 12 05:37:28 hostname systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE
Sep 12 05:37:28 hostname systemd[1]: telegraf.service: Failed with result 'exit-code'.

System info

Telegraf 1.32.0

Docker

No response

Steps to reproduce

  1. Start Telegraf 1.32.0 with the config above
  2. Check the logs
  3. ...

Expected behavior

Telegraf connects to Timestream and writes metrics.

Actual behavior

Telegraf does not connect to Timestream. The DescribeEndpoints request fails with "net/http: TLS handshake timeout" and the agent exits.

Additional info

We're using IAM roles to permit access to the Timestream database. Telegraf 1.31 works with the same setup.

choseh commented 1 month ago

Apparently this is related to SNI and Network Firewall. 1.32 might have changed something in the request, so that it no longer sends the necessary information (?)

choseh commented 1 month ago

We found it: https://github.com/hashicorp/terraform-provider-aws/issues/39311 describes a similar issue, which basically comes down to Golang in combination with Network Firewall. We have to set GODEBUG=tlskyber=0 to make it work again, so it's actually a Golang issue.