influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

The panicRecover func cannot capture the panics during plugin executions. #14826

Closed zmyzheng closed 7 months ago

zmyzheng commented 7 months ago

Relevant telegraf.conf

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply surround
# them with ${}. For strings the variable must be within quotes (ie, "${STR_VAR}"),
# for numbers and booleans they should be plain (ie, ${INT_VAR}, ${BOOL_VAR})

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Collection offset is used to shift the collection by the given amount.
  ## This can be used to avoid many plugins querying constrained devices
  ## at the same time by manually scheduling them in time.
  # collection_offset = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Collected metrics are rounded to the precision specified. Precision is
  ## specified as an interval with an integer + unit (e.g. 0s, 10ms, 2us, 4s).
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  ##
  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s:
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ##
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  precision = "0s"

  ## Log at debug level.
  # debug = false
  ## Log only error level messages.
  # quiet = false

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  # logtarget = "file"

  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0h"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

  ## Method of translating SNMP objects. Can be "netsnmp" (deprecated) which
  ## translates by calling external programs snmptranslate and snmptable,
  ## or "gosmi" which translates using the built-in gosmi library.
  # snmp_translator = "netsnmp"

  ## Name of the file to load the state of plugins from and store the state to.
  ## If uncommented and not empty, this file will be used to save the state of
  ## stateful plugins on termination of Telegraf. If the file exists on start,
  ## the state in the file will be restored for the plugins.
  # statefile = ""

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  ## The full HTTP or UDP URL for your InfluxDB instance.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  # urls = ["unix:///var/run/influxdb.sock"]
  # urls = ["udp://127.0.0.1:8089"]
  # urls = ["http://127.0.0.1:8086"]

  ## The target database for metrics; will be created as needed.
  ## For UDP url endpoint database needs to be configured on server side.
  # database = "telegraf"

  ## The value of this tag will be used to determine the database.  If this
  ## tag is not set the 'database' option is used as the default.
  # database_tag = ""

  ## If true, the 'database_tag' will not be included in the written metric.
  # exclude_database_tag = false

  ## If true, no CREATE DATABASE queries will be sent.  Set to true when using
  ## Telegraf with a user without permissions to create databases or when the
  ## database already exists.
  # skip_database_creation = false

  ## Name of existing retention policy to write to.  Empty string writes to
  ## the default retention policy.  Only takes effect when using HTTP.
  # retention_policy = ""

  ## The value of this tag will be used to determine the retention policy.  If this
  ## tag is not set the 'retention_policy' option is used as the default.
  # retention_policy_tag = ""

  ## If true, the 'retention_policy_tag' will not be included in the written metric.
  # exclude_retention_policy_tag = false

  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all".
  ## Only takes effect when using HTTP.
  # write_consistency = "any"

  ## Timeout for HTTP messages.
  # timeout = "5s"

  ## HTTP Basic Auth
  # username = "telegraf"
  # password = "metricsmetricsmetricsmetrics"

  ## HTTP User-Agent
  # user_agent = "telegraf"

  ## UDP payload size is the maximum packet size to send.
  # udp_payload = "512B"

  ## Optional TLS Config for use on HTTP connections.
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false

  ## HTTP Proxy override; if unset, the standard proxy environment
  ## variables are consulted to determine which proxy, if any, should be used.
  # http_proxy = "http://corporate.proxy:3128"

  ## Additional HTTP headers
  # http_headers = {"X-Special-Header" = "Special-Value"}

  ## HTTP Content-Encoding for write request body, can be set to "gzip" to
  ## compress body or "identity" to apply no encoding.
  # content_encoding = "gzip"

  ## When true, Telegraf will output unsigned integers as unsigned values,
  ## i.e.: "42u".  You will need a version of InfluxDB supporting unsigned
  ## integer values.  Enabling this option will result in field type errors if
  ## existing data has been written.
  # influx_uint_support = false

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  ## NOTE: The resulting 'time_active' field INCLUDES 'iowait'!
  report_active = false
  ## If true and the info is available then add core_id and physical_id tags
  core_tags = false

Logs from Telegraf

2024-02-15T00:23:13Z I! Loading config: telegraf.conf
2024-02-15T00:23:13Z I! Starting Telegraf 1.30.0-ab2e6936 brought to you by InfluxData the makers of InfluxDB
2024-02-15T00:23:13Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 61 outputs, 5 secret-stores
2024-02-15T00:23:13Z I! Loaded inputs: cpu
2024-02-15T00:23:13Z I! Loaded aggregators: 
2024-02-15T00:23:13Z I! Loaded processors: 
2024-02-15T00:23:13Z I! Loaded secretstores: 
2024-02-15T00:23:13Z I! Loaded outputs: influxdb
2024-02-15T00:23:13Z I! Tags enabled: host=Mingyangs-MacBook-Pro.local
2024-02-15T00:23:13Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"Mingyangs-MacBook-Pro.local", Flush Interval:10s
2024-02-15T00:23:13Z W! [outputs.influxdb] When writing to [http://localhost:8086]: database "telegraf" creation failed: Post "http://localhost:8086/query": dial tcp [::1]:8086: connect: connection refused
panic: genrate panic

goroutine 81 [running]:
github.com/influxdata/telegraf/plugins/inputs/cpu.(*CPUStats).Gather(0x14000ebc768?, {0x1026dd414?, 0x1400257c5a8?})
        /Users/mizhe/Work/telegraf/plugins/inputs/cpu/cpu.go:41 +0x2c
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0x140010a6f60, {0x10aef1750, 0x14000f71ee0})
        /Users/mizhe/Work/telegraf/models/running_input.go:149 +0x54
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
        /Users/mizhe/Work/telegraf/agent/agent.go:584 +0x30
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce in goroutine 11
        /Users/mizhe/Work/telegraf/agent/agent.go:583 +0xd8

System info

Telegraf 1.29.2, MacBook

Docker

No response

Steps to reproduce

  1. Add a panic() inside the Gather method of any plugin
  2. Build and run Telegraf with that plugin enabled
  3. Check the stdout, stderr and log files ...

Expected behavior

The log file should contain the following error messages:

E! FATAL: [cpu] panicked: %s, Stack:...
E! PLEASE REPORT THIS PANIC ON GITHUB with stack trace, configuration, and OS information: https://github.com/influxdata/telegraf/issues/new/choose

Actual behavior

panic: genrate panic

goroutine 81 [running]:
github.com/influxdata/telegraf/plugins/inputs/cpu.(*CPUStats).Gather(0x14000ebc768?, {0x1026dd414?, 0x1400257c5a8?})
        /Users/mizhe/Work/telegraf/plugins/inputs/cpu/cpu.go:41 +0x2c
github.com/influxdata/telegraf/models.(*RunningInput).Gather(0x140010a6f60, {0x10aef1750, 0x14000f71ee0})
        /Users/mizhe/Work/telegraf/models/running_input.go:149 +0x54
github.com/influxdata/telegraf/agent.(*Agent).gatherOnce.func1()
        /Users/mizhe/Work/telegraf/agent/agent.go:584 +0x30
created by github.com/influxdata/telegraf/agent.(*Agent).gatherOnce in goroutine 11
        /Users/mizhe/Work/telegraf/agent/agent.go:583 +0xd8

Additional info

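panicRecover(input) cannot capture a panic raised inside a plugin's Gather method, because Gather() is called in a separate goroutine (the one created in gatherOnce at agent/agent.go:583 in the stack trace above). In Go, a deferred recover only catches panics from its own goroutine, so the panic escapes and crashes the process. I think the panicRecover(input) call should be moved inside that goroutine, just before done <- input.Gather(acc).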
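
To illustrate the proposed placement, here is a minimal standalone sketch. It is not Telegraf's code: `gatherer`, `recoverInto`, and `gatherOnceFixed` are hypothetical names made up for this example, and the recovered panic is turned into an error rather than logged the way Telegraf does. The only point is that the deferred recovery runs inside the same goroutine that calls the gather function.

```go
package main

import "fmt"

// gatherer is a hypothetical stand-in for a plugin's Gather method.
type gatherer func() error

// recoverInto is a hypothetical helper. It must be deferred directly in the
// goroutine that may panic; recover() has no effect on panics raised in
// other goroutines.
func recoverInto(name string, done chan<- error) {
	if r := recover(); r != nil {
		done <- fmt.Errorf("[%s] panicked: %v", name, r)
	}
}

// gatherOnceFixed runs g in its own goroutine and defers the recovery
// inside that goroutine, just before the call to g.
func gatherOnceFixed(name string, g gatherer) error {
	// Buffered so the goroutine never blocks on send.
	done := make(chan error, 1)
	go func() {
		defer recoverInto(name, done)
		done <- g()
	}()
	return <-done
}

func main() {
	err := gatherOnceFixed("cpu", func() error {
		panic("generated panic")
	})
	fmt.Println(err) // prints: [cpu] panicked: generated panic
}
```

If the `defer recoverInto(...)` line were placed in `gatherOnceFixed` itself, outside the `go func()`, the panic would not be recovered and the program would exit with a stack trace like the one under "Actual behavior".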

powersj commented 7 months ago

@zmyzheng,

Thanks for the report. Are you willing to put up a PR with the change?

zmyzheng commented 7 months ago

Sure @powersj, I'd like to!

zmyzheng commented 7 months ago

Hi @powersj, could you please review my PR #14840?