influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Inputs.internal plugin internal_gather reports error when Inputs.ping receives ping error = 2 #14309

Closed. St3f1n closed this issue 11 months ago

St3f1n commented 11 months ago

Relevant telegraf.conf

[global_tags]

[agent]
# The agent table configures Telegraf and the defaults used across all plugins.
  # interval: Default data collection interval for all inputs.
  interval = "2s"
  # round_interval: Rounds collection interval to interval. For example, if interval is set to 10s then always collect on :00, :10, :20, etc.
  round_interval = true
  # metric_batch_size: Telegraf will send metrics to output in batch of at most metric_batch_size metrics. This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 10000
  # metric_buffer_limit: Maximum number of unwritten metrics per output. Telegraf will cache up to metric_buffer_limit metrics for each output and will flush this buffer on a successful write. This should be a multiple of metric_batch_size and cannot be less than 2 times metric_batch_size.
  # Increasing this value allows for longer periods of output downtime without dropping metrics at the cost of higher maximum memory usage.
  metric_buffer_limit = 100000
  # collection_jitter: Collection jitter is used to jitter the collection by a random amount. Each plugin will sleep for a random time within jitter before collecting. This can be used to avoid many plugins querying things like sysfs at the same time, which can have a measurable effect on the system.
  collection_jitter = "1s"
  # flush_interval: Default data flushing interval for all outputs. You should not set this below interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "2s"
  # flush_jitter: Jitter the flush interval by a random amount. This is primarily to avoid large write spikes for users running a large number of Telegraf instances. For example, a flush_jitter of 5s and flush_interval of 10s means flushes will happen every 10-15s.
  flush_jitter = "1s"
  # precision: Collected metrics are rounded to the precision specified as an interval (integer + unit, e.g. 1ns, 1us, 1ms, or 1s). Precision will NOT be used for service inputs, such as logparser and statsd.
  precision = "1ms"
  # debug: Run Telegraf in debug mode.
  debug = false
  # quiet: Run Telegraf in quiet mode (error messages only).
  quiet = false
  # logfile: Specify the log file name. The empty string means to log to stderr. The directory has to exist in advance, otherwise no logfile gets written.
  logfile = ""
  # logtarget: Control the destination for logs. Can be one of “file”, “stderr” or, on Windows, “eventlog”. When set to “file”, the output file is determined by the “logfile” setting.
  logtarget = "file"
  # logfile_rotation_interval: Rotates logfile after the time interval specified. When set to 0 no time based rotation is performed.
  logfile_rotation_interval = 0
  # logfile_rotation_max_size: Rotates logfile when it becomes larger than the specified size. When set to 0 no size based rotation is performed.
  logfile_rotation_max_size = "100KB"
  # logfile_rotation_max_archives: Maximum number of rotated archives to keep, any older logs are deleted. If set to -1, no archives are removed.
  logfile_rotation_max_archives = 50
  # log_with_timezone: Set a timezone to use when logging or type ‘local’ for local time. Example: ‘America/Chicago’.
  # hostname: Override default hostname, if empty use os.Hostname().
  hostname = ""
  # omit_hostname: If true, do not set the host tag in the Telegraf agent.
  omit_hostname = true

###############################################################################
#                             INPUT PLUGINS                                   #
###############################################################################

[[inputs.internal]]
  interval = "5s"
  ## If true, collect telegraf memory stats.
  collect_memstats = true
  ## "alloc_bytes" is identical to "heap_alloc_bytes", so "alloc_bytes" is dropped.

  fieldpass = ["errors"]
  ## Also be aware of wrong values for "heap_alloc_bytes", which get dropped below in a processor plugin.
  [inputs.internal.tags]
    _in = "HostTelegrafProd"

[[inputs.ping]]
  interval = "5s"
  # For avoiding pinging the target hosts at the same time.
  collection_jitter = "1m"
  # fieldpass: For saving data space and only recording required fields.
  fieldpass = ["result_code"]
  # Hosts to send ping packets to.
  urls = ["OTHEREXISTINGHOST"]
  # Number of ping packets to send per interval.  Corresponds to the "-c" option of the ping command.
  count = 1
  [inputs.ping.tags]
    _in = "HostLoadsProd"

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

[[outputs.file]]
  files = ["stdout", "./output.out"]
  [outputs.file.tagpass]
    _in = ["HostTelegrafProd", "HostLoadsProd"]

Logs from Telegraf

2023-11-16T12:04:20Z I! Starting Telegraf 1.25.0
2023-11-16T12:04:20Z I! Available plugins: 210 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
2023-11-16T12:04:20Z I! Loaded inputs: internal ping
2023-11-16T12:04:20Z I! Loaded aggregators: 
2023-11-16T12:04:20Z I! Loaded processors: 
2023-11-16T12:04:20Z I! Loaded secretstores: 
2023-11-16T12:04:20Z I! Loaded outputs: file
2023-11-16T12:04:20Z I! Tags enabled: 
2023-11-16T12:04:20Z I! [agent] Config: Interval:2s, Quiet:false, Hostname:"", Flush Interval:2s
2023-11-16T12:04:57Z E! [inputs.ping] Error in plugin: Ping-Anforderung konnte Host "OTHEREXISTINGHOST " nicht finden. Überprüfen Sie den Namen, und versuchen Sie es erneut., exit status 1: OTHEREXISTINGHOST [translation: Ping request could not find host "OTHEREXISTINGHOST ". Check the name and try again.]
2023-11-16T12:05:15Z I! [agent] Hang on, flushing any cached metrics before shutdown

System info

Docker

-

Steps to reproduce

Run Telegraf with and without a network connection.

Expected behavior

The Inputs.ping shall report the ping error = 2 as is. However, the Inputs.internal shall not raise an error from the ping plugin because internal is independent from the network.

Result code 2 from inputs.ping should not be treated as an error in general, but rather as a normal result.

Actual behavior

I'd like to use inputs.internal for alerting when an "internal" error happens, and I want those alerts to be independent of network stability. Unfortunately, at the moment I get these "internal" errors whenever the network is unstable, so I have to separate the ping plugin completely from the overlaid alerting process.

Line 7 of the outfile shows the expected result_code=2. Lines 8, 12, and 16 then report errors=1 for input=ping (which I'd expect to stay 0). Note that lines 2 and 5 reported errors=0.

outfile:

  1. internal_gather,_in=HostTelegrafProd,input=internal,version=1.25.0 errors=0i 1700136290604000000
  2. internal_gather,_in=HostTelegrafProd,input=ping,version=1.25.0 errors=0i 1700136290604000000
  3. internal_write,_in=HostTelegrafProd,output=file,version=1.25.0 errors=0i 1700136290604000000
  4. internal_gather,_in=HostTelegrafProd,input=internal,version=1.25.0 errors=0i 1700136295438000000
  5. internal_gather,_in=HostTelegrafProd,input=ping,version=1.25.0 errors=0i 1700136295438000000
  6. internal_write,_in=HostTelegrafProd,output=file,version=1.25.0 errors=0i 1700136295438000000
  7. ping,_in=HostLoadsProd,url=CH1PTSPCM1 result_code=2i 1700136297364000000
  8. internal_gather,_in=HostTelegrafProd,input=ping,version=1.25.0 errors=1i 1700136300860000000
  9. internal_write,_in=HostTelegrafProd,output=file,version=1.25.0 errors=0i 1700136300860000000
  10. internal_gather,_in=HostTelegrafProd,input=internal,version=1.25.0 errors=0i 1700136300860000000
  11. internal_gather,_in=HostTelegrafProd,input=internal,version=1.25.0 errors=0i 1700136305032000000
  12. internal_gather,_in=HostTelegrafProd,input=ping,version=1.25.0 errors=1i 1700136305032000000
  13. internal_write,_in=HostTelegrafProd,output=file,version=1.25.0 errors=0i 1700136305032000000
  14. internal_write,_in=HostTelegrafProd,output=file,version=1.25.0 errors=0i 1700136310986000000
  15. internal_gather,_in=HostTelegrafProd,input=internal,version=1.25.0 errors=0i 1700136310986000000
  16. internal_gather,_in=HostTelegrafProd,input=ping,version=1.25.0 errors=1i 1700136310986000000
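
For alerting on this output without being triggered by the ping input, one option is to filter on the consumer side. The following is a minimal Python sketch, not part of Telegraf: the helper names `parse_line` and `new_gather_errors` and the `ignore_inputs` parameter are hypothetical. In the sample, errors stays at 1 on lines 8, 12, and 16 although the log shows a single ping failure, which suggests the field is a running total; the sketch assumes that and reports only increases. The parser handles only simple records like the ones above (no escaped characters, integer fields).

```python
# Sample records copied from the outfile above, with the list numbering removed.
SAMPLE = """\
internal_gather,_in=HostTelegrafProd,input=ping,version=1.25.0 errors=0i 1700136295438000000
ping,_in=HostLoadsProd,url=CH1PTSPCM1 result_code=2i 1700136297364000000
internal_gather,_in=HostTelegrafProd,input=ping,version=1.25.0 errors=1i 1700136300860000000
internal_gather,_in=HostTelegrafProd,input=internal,version=1.25.0 errors=0i 1700136300860000000
internal_gather,_in=HostTelegrafProd,input=ping,version=1.25.0 errors=1i 1700136305032000000
"""


def parse_line(line):
    """Split one simple line-protocol record into (measurement, tags, fields, ts).

    Assumes no escaped spaces or commas and integer-only fields, as in the sample.
    """
    head, field_part, ts = line.strip().split(" ")
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    fields = {}
    for pair in field_part.split(","):
        key, value = pair.split("=", 1)
        fields[key] = int(value.rstrip("i"))  # "1i" -> 1
    return measurement, tags, fields, int(ts)


def new_gather_errors(lines, ignore_inputs=()):
    """Return (timestamp, input, increase) for each rise of internal_gather errors.

    Inputs named in ignore_inputs (hypothetical parameter) are skipped entirely.
    """
    last_seen = {}
    events = []
    for line in lines:
        if not line.strip():
            continue
        measurement, tags, fields, ts = parse_line(line)
        if measurement != "internal_gather" or "errors" not in fields:
            continue
        name = tags.get("input")
        if name in ignore_inputs:
            continue
        current = fields["errors"]
        previous = last_seen.get(name, 0)
        if current > previous:
            events.append((ts, name, current - previous))
        last_seen[name] = current
    return events


events = new_gather_errors(SAMPLE.splitlines())
events_without_ping = new_gather_errors(SAMPLE.splitlines(), ignore_inputs=("ping",))
```

With the sample above, `events` holds one entry for the ping input and `events_without_ping` is empty. A counter that restarts from 0 after a Telegraf restart is not handled specially here.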

Additional info

powersj commented 11 months ago

However, the Inputs.internal shall not raise an error from the ping plugin because internal is independent from the network.

The errors reported by the internal plugin are any and all errors from a plugin. It does not differentiate between network errors, retryable errors, or any other types. This is not something we would change at this time as it would require specifying different types of errors across the entire codebase.
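
To illustrate the behavior described here, the following toy Python model (not Telegraf's actual code; `ToyAccumulator` and `add_error` are made-up names) keeps one error tally per plugin and increments it for every reported error, whatever the cause.

```python
from collections import Counter


class ToyAccumulator:
    """Hypothetical stand-in for a per-plugin error tally; not Telegraf's code."""

    def __init__(self, plugin, stats):
        self.plugin = plugin
        self.stats = stats

    def add_error(self, err):
        # Every error bumps the same counter; the error's type is not inspected.
        self.stats[self.plugin] += 1


stats = Counter()
acc = ToyAccumulator("ping", stats)
acc.add_error(OSError("host not found"))      # network-related failure
acc.add_error(ValueError("bad config value"))  # unrelated failure
```

In this model a network failure and a configuration failure are indistinguishable in `stats["ping"]`; telling them apart would require every call site to classify its errors, which is the codebase-wide change the comment refers to.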

St3f1n commented 11 months ago

My point of view is that when the ping plugin gets result code 2, this should not be treated as an error in general, but rather as a normal result. This applies especially to the internal_gather stats.