influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.69k stars 5.59k forks source link

metric_buffer_limit and metric_batch_size not considered after update to 1.26 #13121

Closed mhoyer closed 1 year ago

mhoyer commented 1 year ago

Relevant telegraf.conf

# cat /etc/telegraf/telegraf.conf 
[agent]
  interval = "10s"
  round_interval = true

  # Telegraf will send metrics to output in batch of at most metric_batch_size metrics.
  metric_batch_size = 200

  # Telegraf will cache metric_buffer_limit metrics for each output, and will flush
  # this buffer on a successful write. This should be a multiple of metric_batch_size
  # and could not be less than 2 times metric_batch_size.
  metric_buffer_limit = 500 # <<< explicitly lowered to 500 to 

  # Collection jitter is used to jitter the collection by a random amount.
  # Each plugin will sleep for a random time within jitter before collecting.
  # This can be used to avoid many plugins querying things like sysfs at the same time,
  # which can have a measurable effect on the system.
  collection_jitter = "1s"

  # Default data flushing interval for all outputs. You should not set this below interval.
  # Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "20s"

  # Jitter the flush interval by a random amount. This is primarily to avoid large write spikes
  # for users running a large number of Telegraf instances. For example, a flush_jitter of 5s
  # and flush_interval of 10s means flushes will happen every 10-15s.
  flush_jitter = "1s"

  # Collected metrics are rounded to the precision specified as an interval
  # (integer + unit, ex: 1ns, 1us, 1ms, and 1s).
  precision = ""

  # Run Telegraf in debug mode.
  debug = true

  # Run Telegraf in quiet mode (error messages only).
  quiet = false

  # Controls the destination for logs and can be set to "file", "stderr", or, on Windows, "eventlog".
  # When set to "file", the output file is determined by the logfile setting.
  logtarget = "stderr"

  # Override default hostname, if empty use os.Hostname().
  hostname = ""

  # If true, do no set the host tag in the Telegraf agent.
  omit_hostname = false

[global_tags]
  k8s_cluster = "playground-dev"

Logs from Telegraf

2023-04-21T07:24:16Z I! Loading config file: /etc/telegraf/input.influxdb_v2_listener.conf
2023-04-21T07:24:16Z I! Loading config file: /etc/telegraf/input.internal.conf
2023-04-21T07:24:16Z I! Loading config file: /etc/telegraf/input.kube_inventory.conf
2023-04-21T07:24:16Z I! Loading config file: /etc/telegraf/input.kubernetes.conf
2023-04-21T07:24:16Z I! Loading config file: /etc/telegraf/input.prometheus.conf
2023-04-21T07:24:16Z I! Loading config file: /etc/telegraf/output.influxdb_v2.conf
2023-04-21T07:24:16Z I! Loading config file: /etc/telegraf/telegraf.conf
2023-04-21T07:24:16Z I! Starting Telegraf 1.26.1
2023-04-21T07:24:16Z I! Available plugins: 235 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
2023-04-21T07:24:16Z I! Loaded inputs: influxdb_v2_listener internal kube_inventory kubernetes prometheus (8x)
2023-04-21T07:24:16Z I! Loaded aggregators:
2023-04-21T07:24:16Z I! Loaded processors:
2023-04-21T07:24:16Z I! Loaded secretstores:
2023-04-21T07:24:16Z I! Loaded outputs: influxdb_v2 (2x)
2023-04-21T07:24:16Z I! Tags enabled: host=affinity-telegraf-5dfcd55fc6-q2w6x k8s_cluster=playground-dev
2023-04-21T07:24:16Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"affinity-telegraf-5dfcd55fc6-q2w6x", Flush Interval:20s
2023-04-21T07:24:16Z D! [agent] Initializing plugins
2023-04-21T07:24:16Z D! [agent] Connecting outputs
2023-04-21T07:24:16Z D! [agent] Attempting connection to [outputs.influxdb_v2::out.iflx_v2@k8s-cluster-info]
2023-04-21T07:24:16Z D! [agent] Successfully connected to outputs.influxdb_v2::out.iflx_v2@k8s-cluster-info
2023-04-21T07:24:16Z D! [agent] Attempting connection to [outputs.influxdb_v2::out.iflx_v2@k8s-playground]
2023-04-21T07:24:16Z D! [agent] Successfully connected to outputs.influxdb_v2::out.iflx_v2@k8s-playground
2023-04-21T07:24:16Z D! [agent] Starting service inputs
2023-04-21T07:24:16Z I! [inputs.influxdb_v2_listener::in.influxdb_v2_listener] Started HTTP listener service on :8086
2023-04-21T07:24:17Z D! [inputs.prometheus::in.prometheus-v2@default] will scrape metrics from "http://10.177.104.5:8080/metrics"
2023-04-21T07:24:17Z D! [inputs.prometheus::in.prometheus-v1@default] will scrape metrics from "http://10.177.104.5:8080/metrics"
2023-04-21T07:24:20Z D! [outputs.influxdb_v2::out.iflx_v2@k8s-playground] Wrote batch of 1000 metrics in 107.284907ms
2023-04-21T07:24:20Z D! [outputs.influxdb_v2::out.iflx_v2@k8s-playground] Buffer fullness: 148 / 10000 metrics
...

System info

Telegraf 1.26.1, Alpine Linux v3.17, in Kubernetes cluster (TKGI

Docker

We host it inside a kubernetes cluster v1.23.16 as custom helm chart. If really needed I can (but have to create a sample) Dockerfile.

Steps to reproduce

  1. Configure different values for metric_buffer_limit and metric_batch_size in telegraf.conf
  2. Enable debug = true
  3. Start telegraf and check the logs, which still writes:
    ... Wrote batch of 1000 metrics in ###ms
    ... Buffer fullness: ### / 10000 metrics

Expected behavior

Apply the metric_buffer related settings I configured in /etc/telegraf/telegraf.conf

Actual behavior

At least for those two configuration values the Telegraf instance is falling back to the default values:

Additional info

As you can see in the provided telegraf.conf above we have custom settings for metric_batch_size = 200 and metric_buffer_limit = 500, but in the logs we see the default values 1000 for batch size and 10000 for the buffer limit respectively.

The problem (seeing the ### metrics have been dropped warnings in the logs) suddenly started after updating Telegraf pods from 1.25.3 to 1.26.1 w/o any changes to the configuration files.

I also opened a terminal inside the Telegraf Pod and double checked the following:

/ #  ls -la /etc/telegraf/
total 12
drwxrwxrwx    3 root     root          4096 Apr 21 07:24 .
drwxr-xr-x    1 root     root          4096 Apr 21 07:24 ..
drwxr-xr-x    2 root     root          4096 Apr 21 07:24 ..2023_04_21_07_24_12.130811286
lrwxrwxrwx    1 root     root            31 Apr 21 07:24 ..data -> ..2023_04_21_07_24_12.130811286
lrwxrwxrwx    1 root     root            38 Apr 21 07:24 input.influxdb_v2_listener.conf -> ..data/input.influxdb_v2_listener.conf
lrwxrwxrwx    1 root     root            26 Apr 21 07:24 input.internal.conf -> ..data/input.internal.conf
lrwxrwxrwx    1 root     root            32 Apr 21 07:24 input.kube_inventory.conf -> ..data/input.kube_inventory.conf
lrwxrwxrwx    1 root     root            28 Apr 21 07:24 input.kubernetes.conf -> ..data/input.kubernetes.conf
lrwxrwxrwx    1 root     root            28 Apr 21 07:24 input.prometheus.conf -> ..data/input.prometheus.conf
lrwxrwxrwx    1 root     root            30 Apr 21 07:24 output.influxdb_v2.conf -> ..data/output.influxdb_v2.conf
lrwxrwxrwx    1 root     root            20 Apr 21 07:24 telegraf.conf -> ..data/telegraf.conf
/ # grep 'metric_buffer' /etc/telegraf/* 
/etc/telegraf/telegraf.conf:  # Telegraf will cache metric_buffer_limit metrics for each output, and will flush
/etc/telegraf/telegraf.conf:  metric_buffer_limit = 500
mhoyer commented 1 year ago

Additional info. I just tried to set the two config parameters again inside the outputs.influxdb_v2 settings which then really seem to apply as expected. Still, I wonder why it's not working when configured in the [agent] section only.

powersj commented 1 year ago

Hi,

Config files are loaded one at a time, which means that top level [agent] settings that affect plugins, like inputs or outputs, needs to be specified first.

Take the following example config files:

config.toml

[[outputs.file]]
   alias = "first"

[[inputs.exec]]
    commands = ["echo metric,tag=1 field=42"]
    data_format = "influx"

zconfig.toml

[agent]
    metric_buffer_limit = 100
    debug = true

[[inputs.exec]]
    commands = ["echo other,tag=1 field=42"]
    data_format = "influx"

[[outputs.file]]
   alias = "second"
2023-04-21T13:43:17Z I! Starting Telegraf 1.25.3
2023-04-21T13:43:17Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
2023-04-21T13:43:17Z I! Loaded inputs: exec (2x)
2023-04-21T13:43:17Z I! Loaded aggregators: 
2023-04-21T13:43:17Z I! Loaded processors: 
2023-04-21T13:43:17Z I! Loaded secretstores: 
2023-04-21T13:43:17Z I! Loaded outputs: file (2x)
2023-04-21T13:43:17Z I! Tags enabled: host=ryzen
2023-04-21T13:43:17Z D! [agent] Initializing plugins
2023-04-21T13:43:17Z D! [agent] Connecting outputs
2023-04-21T13:43:17Z D! [agent] Attempting connection to [outputs.file::first]
2023-04-21T13:43:17Z D! [agent] Successfully connected to outputs.file::first
2023-04-21T13:43:17Z D! [agent] Attempting connection to [outputs.file::second]
2023-04-21T13:43:17Z D! [agent] Successfully connected to outputs.file::second
2023-04-21T13:43:17Z D! [agent] Starting service inputs
2023-04-21T13:43:17Z D! [agent] Stopping service inputs
2023-04-21T13:43:17Z D! [agent] Input channel closed
2023-04-21T13:43:17Z I! [agent] Hang on, flushing any cached metrics before shutdown
metric,host=ryzen,tag=1 field=42 1682084598000000000
other,host=ryzen,tag=1 field=42 1682084598000000000
metric,host=ryzen,tag=1 field=42 1682084598000000000
other,host=ryzen,tag=1 field=42 1682084598000000000
2023-04-21T13:43:17Z D! [outputs.file::second] Wrote batch of 2 metrics in 15.2µs
2023-04-21T13:43:17Z D! [outputs.file::second] Buffer fullness: 0 / 100 metrics
2023-04-21T13:43:17Z D! [outputs.file::first] Wrote batch of 2 metrics in 13.1µs
2023-04-21T13:43:17Z D! [outputs.file::first] Buffer fullness: 0 / 10000 metrics
2023-04-21T13:43:17Z I! [agent] Stopping running outputs
2023-04-21T13:43:17Z D! [agent] Stopped Successfully

Note how the second file output gets the global agent metric size set to 100 as specified by the config, but the first does not. That is because when the first file is loaded, the agent settings are not known.

Also note that the above is with the version you said worked. I do see the same behavior all the way back to v1.19.0:

../telegraf-builds/telegraf-v1.19.0 --config config.toml --config zconfig.toml --once
2023-04-21T13:51:40Z I! Starting Telegraf 1.19.0
2023-04-21T13:51:40Z D! [agent] Initializing plugins
2023-04-21T13:51:40Z D! [agent] Connecting outputs
2023-04-21T13:51:40Z D! [agent] Attempting connection to [outputs.file::first]
2023-04-21T13:51:40Z D! [agent] Successfully connected to outputs.file::first
2023-04-21T13:51:40Z D! [agent] Attempting connection to [outputs.file::second]
2023-04-21T13:51:40Z D! [agent] Successfully connected to outputs.file::second
2023-04-21T13:51:40Z D! [agent] Starting service inputs
2023-04-21T13:51:40Z D! [agent] Stopping service inputs
2023-04-21T13:51:40Z D! [agent] Input channel closed
2023-04-21T13:51:40Z I! [agent] Hang on, flushing any cached metrics before shutdown
other,host=ryzen,tag=1 field=42 1682085101000000000
metric,host=ryzen,tag=1 field=42 1682085101000000000
2023-04-21T13:51:40Z D! [outputs.file::first] Wrote batch of 2 metrics in 17.94µs
2023-04-21T13:51:40Z D! [outputs.file::first] Buffer fullness: 0 / 10000 metrics
other,host=ryzen,tag=1 field=42 1682085101000000000
metric,host=ryzen,tag=1 field=42 1682085101000000000
2023-04-21T13:51:40Z D! [outputs.file::second] Wrote batch of 2 metrics in 4.47µs
2023-04-21T13:51:40Z D! [outputs.file::second] Buffer fullness: 0 / 100 metrics
2023-04-21T13:51:40Z I! [agent] Stopping running outputs
2023-04-21T13:51:40Z D! [agent] Stopped Successfully

I'm inclined to say this is working as expected for now, but I do think we can and should better document this limitation.

mhoyer commented 1 year ago

Thanks @powersj, I guess the sequence of when what config block is read was the correct hint. The missing piece (which led to my misbehavior) probably was the way how we start telegraf in our k8s pod:

telegraf --config-directory /etc/telegraf

And this is now quite reproduceable for me with a small setup using docker-compose:

version: "3"
services:
  telegraf:
    image: telegraf:1.26.1-alpine
    # image: telegraf:1.25.3-alpine
    container_name: telegraf
    volumes:
      - ./conf:/etc/telegraf
    entrypoint: telegraf --config-directory /etc/telegraf

And besides the docker-compose.yaml from above, just place those two files into a ./conf folder:

inputs.conf (same file as your config.toml):

[[outputs.file]]
   alias = "first"

[[inputs.exec]]
    commands = ["echo metric,tag=1 field=42"]
    data_format = "influx"

telegraf.conf (same file as your zconfig.toml):

[agent]
    metric_buffer_limit = 100
    debug = true

[[inputs.exec]]
    commands = ["echo other,tag=1 field=42"]
    data_format = "influx"

[[outputs.file]]
   alias = "second"

Now if I run that docker-compose.yaml with the different Telegraf versions I can reproduce it:

with 1.25.3:

telegraf    | 2023-04-21T20:24:42Z I! Using config file: /etc/telegraf/telegraf.conf
telegraf    | 2023-04-21T20:24:42Z I! Starting Telegraf 1.25.3
telegraf    | 2023-04-21T20:24:42Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
telegraf    | 2023-04-21T20:24:42Z I! Loaded inputs: exec (3x)
telegraf    | 2023-04-21T20:24:42Z I! Loaded aggregators: 
telegraf    | 2023-04-21T20:24:42Z I! Loaded processors: 
telegraf    | 2023-04-21T20:24:42Z I! Loaded secretstores: 
telegraf    | 2023-04-21T20:24:42Z I! Loaded outputs: file (3x)
telegraf    | 2023-04-21T20:24:42Z I! Tags enabled: host=5eb112530210
telegraf    | 2023-04-21T20:24:42Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"5eb112530210", Flush Interval:10s
telegraf    | 2023-04-21T20:24:42Z D! [agent] Initializing plugins
telegraf    | 2023-04-21T20:24:42Z D! [agent] Connecting outputs
telegraf    | 2023-04-21T20:24:42Z D! [agent] Attempting connection to [outputs.file::second]
telegraf    | 2023-04-21T20:24:42Z D! [agent] Successfully connected to outputs.file::second
telegraf    | 2023-04-21T20:24:42Z D! [agent] Attempting connection to [outputs.file::first]
telegraf    | 2023-04-21T20:24:42Z D! [agent] Successfully connected to outputs.file::first
telegraf    | 2023-04-21T20:24:42Z D! [agent] Attempting connection to [outputs.file::second]
telegraf    | 2023-04-21T20:24:42Z D! [agent] Successfully connected to outputs.file::second
telegraf    | 2023-04-21T20:24:42Z D! [agent] Starting service inputs
telegraf    | 2023-04-21T20:24:52Z D! [outputs.file::second] Wrote batch of 3 metrics in 76.2µs
telegraf    | 2023-04-21T20:24:52Z D! [outputs.file::second] Buffer fullness: 0 / 100 metrics
telegraf    | 2023-04-21T20:24:52Z D! [outputs.file::second] Wrote batch of 3 metrics in 13.6µs
telegraf    | 2023-04-21T20:24:52Z D! [outputs.file::second] Buffer fullness: 0 / 100 metrics
telegraf    | 2023-04-21T20:24:52Z D! [outputs.file::first] Wrote batch of 3 metrics in 13.4µs
telegraf    | 2023-04-21T20:24:52Z D! [outputs.file::first] Buffer fullness: 0 / 100 metrics

with 1.26.1:

telegraf    | 2023-04-21T20:25:45Z I! Loading config file: /etc/telegraf/inputs.conf
telegraf    | 2023-04-21T20:25:45Z I! Loading config file: /etc/telegraf/telegraf.conf
telegraf    | 2023-04-21T20:25:45Z I! Starting Telegraf 1.26.1
telegraf    | 2023-04-21T20:25:45Z I! Available plugins: 235 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
telegraf    | 2023-04-21T20:25:45Z I! Loaded inputs: exec (2x)
telegraf    | 2023-04-21T20:25:45Z I! Loaded aggregators: 
telegraf    | 2023-04-21T20:25:45Z I! Loaded processors: 
telegraf    | 2023-04-21T20:25:45Z I! Loaded secretstores: 
telegraf    | 2023-04-21T20:25:45Z I! Loaded outputs: file (2x)
telegraf    | 2023-04-21T20:25:45Z I! Tags enabled: host=dd6d242d10d3
telegraf    | 2023-04-21T20:25:45Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"dd6d242d10d3", Flush Interval:10s
telegraf    | 2023-04-21T20:25:45Z D! [agent] Initializing plugins
telegraf    | 2023-04-21T20:25:45Z D! [agent] Connecting outputs
telegraf    | 2023-04-21T20:25:45Z D! [agent] Attempting connection to [outputs.file::first]
telegraf    | 2023-04-21T20:25:45Z D! [agent] Successfully connected to outputs.file::first
telegraf    | 2023-04-21T20:25:45Z D! [agent] Attempting connection to [outputs.file::second]
telegraf    | 2023-04-21T20:25:45Z D! [agent] Successfully connected to outputs.file::second
telegraf    | 2023-04-21T20:25:45Z D! [agent] Starting service inputs
telegraf    | other,host=dd6d242d10d3,tag=1 field=42 1682108750000000000
telegraf    | metric,host=dd6d242d10d3,tag=1 field=42 1682108750000000000
telegraf    | other,host=dd6d242d10d3,tag=1 field=42 1682108750000000000
telegraf    | metric,host=dd6d242d10d3,tag=1 field=42 1682108750000000000
telegraf    | 2023-04-21T20:25:55Z D! [outputs.file::second] Wrote batch of 2 metrics in 70.9µs
telegraf    | 2023-04-21T20:25:55Z D! [outputs.file::second] Buffer fullness: 0 / 100 metrics
telegraf    | 2023-04-21T20:25:55Z D! [outputs.file::first] Wrote batch of 2 metrics in 16.5µs
telegraf    | 2023-04-21T20:25:55Z D! [outputs.file::first] Buffer fullness: 0 / 10000 metrics

So to me it seems to be related to the use of --config-directory option when starting Telegraf and a changed behavior about the way the config files are loaded (in which order).

Does this make sense to you?

powersj commented 1 year ago

telegraf --config-directory /etc/telegraf

I did not think this would work in v1.25.3. It was not until v1.26.0 that we only allowed a config directory to be passed in.

telegraf | 2023-04-21T20:24:42Z I! Using config file: /etc/telegraf/telegraf.conf telegraf | 2023-04-21T20:24:42Z I! Starting Telegraf 1.25.3 telegraf | 2023-04-21T20:25:45Z I! Loaded inputs: exec (2x) telegraf | 2023-04-21T20:25:45Z I! Loaded outputs: file (2x)

You only loaded one file in the 1.25.3 output which has 3x exec inputs and 2x file outputs. That does not align with your example configs.

It sounds more likely that you went from a single file -> config directory. And now you are seeing that the config directory loads file in their listed order. In which case your [agent] setting is in a later file.

My suggestion is to put your [agent] settings in a file and pass that first. Either call it agent.conf in your config dir or pass it as --config agent.toml to ensure it is read first.

telegraf-tiger[bot] commented 1 year ago

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!