influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

directory_monitor + unpivot stalls with no error message #10894

Open meowcat opened 2 years ago

meowcat commented 2 years ago

Relevant telegraf.conf

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "3s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 100000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  ## Log at debug level.
  debug = true
  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  logfile = ""
  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

# Send telegraf metrics to file(s)
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["/opt/telegraf/metrics.out"]
  data_format = "influx"
  use_batch_format = false
  rotation_interval = "0"
  rotation_max_size = "0MB"
  rotation_max_archives = 5

[[inputs.directory_monitor]]

  name_override = "mymetric"
  directory = "/data/in"
  #
  ## The directory to move finished files to.
  finished_directory = "/data/out"
  error_directory = "/data/err"
  files_to_monitor = [".*\\.csv"]
  directory_duration_threshold = "10s"
  max_buffered_metrics = 5000
  file_queue_size = 100000

  data_format = "csv"
  csv_delimiter = "\t"
  csv_header_row_count = 0
  #csv_skip_rows = 1
  csv_column_names = [
    "ts",
    "date",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "J",
    "K",
    "L",
    "M",
    "N",
    "O",
    "P",
    "Q",
    "R",
    "S",
    "T",
    "U",
    "V",
    "W",
    "X",
    "Y",
    "Z",
    "AA",
    "AB",
    "AC",
    "AD",
    "AE",
    "AF",
    "AG",
    "AH",
    "AI",
    "AJ",
    "AK",
    "AL",
    "AM",
    "AN",
    "AO",
    "AP",
    "AQ",
    "AR",
    "AS",
    "AT",
    "AU"
  ]
  csv_timestamp_column = "date"
  csv_timestamp_format = "2006-01-02 15:04:05"
  file_tag = "machine"

[[processors.unpivot]]
  order = 1
  namepass = ["mymetric"]
  tag_key = "name"
  value_key = "value"
  #tagdrop = ["machine"]

Logs from Telegraf

docker-compose up
Starting telegraf ... done
Attaching to telegraf
telegraf | 2022-03-25T13:59:22Z I! Using config file: /etc/telegraf/telegraf.conf
telegraf | 2022-03-25T13:59:22Z I! Starting Telegraf 1.22.0
telegraf | 2022-03-25T13:59:22Z I! Loaded inputs: directory_monitor
telegraf | 2022-03-25T13:59:22Z I! Loaded aggregators:
telegraf | 2022-03-25T13:59:22Z I! Loaded processors: unpivot
telegraf | 2022-03-25T13:59:22Z I! Loaded outputs: file
telegraf | 2022-03-25T13:59:22Z I! Tags enabled: host=61a134236d67
telegraf | 2022-03-25T13:59:22Z I! [agent] Config: Interval:3s, Quiet:false, Hostname:"61a134236d67", Flush Interval:10s
telegraf | 2022-03-25T13:59:22Z D! [agent] Initializing plugins
telegraf | 2022-03-25T13:59:22Z D! [agent] Connecting outputs
telegraf | 2022-03-25T13:59:22Z D! [agent] Attempting connection to [outputs.file]
telegraf | 2022-03-25T13:59:22Z D! [agent] Successfully connected to outputs.file
telegraf | 2022-03-25T13:59:22Z D! [agent] Starting service inputs
telegraf | 2022-03-25T13:59:24Z D! [outputs.file] Wrote batch of 1000 metrics in 130.74331ms
telegraf | 2022-03-25T13:59:24Z D! [outputs.file] Buffer fullness: 85158 / 100000 metrics
telegraf | 2022-03-25T13:59:24Z D! [outputs.file] Wrote batch of 1000 metrics in 135.992551ms
telegraf | 2022-03-25T13:59:24Z D! [outputs.file] Buffer fullness: 100000 / 100000 metrics
telegraf | 2022-03-25T13:59:24Z W! [outputs.file] Metric buffer overflow; 73271 metrics have been dropped
telegraf | 2022-03-25T13:59:24Z D! [outputs.file] Wrote batch of 1000 metrics in 108.057716ms
telegraf | 2022-03-25T13:59:24Z D! [outputs.file] Buffer fullness: 100000 / 100000 metrics
telegraf | 2022-03-25T13:59:24Z W! [outputs.file] Metric buffer overflow; 53729 metrics have been dropped
telegraf | 2022-03-25T13:59:24Z D! [outputs.file] Wrote batch of 1000 metrics in 94.145368ms
telegraf | 2022-03-25T13:59:24Z D! [outputs.file] Buffer fullness: 99000 / 100000 metrics
telegraf | 2022-03-25T13:59:32Z D! [outputs.file] Wrote batch of 1000 metrics in 100.444691ms
telegraf | 2022-03-25T13:59:32Z D! [outputs.file] Wrote batch of 1000 metrics in 92.004273ms
telegraf | 2022-03-25T13:59:32Z D! [outputs.file] Wrote batch of 1000 metrics in 90.857812ms
telegraf | 2022-03-25T13:59:32Z D! [outputs.file] Wrote batch of 1000 metrics in 90.163131ms
telegraf | 2022-03-25T13:59:32Z D! [outputs.file] Wrote batch of 1000 metrics in 88.621189ms
telegraf | 2022-03-25T13:59:32Z D! [outputs.file] Wrote batch of 1000 metrics in 90.008234ms
telegraf | 2022-03-25T13:59:32Z D! [outputs.file] Wrote batch of 1000 metrics in 89.934326ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 106.54132ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 99.471323ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 96.228434ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 89.097881ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 92.837014ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 92.332164ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 90.554029ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 90.685714ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 90.896274ms
telegraf | 2022-03-25T13:59:33Z D! [outputs.file] Wrote batch of 1000 metrics in 90.339688ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 89.751924ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 95.288561ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 91.075279ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 92.867588ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 94.296442ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 95.643658ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 90.940446ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 90.953624ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 90.253176ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 90.764847ms
telegraf | 2022-03-25T13:59:34Z D! [outputs.file] Wrote batch of 1000 metrics in 90.825323ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 98.191707ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 98.44443ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 91.792156ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 92.139663ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 97.008161ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 91.652571ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 93.820425ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 89.481059ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 92.411415ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 90.682637ms
telegraf | 2022-03-25T13:59:35Z D! [outputs.file] Wrote batch of 1000 metrics in 91.006148ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 98.686651ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 104.666173ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 90.323336ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 95.431408ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 92.028407ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 92.479711ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 93.743418ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 90.14061ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 90.475251ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 89.103945ms
telegraf | 2022-03-25T13:59:36Z D! [outputs.file] Wrote batch of 1000 metrics in 94.284879ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 99.504767ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 92.68628ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 97.967189ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 90.719876ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 91.548769ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 93.821748ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 92.599069ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 95.495148ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 93.170502ms
telegraf | 2022-03-25T13:59:37Z D! [outputs.file] Wrote batch of 1000 metrics in 90.158257ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 91.565076ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 89.902957ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 93.075782ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 90.948143ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 93.961806ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 90.226733ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 94.135023ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 90.930213ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 91.236827ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 91.843087ms
telegraf | 2022-03-25T13:59:38Z D! [outputs.file] Wrote batch of 1000 metrics in 88.750102ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 98.032994ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 105.283709ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 93.072829ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 93.998129ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 95.085389ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 93.96725ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 95.94414ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 91.170316ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 95.634189ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 93.092077ms
telegraf | 2022-03-25T13:59:39Z D! [outputs.file] Wrote batch of 1000 metrics in 92.511018ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 93.181399ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 90.559947ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 93.576686ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 92.486305ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 92.303799ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 92.645501ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 93.655857ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 98.235836ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 91.470974ms
telegraf | 2022-03-25T13:59:40Z D! [outputs.file] Wrote batch of 1000 metrics in 92.420418ms
telegraf | 2022-03-25T13:59:41Z D! [outputs.file] Wrote batch of 1000 metrics in 90.485421ms
telegraf | 2022-03-25T13:59:41Z D! [outputs.file] Wrote batch of 1000 metrics in 92.662521ms
telegraf | 2022-03-25T13:59:41Z D! [outputs.file] Wrote batch of 1000 metrics in 92.379334ms
telegraf | 2022-03-25T13:59:41Z D! [outputs.file] Wrote batch of 1000 metrics in 91.873798ms
telegraf | 2022-03-25T13:59:41Z D! [outputs.file] Wrote batch of 1000 metrics in 93.482953ms
telegraf | 2022-03-25T13:59:41Z D! [outputs.file] Wrote batch of 1000 metrics in 92.49551ms
telegraf | 2022-03-25T13:59:41Z D! [outputs.file] Wrote batch of 1000 metrics in 91.801329ms
telegraf | 2022-03-25T13:59:41Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
telegraf | 2022-03-25T13:59:42Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
telegraf | 2022-03-25T13:59:52Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
telegraf | 2022-03-25T14:00:02Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
telegraf | 2022-03-25T14:00:12Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
Gracefully stopping... (press Ctrl+C again to force)
Stopping telegraf ... done

System info

Telegraf 1.21.4 or 1.22; Docker on Debian 10, native Docker on Debian 10 under WSL, Docker Desktop on Windows 10, and others

Docker

version: '3.4'

services:
  telegraf:
    container_name: telegraf
    image: telegraf:1.22
    volumes:

Steps to reproduce

(More details in https://github.com/meowcat/20220325-telegraf-unpivot-reprex)

Expected behavior

Telegraf should process all input files in data/in and move them to data/out. If anything goes wrong (e.g. dropped metrics) it should provide some indication of the error and gracefully continue.

Actual behavior

Telegraf processes ~10 files from data/in to data/out and then stops processing files, with no error message or other indication of a problem. Putting new files into the input directory does not trigger processing either.

After restarting the Docker container, Telegraf processes another ~10 files, and so on.

Additional info

https://github.com/meowcat/20220325-telegraf-unpivot-reprex Data: https://drive.switch.ch/index.php/s/OqWzQ4rVEEtEc0u

The error only happens when using the unpivot processor after directory_monitor, not with directory_monitor alone (see the GitHub repo for the alternative config).
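For reference, a rough sketch of the "directory_monitor alone" variant (assuming it is simply the config above with the [[processors.unpivot]] block removed, so each CSV row stays a single metric with many fields instead of being split into one metric per field):

[[inputs.directory_monitor]]
  name_override = "mymetric"
  directory = "/data/in"
  finished_directory = "/data/out"
  error_directory = "/data/err"
  files_to_monitor = [".*\\.csv"]
  data_format = "csv"
  ## ... same CSV parser settings as above, just with no [[processors.unpivot]] section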

I have not checked the tail or file inputs; I suspect the large number of metrics generated by unpivoting causes an overflow somewhere.

I have tried different combinations of buffer sizes, file queue length, batch sizes, intervals etc to no avail.

The provided log is from WSL; fewer metrics are dropped on native Debian.

meowcat commented 2 years ago

Increasing agent.metric_buffer_limit = 10000000 and inputs.directory_monitor.max_buffered_metrics = 50000 gets to 216 processed files before stalling (112 files left).

meowcat commented 2 years ago
agent.metric_buffer_limit = 10000000
inputs.directory_monitor.file_queue_size = 1
inputs.directory_monitor.max_buffered_metrics = 50000

gets all 336 files processed.
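For clarity, those dotted settings map onto the telegraf.conf sections like this (a sketch; everything else stays as in the original config above):

[agent]
  metric_buffer_limit = 10000000

[[inputs.directory_monitor]]
  ## Hand files to the parser one at a time; combined with the large agent
  ## buffer this is what avoids the stall in my setup
  file_queue_size = 1
  max_buffered_metrics = 50000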

meowcat commented 2 years ago

With debug = false (so without the slowdown from debug logging), this is no longer slow enough: 216 files processed / 120 left / 2 failed.

powersj commented 2 years ago

Hi,

As you discovered, when using the file plugins with a very large number of files, large files, or a combination of the two, you may need to modify your agent settings (e.g. buffer limits, interval, flush_interval) based on what your system can handle.

For example:

  interval = "3s"
  flush_interval = "10s"

This means that you are essentially collecting metrics every 3 seconds, but only sending them to an output every 10 seconds. This is not something I would recommend when you are reading lots of metrics. While a write is also triggered whenever a full batch is waiting in the buffer, it is better to align the interval and flush interval to help alleviate pressure on the buffer.
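As an illustration, aligning the two would look something like this (a sketch only; the actual values depend on what your system and output can handle):

[agent]
  ## Collect and flush at the same cadence so metrics do not pile up in the buffer
  interval = "10s"
  flush_interval = "10s"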

Telegraf should process all input files in data/in and move them to data/out. If anything goes wrong (e.g. dropped metrics) it should provide some indication of the error and gracefully continue.

I certainly agree that Telegraf should process the files in most scenarios. However, that does not mean that no configuration or tuning will ever be required out of the box. Your sample data, 334 files totaling 76,487 lines with 47 columns each, I believe qualifies as needing to tune Telegraf: with roughly 76,487 rows each unpivoted into ~45 separate metrics, that is on the order of 3.5 million metrics, so right away we know you are going to be creating far more than the default 10,000 metric buffer limit.

As far as issues or errors go, Telegraf does clearly report dropped metrics; the warnings are even in your logs, in the form of:

telegraf | 2022-03-25T13:59:24Z W! [outputs.file] Metric buffer overflow; 53729 metrics have been dropped

I have tried different combinations of buffer sizes, file queue length, batch sizes, intervals etc to no avail.

Below is the simplest config I could come up with; it quickly and easily processed all the files:

[agent]
  omit_hostname = true
  metric_buffer_limit = 1000000

[[outputs.file]]
  files = ["/home/powersj/telegraf/metrics.out"]
  data_format = "influx"

[[inputs.directory_monitor]]
  name_override = "mymetric"
  directory = "/home/powersj/Downloads/in"
  finished_directory = "/home/powersj/Downloads/out"
  error_directory = "/home/powersj/Downloads/err"
  files_to_monitor = [".*\\.csv"]
  max_buffered_metrics = 100000
  file_queue_size = 100

  file_tag = "machine"

  data_format = "csv"
  csv_delimiter = "\t"
  csv_header_row_count = 0
  csv_timestamp_column = "date"
  csv_timestamp_format = "2006-01-02 15:04:05"
  csv_column_names = [
    "ts",
    "date",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "J",
    "K",
    "L",
    "M",
    "N",
    "O",
    "P",
    "Q",
    "R",
    "S",
    "T",
    "U",
    "V",
    "W",
    "X",
    "Y",
    "Z",
    "AA",
    "AB",
    "AC",
    "AD",
    "AE",
    "AF",
    "AG",
    "AH",
    "AI",
    "AJ",
    "AK",
    "AL",
    "AM",
    "AN",
    "AO",
    "AP",
    "AQ",
    "AR",
    "AS",
    "AT",
    "AU"
  ]

[[processors.unpivot]]
  order = 1
  namepass = ["mymetric"]
  tag_key = "name"
  value_key = "value"

You could further tune the batch size, but this was able to easily parse the files.
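If you do want to experiment with the batch size, it is just another [agent] setting (the value here is purely illustrative):

[agent]
  omit_hostname = true
  metric_buffer_limit = 1000000
  ## Illustrative value: larger batches mean fewer, larger writes per flush
  metric_batch_size = 5000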

Is there documentation that would be helpful to update this for others in the future?

meowcat commented 2 years ago

Hi,

My issue is not that metrics are dropped; that is expected behaviour if I have bad settings. It is evident from the output and shows me I have to do some tuning. (In the meantime I found a solution that works with file_queue_size = 1.)

My issue is that directory_monitor stops processing the files (indefinitely) when some internal limit is hit, and no new files ever get processed. This happens without any error message. When the bug/limitation triggers, all unprocessed files stay in the directory indefinitely and never move to the finished_directory or error_directory. I expect the worst case to be that all files end up in the error_directory, or that all files end up in the finished_directory but no metrics get written. Both would be fine for me, I just don't think the system stalling without any notice is expected behavior. My only workaround for this would be to restart telegraf every x hours.

Is there documentation that would be helpful to update this for others in the future?

I imagine a "Tuning Telegraf" section in the documentation (e.g. below "Configure plugins") could be useful. (I had actually never clicked on the "videos" section, because video is not a natural way for me to consume content, so I never knew there was a "Telegraf Agent Configuration Best Practices" video; in any case, this level of detail is not covered there.)

powersj commented 2 years ago

Both would be fine for me, I just don't think the system stalling without any notice is expected behavior. My only workaround for this would be to restart telegraf every x hours.

I agree that having Telegraf apparently stop parsing the files is undesirable. However, the large number of warnings about dropped metrics is a clear indication that something is not working as it should and is a call to action to the user that something needs to be tuned.

For example, I also think your original setting of a smaller max_buffered_metrics, which the docs say should match metric_buffer_limit in size, may have hurt your attempts.
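In other words, a sketch of keeping the two in step (values illustrative):

[agent]
  metric_buffer_limit = 1000000

[[inputs.directory_monitor]]
  ## Matched to metric_buffer_limit, as the plugin docs suggest
  max_buffered_metrics = 1000000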

I imagine a "Tuning Telegraf" section in the documentation

OK, I am going to leave this issue open and use some of the important settings from your situation as the config options that should be covered in a tuning section.

Next steps: add a "Tuning Telegraf" section that explains what to do if metrics are getting dropped, things to consider doing, and things to avoid.

Hipska commented 2 years ago

@meowcat does the issue with directory_monitor no longer processing files after a certain time go away if you increase the metric buffers? Or is the issue still there in that case? We need to know whether something still needs to be fixed.

@powersj I was looking into this max_buffered_metrics option as well, and was wondering if it is even relevant. In other words, can't we just always make it the same as metric_buffer_limit?

powersj commented 2 years ago

@powersj I was looking into this max_buffered_metrics option as well, and was wondering if it is even relevant. In other words, can't we just always make it the same as metric_buffer_limit?

To reiterate what we said in chat: let's talk with the team about deprecating it or documenting it better.

The only comments I saw about the setting on the original PR were this one and this one. So I am not sure of any additional history.

meowcat commented 2 years ago

Hi,

I don't have a sure-shot way to guarantee it never stalls, but it works much better with larger buffers. The use case here is when an interrupted connection gets restored and a large amount of data is therefore imported at once, so I never really know in advance just how much data it will be.

Hipska commented 2 years ago

Stalling should never happen, so this still seems to be a bug. Can you give us the fastest way to replicate it? A very small buffer and only a few files, maybe?

meowcat commented 2 years ago

Hi,

I have now trimmed the reprex down (in_mini.zip and adapted config files; https://github.com/meowcat/20220325-telegraf-unpivot-reprex). I can still see the problem with only 3 input files if I scale the remaining limits down correspondingly (interval = 1s, metric_buffer_limit = 500, flush_interval = 2s, max_buffered_metrics = 100).
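Expressed as config, the trimmed-down limits look like this (a sketch; the rest of the reprex config is unchanged):

[agent]
  interval = "1s"
  flush_interval = "2s"
  metric_buffer_limit = 500

[[inputs.directory_monitor]]
  max_buffered_metrics = 100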

allato commented 3 months ago

I have a problem similar to this with only one file at a time: the file always gets moved to the processed folder, but without generating any metrics. I have 5 different input files and the following parameters:

agent.metric_buffer_limit = 10000000
inputs.directory_monitor.max_buffered_metrics = 50000
inputs.directory_monitor.file_queue_size = 1

I also used the default values and still encountered the same problem.