meowcat opened 2 years ago
Increasing `agent.metric_buffer_limit = 10000000` and `inputs.directory_monitor.max_buffered_metrics = 50000` gets to 216 processed files before stalling (112 files left).

```
agent.metric_buffer_limit = 10000000
inputs.directory_monitor.file_queue_size = 1
inputs.directory_monitor.max_buffered_metrics = 50000
```

gets all 336 files processed.
With `debug = false`, this is not slow enough: 216 processed / 120 left / 2 failed.
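For reference, a sketch of where these keys sit in `telegraf.conf` (only the relevant settings shown; the rest of the plugin config is unchanged):

```toml
[agent]
  metric_buffer_limit = 10000000

[[inputs.directory_monitor]]
  file_queue_size = 1            # queue a single file at a time
  max_buffered_metrics = 50000
```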
Hi,
As you discovered, if you have a very large number of files, large files, or a combination of the two when using the file plugin, you may need to modify your agent settings (e.g. buffer limits, interval, flush_interval, etc.) based on what your system can handle.
For example:

```toml
interval = "3s"
flush_interval = "10s"
```
This means that you are essentially collecting metrics every 3 seconds, but only sending them to an output every 10 seconds. That is not something I would recommend when you are reading lots of metrics. While metrics can also be sent early once a full batch is waiting in the buffer, it is better to align the interval and flush interval to help alleviate any pressure on the buffer.
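Concretely, an aligned agent section might look like this (the values are illustrative, not a universal recommendation):

```toml
[agent]
  ## Collect and flush at the same cadence so the buffer drains
  ## roughly as fast as it fills.
  interval = "10s"
  flush_interval = "10s"
```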
> Telegraf should process all input files in `data/in` and move them to `data/out`. If anything goes wrong (e.g. dropped metrics) it should provide some indication of the error and gracefully continue.
I certainly agree that Telegraf should process the files in most scenarios. However, that does not mean it will always work out of the box without configuration or tuning. Your sample data, 334 files totaling 76,487 lines with 47 columns each, I believe qualifies as a case that needs tuning. Right away we know you are going to be creating far more metrics than the default 10,000 metric buffer limit can hold: unpivoting each row into one metric per value column puts you on the order of a few million metrics.
As far as issues or errors, Telegraf does clearly provide warnings about dropped metrics, which are even in your logs in the form of:

```
telegraf | 2022-03-25T13:59:24Z W! [outputs.file] Metric buffer overflow; 53729 metrics have been dropped
```
> I have tried different combinations of buffer sizes, file queue length, batch sizes, intervals etc. to no avail.
Below is the simplest config I could come up with; it quickly and easily processed all the files:
```toml
[agent]
  omit_hostname = true
  metric_buffer_limit = 1000000

[[outputs.file]]
  files = ["/home/powersj/telegraf/metrics.out"]
  data_format = "influx"

[[inputs.directory_monitor]]
  name_override = "mymetric"
  directory = "/home/powersj/Downloads/in"
  finished_directory = "/home/powersj/Downloads/out"
  error_directory = "/home/powersj/Downloads/err"
  files_to_monitor = [".*\\.csv"]
  max_buffered_metrics = 100000
  file_queue_size = 100
  file_tag = "machine"
  data_format = "csv"
  csv_delimiter = "\t"
  csv_header_row_count = 0
  csv_timestamp_column = "date"
  csv_timestamp_format = "2006-01-02 15:04:05"
  csv_column_names = [
    "ts", "date",
    "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N",
    "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",
    "AA", "AB", "AC", "AD", "AE", "AF", "AG", "AH", "AI", "AJ", "AK",
    "AL", "AM", "AN", "AO", "AP", "AQ", "AR", "AS", "AT", "AU"
  ]

[[processors.unpivot]]
  order = 1
  namepass = ["mymetric"]
  tag_key = "name"
  value_key = "value"
```
You could further tune the batch size, but this config easily parsed all the files.
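If batching did become the bottleneck, the knob would be `metric_batch_size` in the agent table; a sketch with an illustrative, untested value:

```toml
[agent]
  ## Number of metrics sent to an output per write; larger batches
  ## mean fewer, larger flushes. The default is 1000.
  metric_batch_size = 10000
```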
Is there documentation that would be helpful to update for others in the future?
Hi,
my issue is not that metrics are dropped; this is expected behaviour if I have bad settings. It is evident from the output, and shows me I have to do some tuning. (In the meantime I found a solution that works with `file_queue_size = 1`.)
My issue is that `directory_monitor` stops processing the files (indefinitely) when some internal limit is hit, and no new files ever get processed. This happens without any error message. When the bug/limitation triggers, all unprocessed files stay in the `directory` indefinitely and never move to the `finished_directory` or `error_directory`. I expect the worst case to be that all files end up in the `error_directory`, or that all files end up in the `finished_directory` but no metrics get written. Both would be fine for me; I just don't think the system stalling without any notice is expected behavior. My only workaround for this would be to restart Telegraf every x hours.
> Is there documentation that would be helpful to update for others in the future?
I imagine a "Tuning Telegraf" section in the documentation (e.g. below "Configure plugins") could be useful. (I actually had never clicked on the "Videos" section because it's not a natural way for me to consume content, so I never knew there was a "Telegraf Agent Configuration Best Practices" video; but this level of detail is not covered there anyway.)
> Both would be fine for me; I just don't think the system stalling without any notice is expected behavior. My only workaround for this would be to restart Telegraf every x hours.
I agree that having Telegraf apparently stop parsing the files is undesirable. However, the large number of warnings about dropped metrics is a clear indication that something is not working as it should and is a call to action to the user that something needs to be tuned.
For example, I also think your original setting of a smaller `max_buffered_metrics`, which the docs say to match `metric_buffer_limit` in size, may have harmed your attempts as well.
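In other words, something along these lines, with the plugin-level buffer matched to the agent-level one (values illustrative):

```toml
[agent]
  metric_buffer_limit = 100000

[[inputs.directory_monitor]]
  ## Per the plugin docs, size this to match metric_buffer_limit.
  max_buffered_metrics = 100000
```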
> I imagine a "Tuning Telegraf" section in the documentation
OK, I am going to leave this issue open and use the fields that mattered in your situation as a guide to which config options should be covered in a tuning section.
Next steps: add a "Tuning Telegraf" section that explains what to do if metrics are getting dropped, things to consider doing, and things to avoid doing.
@meowcat does the issue with `directory_monitor` ceasing to process files after a certain time go away if you increase the metric buffers? Or is the issue still there in that case? We need to know in order to see if something still needs to be fixed.
@powersj I was looking into this `max_buffered_metrics` option as well, and was wondering if it is even relevant. In other words, can't we just always make it the same as `metric_buffer_limit`?
> @powersj I was looking into this `max_buffered_metrics` option as well, and was wondering if it is even relevant. In other words, can't we just always make it the same as `metric_buffer_limit`?
To reiterate what we said in chat, let's talk with the team about deprecating the option or documenting it better.
The only comments I saw about the setting on the original PR were this one and this one. So I am not sure of any additional history.
Hi,
I don't have a sure-shot way to guarantee it never stalls, but it works much better with larger buffers. The use case for this is when an interrupted connection gets restored and therefore a large amount of data is imported at once, so I never really know just how much data it will be.
Stalling should never happen, so it still seems to be a bug. Can you give us the fastest way to replicate this? A very small buffer and only a few files, maybe?
Hi,
I have now trimmed the reprex down (in_mini.zip and adapted config files; https://github.com/meowcat/20220325-telegraf-unpivot-reprex). I can still see the problem with only 3 input files if I set the remaining limits correspondingly (`interval = 1s`, `metric_buffer_limit = 500`, `flush_interval = 2s`, `max_buffered_metrics = 100`).
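For reference, a sketch of those shrunk limits as they would sit in `telegraf.conf` (the complete reprex configs are in the linked repo):

```toml
[agent]
  interval = "1s"
  flush_interval = "2s"
  metric_buffer_limit = 500

[[inputs.directory_monitor]]
  max_buffered_metrics = 100
```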
I have a similar problem with only one file at a time: it always gets moved to the processed folder, but without generating any metrics. I have 5 different input files and the following parameters:

```
agent.metric_buffer_limit = 10000000
inputs.directory_monitor.max_buffered_metrics = 50000
inputs.directory_monitor.file_queue_size = 1
```
I also used the default values and still encountered the same problem.
### Relevant telegraf.conf
### Logs from Telegraf
### System info
Telegraf 1.21.4 or 1.22, Docker on Debian 10, native Docker on Debian 10 on WSL, Docker Desktop on Win 10 and others
### Docker
```yaml
version: '3.4'
services:
  telegraf:
    container_name: telegraf
    image: telegraf:1.22
    volumes:
```
### Steps to reproduce
(More details in https://github.com/meowcat/20220325-telegraf-unpivot-reprex)
Run `docker-compose -f up` with `telegraf.conf` as config, where the `directory_monitor` input is rotated to long form with the `unpivot` processor and then written to the `file` output. I have tried the `file` and `influxdb` output plugins, as well as both in parallel.

### Expected behavior
Telegraf should process all input files in `data/in` and move them to `data/out`. If anything goes wrong (e.g. dropped metrics) it should provide some indication of the error and gracefully continue.

### Actual behavior
Telegraf processes ~10 files from `data/in` to `data/out` and then stops processing files, with no error message or other indication of a problem. Putting new files into the input dir will also not trigger processing. After restarting the Docker container, Telegraf will process another ~10 files, and so on.
### Additional info
Reprex: https://github.com/meowcat/20220325-telegraf-unpivot-reprex
Data: https://drive.switch.ch/index.php/s/OqWzQ4rVEEtEc0u
The error only happens when using the `unpivot` processor after `directory_monitor`, not with `directory_monitor` alone (see GitHub for the alternative config). I have not checked `tail` or `file`; I would suspect the large number of metrics generated by unpivoting causes an overflow somewhere.

I have tried different combinations of buffer sizes, file queue length, batch sizes, intervals etc. to no avail.
The provided log is from WSL; fewer metrics are dropped on native Debian.