Closed Vitaljok closed 1 year ago
The bug is not related to the file permissions
That is exactly what this is :) it just happens to be during a file rotate telegraf does not have permission to the file. Any fix will be in response to an error around file permissions, no?
But it would be nice to have built-in retry mechanism for such cases.
What level of retry, as immediately retrying would probably still fail given your log rotate took a few seconds:
That is exactly what this is :)
Ok, the bug is releated to file permissions, indeed. It was an attempt to distinguish it from dozens of posts on how to configure the Telegraf to properly start at the first place. :)
What level of retry...
I would say the good example is tail
from coreutils
.
I did few experiments with it and tried to mimic logrotate
process.
In one terminal I started tail -v -F /tmp/test.log
, and on the second I manipulated the target file.
1| $ tail -v -F /tmp/test.log
1| tail: cannot open '/tmp/test.log' for reading: No such file or directory
2| # (umask 026; touch /tmp/test.log)
2| # ll /tmp/test.log
2| -rw-r----- 1 root root 0 apr 28 08:52 /tmp/test.log
2| # echo "Hello!" >> /tmp/test.log
2| # chown root:adm /tmp/test.log
2| # echo "World" >> /tmp/test.log
1| tail: '/tmp/test.log' has become accessible
1| Hello!
1| World
2| # mv /tmp/test.log /tmp/test.log.old
1| tail: '/tmp/test.log' has become inaccessible: No such file or directory
2| # (umask 026; touch /tmp/test.log)
2| # chown root:adm /tmp/test.log
2| # echo "It still works..." >> /tmp/test.log
1| tail: '/tmp/test.log' has become accessible
1| It still works...
It also works when the file is not moved away, but rather access rights are removed.
2| # chown root:root /tmp/test.log
1| tail: cannot watch '/tmp/test.log': Permission denied
2| # chown root:adm /tmp/test.log
2| # echo "Works even after permission denied" >> /tmp/test.log
1| Works even after permission denied
As you can see tail -F
is quite tolerant to missing files and permissions. And after getting access to file, it is able to "catch up" with all new lines.
Not sure if it is possible in Telegraf plugin, but it is good benchmark for expectations.
only retry if it has successfully read the file once?
This would be good option to "fail fast" if the file is not accessible at startup (like typo in file name, or really missing permissions).
Thanks for the response - I want to chat this over with sven next
One possible option we talked about today: the tail plugin is a service plugin. Meaning it gathers data as it needs to, rather than using the gather cycle. We could shift to using gather cycle to retry files that may have failed. We would need to keep a list of failed files, re-try during gather, and remove them if they start working.
@powersj is there an example of this pattern? I would like to attempt to take on this issue.
@dayvar14,
Happy to see a PR! We don't have something precisely like this already. Essentially I think what we wanted to do was:
Gather
function start to check if that slice is not empty and if so go through those files and re-try themThis would help with short-term issues like these rotation issues, so we do not fail, but at least do a retry.
@dayvar14,
Happy to see a PR! We don't have something precisely like this already. Essentially I think what we wanted to do was:
- Catch this specific error and add the file to a slice of failed files
- In the
Gather
function start to check if that slice is not empty and if so go through those files and re-try them- If the read starts to succeed we can remove it from the slice and continue to read, essentially re-init on that one file
- If it fails, we can try again during the next call to gather
This would help with short-term issues like these rotation issues, so we do not fail, but at least do a retry.
@powersj
@ericmiller-wpengine discovered that simply deleting the tail from the list of active tails when the error occurs seems to resolve when the tailNewFiles is called during the Gather function here. Seems to do what you and sven discussed.
PR was opened https://github.com/influxdata/telegraf/pull/14357.
@Vitaljok can you test to see if this PR resolves the issue for you?
Thanks for the PR.
Deployed it to my host.
2023-11-28T11:02:55+02:00 I! Starting Telegraf 1.29.0-4b867c55 brought to you by InfluxData the makers of InfluxDB
So far no problems.
Also tried to manually run logrotate
, but I'm not sure if it actually did anything.
Let's see how it behaves during true nightly logrotate.
I will let you know if I note anything strange.
Seems the PR did not resolved the issue completely. :(
2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Tailing "/var/log/nginx/stream.log": Unable to open file /var/log/nginx/stream.log: open /var/log/nginx/stream.log: permission denied
2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Deleting tailer for "/var/log/nginx/stream.log"
Seems the PR did not resolved the issue completely. :(
2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Tailing "/var/log/nginx/stream.log": Unable to open file /var/log/nginx/stream.log: open /var/log/nginx/stream.log: permission denied 2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Deleting tailer for "/var/log/nginx/stream.log"
Do you have the entire log or does it stop after deleting the trailer? The PR outputs that message that it's deleting the trailer, then creates a new one. Are we sure that the file ingestion was completely interrupted and that no new tailer was created?
@dayvar14 is correct; the Deleting trailer
log message was added in the fix PR. the next time the Gather() function runs it should open the file you are tailing with the plugin again, so the important question is whether you can still see output from the tail plugin after those messages were logged. do you see whether you are still getting output from "/var/log/nginx/stream.log", wherever your telegraf is outputting it?
@dayvar14, unfortinately after this Deleting tailer
message the service stopped.
I run it inside docker container, and it just exited after these messages.
I assume the main telegraf
process (PID 1) just stoped or crashed.
Logs also don't show anything more.
do you see whether you are still getting output from "/var/log/nginx/stream.log", wherever your telegraf is outputting it?
No, actually this is how I recognized the case - target InfluxDB had no new records.
I'm now wondering, might it be when "...next time the Gather() function runs..." it actually creates new process and messes up with Docker expecting PID 1 to be running?
No, at least telegraf main process (PID 1) should still be running..
In my case, we have numerous telegraf containers each with many input plugins in use, the tail plugin is tailing multiple log files, and a log message like this (copying yours):
2023-04-27T00:00:02+03:00 E! [inputs.tail] Tailing "/var/log/nginx/access.log": Unable to open file /var/log/nginx/access.log: open /var/log/nginx/access.log: permission denied
was definitely not fatal to the telegraf process (which is PID 1 in a container, fwiw). This can happen to 1 log file and not affect others also being consumed with inputs.tail.
When I first ran into this, I ended up finding that ls -la /proc/1/fd
would show that there was no open file descriptor for the logged-about file. Others that were still collecting appeared, but telegraf did not reopen the file unless I sent a HUP. Now, when I get a message like yours:
2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Deleting tailer for "/var/log/nginx/stream.log"
I will see an open file descriptor within seconds, the next interval
iirc.
There's a debug level log when a file is opened here https://github.com/influxdata/telegraf/blob/master/plugins/inputs/tail/tail.go#L220 as well.
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.26.1 (git: HEAD@8f421f25), Docker 20.10.21, Ubuntu 22.04.2 LTS
Docker
Releant part of
docker-compose.yml
config. Group 4 isadm
with read access to the log files.Steps to reproduce
tail
on log file;logrotate
utility;Expected behavior
Retry tailing the file after failure.
Actual behavior
Tail stops after the error.
Additional info
The bug is not related to the file permissions, I'm able to start the Telegraf and obtain the logs as expected.
But due to (un-)lucky circumstances
tail
plugin was trying to read the file exactly at the same timelogrotate
did its job.As far as I understand internal workings of the
logrotate
, it first creates the new log file, and then applieschmod
andchown
. And seemstail
plugin failed in a middle of this process and stopped sending the logs. Restarting the Telegraf at the morning did resolved the problem.But it would be nice to have built-in retry mechanism for such cases. Actually
tail
does retry, but only when file does not exist, rather then when file exists but (temporarly) is not accessible. https://github.com/influxdata/tail/blob/master/tail.go#L194