influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Permission denied in the middle of logrotate process #13154

Closed: Vitaljok closed this issue 11 months ago

Vitaljok commented 1 year ago

Relevant telegraf.conf

[[inputs.tail]]
  files = ["/var/log/nginx/access.log"]
  data_format = "grok"
  grok_patterns = ["%{COMBINED_LOG_FORMAT}"]

Logs from Telegraf

2023-04-27T00:00:02+03:00 E! [inputs.tail] Tailing "/var/log/nginx/access.log": Unable to open file /var/log/nginx/access.log: open /var/log/nginx/access.log: permission denied

System info

Telegraf 1.26.1 (git: HEAD@8f421f25), Docker 20.10.21, Ubuntu 22.04.2 LTS

Docker

Relevant part of the docker-compose.yml config. Group 4 is adm, which has read access to the log files.

services:
    telegraf:
        image: telegraf
        volumes:
            - /var/log/nginx:/var/log/nginx:ro
        group_add:
            - "4"

Steps to reproduce

  1. Start Telegraf with tail on a log file;
  2. Rotate the log file daily using the logrotate utility;
  3. Eventually tail will stop with a permission denied error. In my case it stopped after 9 days of daily log rotation.

Expected behavior

Retry tailing the file after failure.

Actual behavior

Tail stops after the error.

Additional info

The bug is not related to the file permissions as such; I'm able to start Telegraf and obtain the logs as expected.

  File: /var/log/nginx/access.log
  Size: 285582      Blocks: 568        IO Block: 4096   regular file
Device: fd00h/64768d    Inode: 524853      Links: 1
Access: (0640/-rw-r-----)  Uid: (   33/www-data)   Gid: (    4/     adm)
Access: 2023-04-27 09:07:29.001913262 +0300
Modify: 2023-04-27 12:32:01.672785512 +0300
Change: 2023-04-27 12:32:01.672785512 +0300
 Birth: 2023-04-27 00:00:02.631913631 +0300

But due to (un)lucky circumstances the tail plugin tried to read the file at exactly the moment logrotate did its job.

Apr 27 00:00:01 docker-host systemd[1]: Starting Rotate log files...
Apr 27 00:00:03 docker-host systemd[1]: logrotate.service: Deactivated successfully.
Apr 27 00:00:03 docker-host systemd[1]: Finished Rotate log files.
Apr 27 00:00:03 docker-host systemd[1]: logrotate.service: Consumed 1.022s CPU time.

As far as I understand the internal workings of logrotate, it first creates the new log file and then applies chmod and chown. It seems the tail plugin failed in the middle of this process and stopped sending the logs. Restarting Telegraf in the morning resolved the problem.
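
That create-then-chown window can be modeled with a small Go sketch. This is a simulation only, not logrotate or Telegraf code: the mode values mirror the stat output above, and the function name is illustrative.

```go
package main

import "fmt"

// groupCanRead reports whether the group read bit is set on a Unix mode.
// During rotation, logrotate briefly leaves the new file without its
// final 0640 mode (and adm group), so a reader relying on group access
// is denied during that window.
func groupCanRead(mode uint32) bool {
	return mode&0o040 != 0
}

func main() {
	// Just after logrotate creates the file, before chmod/chown run:
	fmt.Println(groupCanRead(0o600)) // false: a tailer in group adm gets EACCES
	// After logrotate finishes applying permissions:
	fmt.Println(groupCanRead(0o640)) // true: tailing works again
}
```

(The chown to group adm matters as well; the sketch only models the mode bit, which is the part visible in the stat output.)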

But it would be nice to have a built-in retry mechanism for such cases. Actually, tail does retry, but only when the file does not exist, rather than when the file exists but is (temporarily) inaccessible. https://github.com/influxdata/tail/blob/master/tail.go#L194

powersj commented 1 year ago

The bug is not related to the file permissions

That is exactly what this is :) It just happens that, during a file rotation, Telegraf does not have permission to the file. Any fix will be in response to an error around file permissions, no?

But it would be nice to have built-in retry mechanism for such cases.

What level of retry? Retrying immediately would probably still fail, given that your log rotation took a few seconds:

Vitaljok commented 1 year ago

That is exactly what this is :)

Ok, the bug is related to file permissions, indeed. The wording was an attempt to distinguish it from the dozens of posts about how to configure Telegraf to start properly in the first place. :)

What level of retry...

I would say a good example is tail from coreutils. I did a few experiments with it and tried to mimic the logrotate process. In one terminal I started tail -v -F /tmp/test.log, and in the second I manipulated the target file.

1| $ tail -v -F /tmp/test.log
1| tail: cannot open '/tmp/test.log' for reading: No such file or directory

2| # (umask 026; touch /tmp/test.log)
2| # ll /tmp/test.log 
2| -rw-r----- 1 root root 0 apr 28 08:52 /tmp/test.log
2| # echo "Hello!" >> /tmp/test.log 
2| # chown root:adm /tmp/test.log 
2| # echo "World" >> /tmp/test.log 

1| tail: '/tmp/test.log' has become accessible
1| Hello!
1| World

2| # mv /tmp/test.log /tmp/test.log.old

1| tail: '/tmp/test.log' has become inaccessible: No such file or directory

2| # (umask 026; touch /tmp/test.log)
2| # chown root:adm /tmp/test.log 
2| # echo "It still works..." >> /tmp/test.log 

1| tail: '/tmp/test.log' has become accessible
1| It still works...

It also works when the file is not moved away, but its access rights are removed instead.

2| # chown root:root /tmp/test.log

1| tail: cannot watch '/tmp/test.log': Permission denied

2| # chown root:adm /tmp/test.log
2| # echo "Works even after permission denied" >> /tmp/test.log

1| Works even after permission denied

As you can see, tail -F is quite tolerant of missing files and permission errors. And after regaining access to the file, it is able to "catch up" with all new lines. Not sure if this is possible in the Telegraf plugin, but it is a good benchmark for expectations.

only retry if it has successfully read the file once?

This would be a good option to "fail fast" if the file is not accessible at startup (e.g. a typo in the file name, or genuinely missing permissions).

powersj commented 1 year ago

Thanks for the response. I want to chat this over with Sven next.

powersj commented 1 year ago

One possible option we talked about today: the tail plugin is a service plugin, meaning it gathers data as it needs to rather than on the gather cycle. We could shift to using the gather cycle to retry files that have failed. We would need to keep a list of failed files, re-try them during each gather, and remove them once they start working.

dayvar14 commented 1 year ago

@powersj is there an example of this pattern? I would like to attempt to take on this issue.

powersj commented 1 year ago

@dayvar14,

Happy to see a PR! We don't have something precisely like this already. Essentially I think what we wanted to do was:

  1. Catch this specific error and add the file to a slice of failed files
  2. In the Gather function, check whether that slice is non-empty and, if so, go through those files and re-try them
  3. If the read starts to succeed, remove the file from the slice and continue to read, essentially re-initializing that one file
  4. If it fails, try again during the next call to Gather

This would help with short-lived issues like these rotation races, so we do not fail outright but at least retry.
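
The four steps above can be sketched in Go. All names here (tailPlugin, openFile, canOpen) are illustrative, not Telegraf's actual plugin code; the point is the bookkeeping of the failed-files slice across Gather calls.

```go
package main

import "fmt"

// tailPlugin is a toy stand-in for the tail input plugin's state.
type tailPlugin struct {
	active map[string]bool // files currently tailed
	failed []string        // files whose open failed (step 1)
}

// openFile records the outcome of an open attempt; a failure adds the
// file to the failed slice instead of aborting the tail.
func (p *tailPlugin) openFile(path string, ok bool) {
	if ok {
		p.active[path] = true
		return
	}
	p.failed = append(p.failed, path)
}

// Gather re-tries every failed file (steps 2-4); canOpen stands in for
// the real permission check. Files that now open move back to active.
func (p *tailPlugin) Gather(canOpen func(string) bool) {
	still := p.failed[:0]
	for _, path := range p.failed {
		if canOpen(path) {
			p.active[path] = true // step 3: re-init this one file
		} else {
			still = append(still, path) // step 4: try again next cycle
		}
	}
	p.failed = still
}

func main() {
	p := &tailPlugin{active: map[string]bool{}}
	p.openFile("/var/log/nginx/access.log", false) // denied mid-rotate
	fmt.Println(len(p.failed)) // 1

	p.Gather(func(string) bool { return false }) // still denied
	fmt.Println(len(p.failed)) // 1

	p.Gather(func(string) bool { return true }) // rotation finished
	fmt.Println(len(p.failed), p.active["/var/log/nginx/access.log"]) // 0 true
}
```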

dayvar14 commented 11 months ago

@powersj

@ericmiller-wpengine discovered that simply deleting the tailer from the list of active tailers when the error occurs seems to resolve this, because tailNewFiles is called during the Gather function and re-opens the file. This seems to do what you and Sven discussed.

A PR was opened: https://github.com/influxdata/telegraf/pull/14357.

dayvar14 commented 11 months ago

@Vitaljok can you test to see if this PR resolves the issue for you?

Vitaljok commented 11 months ago

Thanks for the PR.

Deployed it to my host.

2023-11-28T11:02:55+02:00 I! Starting Telegraf 1.29.0-4b867c55 brought to you by InfluxData the makers of InfluxDB

So far, no problems. I also tried running logrotate manually, but I'm not sure it actually did anything. Let's see how it behaves during the real nightly logrotate.

I will let you know if I note anything strange.

Vitaljok commented 10 months ago

Seems the PR did not resolve the issue completely. :(

2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Tailing "/var/log/nginx/stream.log": Unable to open file /var/log/nginx/stream.log: open /var/log/nginx/stream.log: permission denied
2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Deleting tailer for "/var/log/nginx/stream.log"
dayvar14 commented 10 months ago

Seems the PR did not resolve the issue completely. :(


2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Tailing "/var/log/nginx/stream.log": Unable to open file /var/log/nginx/stream.log: open /var/log/nginx/stream.log: permission denied

2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Deleting tailer for "/var/log/nginx/stream.log"

Do you have the entire log, or does it stop after deleting the tailer? The PR outputs that message when it's deleting the tailer, then creates a new one. Are we sure that file ingestion was completely interrupted and that no new tailer was created?

ericmiller-wpengine commented 10 months ago

@dayvar14 is correct; the Deleting tailer log message was added in the fix PR. The next time the Gather() function runs, it should open the file you are tailing with the plugin again, so the important question is whether you can still see output from the tail plugin after those messages were logged. Are you still getting output from "/var/log/nginx/stream.log", wherever your telegraf is outputting it?

Vitaljok commented 10 months ago

@dayvar14, unfortunately after this Deleting tailer message the service stopped. I run it inside a Docker container, and it just exited after these messages. I assume the main telegraf process (PID 1) stopped or crashed. The logs also don't show anything more.

do you see whether you are still getting output from "/var/log/nginx/stream.log", wherever your telegraf is outputting it?

No, actually this is how I recognized the case - target InfluxDB had no new records.

I'm now wondering: could it be that when "...the next time the Gather() function runs...", it actually creates a new process and confuses Docker, which expects PID 1 to keep running?

Hipska commented 10 months ago

No; at the very least, the main Telegraf process (PID 1) should still be running.

ericmiller-wpengine commented 10 months ago

In my case, we have numerous telegraf containers, each with many input plugins in use; the tail plugin is tailing multiple log files, and a log message like this (copying yours):

2023-04-27T00:00:02+03:00 E! [inputs.tail] Tailing "/var/log/nginx/access.log": Unable to open file /var/log/nginx/access.log: open /var/log/nginx/access.log: permission denied

was definitely not fatal to the telegraf process (which is PID 1 in a container, fwiw). This can happen to one log file without affecting others also being consumed with inputs.tail.

When I first ran into this, I found that ls -la /proc/1/fd would show no open file descriptor for the file in the error message. Files that were still being collected appeared there, but telegraf did not reopen the failed file unless I sent a HUP. Now, when I get a message like yours:

2023-12-13T00:00:01+02:00 E! [inputs.tail::stream tail] Deleting tailer for "/var/log/nginx/stream.log"

I will see an open file descriptor within seconds, the next interval iirc.

There's also a debug-level log message when a file is opened: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/tail/tail.go#L220