influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

tail input fails silently when trying to read files inside directories with only execute permissions #9129

Open hluaces opened 3 years ago

hluaces commented 3 years ago

The tail input (and I think that logparser does this too) fails silently when attempting to read a file inside a directory which has execute but not read permissions for the telegraf user.

I'm providing a Dockerfile to reproduce the issues below.

Below you can see the file permissions schema and proof that the user can read the file:

bash-4.2$ whoami
telegraf

bash-4.2$ namei -mo /var/log/apache2/error_log 
f: /var/log/apache2/error_log
 drwxr-xr-x root root /
 drwxr-xr-x root root var
 drwxr-xr-x root root log
 drwx--x--x root root apache2
 -rw-r--r-- root root error_log

bash-4.2$ tail /var/log/apache2/error_log 
example,result=ok value=1i
example,result=ok value=2i
example,result=ok value=3i
example,result=error value=3i
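
The permission layout above can be reproduced without Docker. The following is a minimal Python sketch (not Telegraf code; `probe` and `demo` are hypothetical names) that creates a directory with only execute (search) permission and checks two things separately: whether the directory can be listed, and whether a file inside it can be opened by its full path. It assumes a POSIX system; a root user bypasses these checks, so the listing would succeed there.

```python
import os
import shutil
import tempfile


def probe(directory, filename):
    """Check listing the directory and opening a known file inside it."""
    result = {}
    try:
        os.listdir(directory)  # needs read permission on the directory
        result["list"] = True
    except PermissionError:
        result["list"] = False
    try:
        # Opening by full path needs only search (execute) permission
        # on the directory, plus read permission on the file itself.
        with open(os.path.join(directory, filename)) as fh:
            fh.read()
        result["open"] = True
    except PermissionError:
        result["open"] = False
    return result


def demo():
    """Build a throwaway execute-only directory and probe it."""
    base = tempfile.mkdtemp()
    logdir = os.path.join(base, "apache2")
    os.mkdir(logdir)
    logfile = os.path.join(logdir, "error_log")
    with open(logfile, "w") as fh:
        fh.write("example,result=ok value=1i\n")
    os.chmod(logfile, 0o644)
    # 0o111 removes read permission even for the owner, so the effect is
    # visible without switching users (the report uses 711 and another user).
    os.chmod(logdir, 0o111)
    try:
        return probe(logdir, "error_log")
    finally:
        os.chmod(logdir, 0o700)  # restore access so cleanup can succeed
        shutil.rmtree(base)


if __name__ == "__main__":
    print(demo())
```

For an unprivileged user, the listing should be denied while the open succeeds, which matches the `namei` and `tail` output above.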

Below you can see that telegraf fails silently when running under the telegraf user:

bash-4.2$ /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d --debug --test-wait 15 
2021-04-14T10:15:01Z I! Starting Telegraf 1.18.1
2021-04-14T10:15:01Z D! [agent] Initializing plugins
2021-04-14T10:15:01Z D! [agent] Starting service inputs
2021-04-14T10:15:16Z D! [agent] Stopping service inputs
2021-04-14T10:15:16Z D! [agent] Input channel closed
2021-04-14T10:15:16Z D! [agent] Stopped Successfully

It works as expected with a root user or with a /var/log/apache2 directory with read and execute permissions:

[root@400fb31a3249 /]# /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d --debug --test-wait 15 
2021-04-14T10:17:07Z I! Starting Telegraf 1.18.1
2021-04-14T10:17:07Z D! [agent] Initializing plugins
2021-04-14T10:17:07Z D! [agent] Starting service inputs
2021-04-14T10:17:07Z D! [inputs.tail] Tail added for "/var/log/apache2/error_log"
> example,host=400fb31a3249,path=/var/log/apache2/error_log,result=ok value=1i 1618395427510170897
> example,host=400fb31a3249,path=/var/log/apache2/error_log,result=ok value=2i 1618395427510177215
> example,host=400fb31a3249,path=/var/log/apache2/error_log,result=ok value=3i 1618395427510181769
> example,host=400fb31a3249,path=/var/log/apache2/error_log,result=error value=3i 1618395427510183472
2021-04-14T10:17:22Z D! [agent] Stopping service inputs
2021-04-14T10:17:22Z D! [inputs.tail] Tail removed for "/var/log/apache2/error_log"
2021-04-14T10:17:22Z D! [agent] Input channel closed
2021-04-14T10:17:22Z D! [agent] Stopped Successfully

Relevant telegraf.conf:

[global_tags]

[agent]
  interval = "1s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "5s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false

[[outputs.file]]
  files = ["stdout"]

[[inputs.tail]]
  files = ["/var/log/apache2/error_log"]
  from_beginning = true
  watch_method = "poll"
  data_format = "influx"

System info:

Docker

Steps to reproduce:

I've provided a Dockerfile that reproduces the error:

https://github.com/hluaces/telegraf-bug-9129

If you run the container (docker run --rm local/bug-telegraf) you'll see that no data is gathered after a 15s wait time.

If you run it as root inside the container (docker run --rm -u root local/bug-telegraf) you'll see that data is gathered.

You can open a shell inside the container by using docker run --rm -it --entrypoint bash local/bug-telegraf.

Expected behavior:

Actual behavior:

Additional info:

sspaink commented 3 years ago

@hluaces Amazing job writing this issue! Providing a Dockerfile that recreates the issue is super useful and fantastic, I really appreciate it. I was able to reproduce the issue locally as well, and the problem is right here: https://github.com/bmatcuk/doublestar/blob/v3/doublestar.go#L441 where the error from opening the directory path isn't returned. That said, the Telegraf code (https://github.com/influxdata/telegraf/blob/master/internal/globpath/globpath.go#L53) isn't checking the error either, so it would be ignored even if it were returned.

Looks like the doublestar library recently went through a big overhaul and now has a version 4 available, so this might be handled in the new version. I can play around with it and see if it fixes the problem.
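
To illustrate the failure mode described above, here is a minimal Python sketch. It is not the doublestar or Telegraf code; `match_silent` and `match_reporting` are hypothetical names. The first function drops the error from listing a directory, so the caller cannot tell "no files matched" from "could not read the directory". The second returns the error alongside the matches so the caller can log it.

```python
import fnmatch
import os


def match_silent(directory, pattern):
    """Mimic the reported behaviour: a listing error becomes an empty result."""
    try:
        names = os.listdir(directory)
    except OSError:
        return []  # the error is swallowed here
    return sorted(
        os.path.join(directory, n) for n in names if fnmatch.fnmatch(n, pattern)
    )


def match_reporting(directory, pattern):
    """Return (matches, error) so the caller can surface the failure."""
    try:
        names = os.listdir(directory)
    except OSError as exc:
        return [], exc
    matches = sorted(
        os.path.join(directory, n) for n in names if fnmatch.fnmatch(n, pattern)
    )
    return matches, None


if __name__ == "__main__":
    matches, err = match_reporting("/var/log/apache2", "error_log")
    if err is not None:
        print(f"E! failed to match: {err}")
    else:
        print(matches)
```

With the second form, a caller that checks `err` can emit a log line instead of silently tailing nothing.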

hluaces commented 3 years ago

Is there any news on this?

I'm not trying to pressure anyone, just trying to manage expectations, as this bug forces me to use some ugly workarounds. Thank you for your time.

sspaink commented 3 years ago

Thanks for reminding me about this, I honestly forgot. I'm afraid there is no news, as I haven't made progress with updating the doublestar library. I will make a note to look at it this week and get back to you; hopefully it will resolve the issue.

edit: I remember now what stopped me: v4 depends on io/fs and therefore requires Go v1.16+. We will be moving away from v1.15 after this pr is merged: https://github.com/influxdata/telegraf/pull/9642. I will go ahead and get a pr ready to make sure it fixes this issue.

sspaink commented 3 years ago

Upgrading doublestar to v4 is trickier than I thought: it depends on the new io/fs package, which isn't straightforward to use and causes more changes than I'd expect.

@hluaces I have a draft pr that does address this issue. If you have time, can you try out the artifacts to see if it works for you? You can find them posted by the telegraf bot here. This change does depend on the pull request to the doublestar project being accepted and merged: https://github.com/bmatcuk/doublestar/pull/57. At the moment the draft pr uses my forked repo.

To test the changes I updated the Dockerfile you provided to copy a local telegraf binary:

FROM centos:7

COPY files/influxdb.repo /etc/yum.repos.d

RUN mkdir -p /var/log/apache2 \
    && adduser apache2 \
    && touch /var/log/apache2/error_log \
    && chmod 711 /var/log/apache2/ \
    && chmod 644 /var/log/apache2/error_log \
    && echo "example,result=ok value=1i" >> /var/log/apache2/error_log \
    && echo "example,result=ok value=2i" >> /var/log/apache2/error_log \
    && echo "example,result=ok value=3i" >> /var/log/apache2/error_log \
    && echo "example,result=error value=3i" >> /var/log/apache2/error_log
RUN adduser telegraf
USER telegraf
COPY files/telegraf.conf /etc/telegraf/telegraf.conf
COPY files/telegraf /usr/bin/telegraf

ENTRYPOINT ["/usr/bin/telegraf", "-config", "/etc/telegraf/telegraf.conf", "-config-directory", "/etc/telegraf/telegraf.d", "--debug", "--test-wait", "10"]

New expected results:

2021-08-20T14:27:48Z I! Starting Telegraf 
2021-08-20T14:27:48Z W! Telegraf is not permitted to read /etc/telegraf/telegraf.d
2021-08-20T14:27:48Z D! [agent] Initializing plugins
2021-08-20T14:27:48Z D! [agent] Starting service inputs
2021-08-20T14:27:48Z E! [inputs.tail] Failed to match for filepath "/var/log/apache2/error_log": open /var/log/apache2: permission denied
2021-08-20T14:27:48Z E! [inputs.tail] Failed to match for filepath "/var/log/apache2/error_log": open /var/log/apache2: permission denied
2021-08-20T14:27:58Z D! [agent] Stopping service inputs
2021-08-20T14:27:58Z D! [agent] Input channel closed
2021-08-20T14:27:58Z D! [agent] Stopped Successfully
2021-08-20T14:27:58Z E! [telegraf] Error running agent: input plugins recorded 2 errors

hluaces commented 3 years ago

I've managed to try that on my end and the error reporting works as expected, thank you very much for your work. I was able to see how telegraf reported the errors exactly as you've shown in your example.

Nevertheless, I'd like to bring attention to the fact that the telegraf user is able to read that file:

bash-4.2$ whoami
telegraf

bash-4.2$ namei -om /var/log/apache2/error_log
f: /var/log/apache2/error_log
 drwxr-xr-x root root /
 drwxr-xr-x root root var
 drwxr-xr-x root root log
 drwx--x--x root root apache2
 -rw-r--r-- root root error_log

bash-4.2$ tail /var/log/apache2/error_log
example,result=ok value=1i
example,result=ok value=2i
example,result=ok value=3i
example,result=error value=3i

As you can see my issue raised two problems:

I suppose that's because, at some point, the underlying library assumes that files inside a directory without read permission cannot be read. That is not the case: a directory with only execute permission still allows reading the files inside it, as long as those files have the proper permissions.
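
One way to handle this case, sketched in Python under stated assumptions (this is not how Telegraf or doublestar is implemented; `has_meta` and `resolve` are hypothetical names): when the configured path contains no glob metacharacters, skip listing the parent directory and check the file directly. A direct check needs only search permission on the directories along the path. The metacharacter set here is simplified to `*`, `?` and `[`.

```python
import glob
import os

# Simplified assumption: only these characters make a path a glob pattern.
GLOB_META = set("*?[")


def has_meta(pattern):
    """Return True if the pattern contains a glob metacharacter."""
    return any(ch in GLOB_META for ch in pattern)


def resolve(pattern):
    """Resolve a configured path to a list of existing files."""
    if not has_meta(pattern):
        # Literal path: check it directly, without listing its directory.
        return [pattern] if os.path.exists(pattern) else []
    # Real pattern: expanding it requires listing directories.
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    print(resolve("/var/log/apache2/error_log"))
```

With a literal path such as the one in the configuration above, this sketch would find the file even when its directory has mode 711, because the directory is never listed.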

Maybe my issue was not clear. I apologize if that was the case.