influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.63k stars 5.58k forks source link

nfsclient panic inside Docker #13979

Closed badnetmask closed 1 year ago

badnetmask commented 1 year ago

Relevant telegraf.conf

[[inputs.nfsclient]]

Logs from Telegraf

Nothing relevant on telegraf.log
2023-09-23T20:51:34Z D! [inputs.nfsclient] using [/proc/self/mountstats] for mountstats
2023-09-23T20:51:34Z D! [inputs.nfsclient] Including all mounts.
2023-09-23T20:51:34Z D! [inputs.nfsclient] Not excluding any mounts.
2023-09-23T20:51:34Z D! [inputs.nfsclient] Including all operations.
2023-09-23T20:51:34Z D! [inputs.nfsclient] Not excluding any operations.
2023-09-23T20:51:34Z D! [agent] Connecting outputs
2023-09-23T20:51:34Z D! [agent] Attempting connection to [outputs.influxdb]

However docker logs show this:
telegraf           | 2023-09-23T20:48:21Z I! Loading config: /etc/telegraf/telegraf.conf
telegraf           | panic: runtime error: invalid memory address or nil pointer dereference
telegraf           | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x26e65a0]

System info

Telegraf 1.27.4, Raspberry Pi OS 64-bit (Debian 11.7), docker-1.5-2, docker-compose-1.25.0-1

Docker

  telegraf:
    container_name: telegraf
    image: telegraf:latest
    command: bash -c "telegraf"
    depends_on:
      - influxdb
    devices:
      - /dev/vchiq:/dev/vchiq
    ports:
      - 998:998
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /var:/host/var:ro
      - /run:/host/run:ro
      - ./etc/telegraf:/etc/telegraf:ro
      - ./var-log/telegraf:/var/log/telegraf
      - /etc/group:/etc/group:ro
      - /usr/bin/vcgencmd:/usr/bin/vcgencmd:ro
      - /lib/aarch64-linux-gnu/libvcos.so.0:/lib/aarch64-linux-gnu/libvcos.so.0:ro
      - /lib/aarch64-linux-gnu/libvchiq_arm.so.0:/lib/aarch64-linux-gnu/libvchiq_arm.so.0:ro
    environment:
      HOST_PROC: /host/proc
      HOST_SYS: /host/sys
      HOST_VAR: /host/var
      HOST_RUN: /host/run
    restart: unless-stopped

Steps to reproduce

  1. Enable the nfsclient plugin
  2. Restart docker

Expected behavior

Data would be collected.

Actual behavior

Container crashes and restarts.

Additional info

$ docker image inspect telegraf
[
    {
        "Id": "sha256:9b643f547eceab158f3542624c60b7cc716e41da222e79181020c4b3fa5a3cd1",
        "RepoTags": [
            "telegraf:latest"
        ],
        "RepoDigests": [
            "telegraf@sha256:6790fa19556e4a19408a5adbabccc635fef46979d0fdf529fb0ad35e1c0fe979"
        ],
        "Parent": "",
        "Comment": "",
        "Created": "2023-09-07T19:38:28.230169016Z",
powersj commented 1 year ago

Hi,

Going to need some additional information.

Can you jump into the container, call telegraf directly please, and restart docker? I would like for you to collect the full stack trace if possible. After hitting the panic you should see a long list of additional logs that point to a line of code that is causing the nil pointer deference.

Restart docker

When you say restart, what are you doing? sudo systemctl restart docker?

restart: unless-stopped

Does telegraf start successfully afterwards?

Thanks

badnetmask commented 1 year ago

When I say "restart" I mean "docker-compose restart telegraf" to pickup the new config. After that, docker restarts the container because of the "unless-stopped", but it keeps looping until I disable the plugin and restart the container again.

Manually running in debug mode from inside the container doesn't really give me anything that wasn't already in the logs (note: I chmod 000 the logs on purpose so it won't write there, and the influxdb creation forbidden is not an issue because the db already exists).

$ docker exec -u telegraf -it telegraf /bin/bash
telegraf@f808db8d4a81:/$ telegraf --debug
2023-09-25T13:51:27Z I! Loading config: /etc/telegraf/telegraf.conf
2023-09-25T13:51:27Z E! Unable to open /var/log/telegraf/telegraf.log (open /var/log/telegraf/telegraf.log: permission denied), using stderr
2023-09-25T13:51:27Z I! Starting Telegraf 1.27.4
2023-09-25T13:51:27Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-09-25T13:51:27Z I! Loaded inputs: cpu disk diskio exec (6x) file kernel mem net netstat nfsclient processes swap system temp
2023-09-25T13:51:27Z I! Loaded aggregators: 
2023-09-25T13:51:27Z I! Loaded processors: 
2023-09-25T13:51:27Z I! Loaded secretstores: 
2023-09-25T13:51:27Z I! Loaded outputs: influxdb
2023-09-25T13:51:27Z I! Tags enabled: host=localhost
2023-09-25T13:51:27Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"localhost", Flush Interval:10s
2023-09-25T13:51:27Z D! [agent] Initializing plugins
2023-09-25T13:51:27Z D! [inputs.nfsclient] using [/proc/self/mountstats] for mountstats
2023-09-25T13:51:27Z D! [inputs.nfsclient] Including all mounts.
2023-09-25T13:51:27Z D! [inputs.nfsclient] Not excluding any mounts.
2023-09-25T13:51:27Z D! [inputs.nfsclient] Including all operations.
2023-09-25T13:51:27Z D! [inputs.nfsclient] Not excluding any operations.
2023-09-25T13:51:27Z W! DeprecationWarning: Value "false" for option "ignore_protocol_stats" of plugin "inputs.net" deprecated since version 1.27.3 and will be removed in 1.36.0: use the 'inputs.nstat' plugin instead
2023-09-25T13:51:27Z D! [agent] Connecting outputs
2023-09-25T13:51:27Z D! [agent] Attempting connection to [outputs.influxdb]
2023-09-25T13:51:27Z W! [outputs.influxdb] When writing to [https://<ip>:8086]: database "telegraf" creation failed: 403 Forbidden
2023-09-25T13:51:27Z D! [agent] Successfully connected to outputs.influxdb
2023-09-25T13:51:27Z D! [agent] Starting service inputs
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x26e65a0]
powersj commented 1 year ago

Two issues:

2023-09-25T13:51:27Z D! [inputs.nfsclient] using [/proc/self/mountstats] for mountstats

The first issue is that file most likely does not exist. As you mounted proc to /host/proc, the file you are trying to monitor will be: /host/proc/self/mountstats. That can be set by adding another environment variable:

MOUNT_PROC: /host/proc/self/mountstats

The next issue is that will probably get mounted with only read permissions for root:

❯ docker exec telegraf ls -la /host/proc/self/mountstats
-r-------- 1 root root 0 Sep 25 14:47 /host/proc/self/mountstats

The default user for our docker images is telegraf, not root. I would have expected an error not a panic as well.

My new PR will show the following for the wrong permissions:

telegraf  | 2023-09-25T15:08:00Z E! [inputs.nfsclient] Error in plugin: open /host/proc/self/mountstats: permission denied

and the following for a file that does not exist:

telegraf  | 2023-09-25T15:08:50Z E! [inputs.nfsclient] Error in plugin: stat /host/proc/self/mountstat: no such file or directory

I have also added info about MOUNT_PROC to the readme.

badnetmask commented 1 year ago

Okay, adding MOUNT_PROC makes it look at the correct file, but how do you suggest I fix the permissions problem then, since it's set by the OS automatically?

powersj commented 1 year ago

how do you suggest I fix the permissions problem then, since it's set by the OS automatically?

Since you are mounting the files read-only you cannot change the permissions on them, so you are left with running telegraf as root.

edit: I know that is far from idea and should be discouraged wherever possible. This really is a question about unix permissions at this point involving how docker mounts things into the container and the user the telegraf service runs as.

badnetmask commented 1 year ago

I understand what you're saying about the permissions, and it makes sense, but there is something else here, since I can cat the file as the telegraf user, and it can see the NFS mount.

$ docker exec -u telegraf telegraf env | grep MOUNT
MOUNT_PROC=/host/proc/self/mountstats
$ docker exec -u telegraf telegraf cat /host/proc/self/mountstats
(...)
device /dev/root mounted on /etc/group with fstype ext4
device 192.168.x.x:/volume/drive/path/telegraf mounted on /etc/telegraf with fstype nfs4 statvers=1.1
    opts:   rw,sync,vers=4.0,rsize=131072,wsize=131072,namlen=143,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.x.x,local_lock=none
(...)

So, what could be missing here?

powersj commented 1 year ago

I think some docker foo is tripping you up now:

❯  docker exec telegraf ls -la /host/proc/self/mountstats
-r-------- 1 root root 0 Sep 25 17:16 /host/proc/self/mountstats
❯  docker exec -u telegraf telegraf ls -la /host/proc/self/mountstats
-r-------- 1 telegraf telegraf 0 Sep 25 17:16 /host/proc/self/mountstats

As far as the telegraf process is concerned that file is owned by root:root.