influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

smart plugin using docker #7743

Closed no-clu closed 2 years ago

no-clu commented 4 years ago

There seems to be an issue with using smart plugin when used inside a container. I've set the environmental parameters and mounts as per the FAQ - docs/FAQ.md

With the plugin simply enabled in the config file, I get a "smartctl not found" error. If I point it at the hostfs bind mount, i.e. "/hostfs/usr/sbin" where smartctl is located, I get an error about the GLIBC version instead.

config:

# # Read metrics from storage devices supporting S.M.A.R.T.
[[inputs.smart]]
#   ## Optionally specify the path to the smartctl executable
    path = "hostfs/usr/sbin/smartctl"

The version on the host is 2.28, and the version in the container is 2.24. The minimum version required by smartctl is 2.27 according to the error output.

2020-06-25T15:59:20Z E! [inputs.smart] Error in plugin: failed to run command 'hostfs/usr/sbin/smartctl --scan': exit status 1 - hostfs/usr/sbin/smartctl: /lib/aarch64-linux-gnu/libc.so.6: version 'GLIBC_2.27' not found (required by hostfs/usr/sbin/smartctl)

I've tried numerous mappings of volumes/binds/mounts with no luck.

danielnelson commented 4 years ago

I'm guessing, but this looks like the docker engine host has a newer OS than the container, and smartctl was compiled against a later libc version. You could try using smartctl from the container OS, for testing you can start the container with:

docker run telegraf /bin/sh -c 'apt-get update && apt-get install -y smartmontools && telegraf'

If that works, a more permanent solution is shown on this page under Install Additional Packages.
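The libc mismatch described above can be checked directly before trying other mounts. The sketch below compares two dotted version strings; the function names `version_ge` and `can_run_host_binary` are made up for illustration, and it assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
# Assumption: `sort -V` (version sort) is available in this environment.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# can_run_host_binary CONTAINER_GLIBC REQUIRED_GLIBC
# Reports whether the container's glibc is new enough for a binary that
# asks for the given GLIBC_x.y symbol version.
can_run_host_binary() {
    if version_ge "$1" "$2"; then
        echo "ok: container glibc $1 satisfies GLIBC_$2"
    else
        echo "mismatch: container glibc $1 is older than GLIBC_$2"
    fi
}
```

With the numbers from this report, `can_run_host_binary 2.24 2.27` reports a mismatch. On glibc-based systems the installed version can usually be read with `getconf GNU_LIBC_VERSION`, run once on the host and once inside the container.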

no-clu commented 4 years ago

Thanks for the reply. I have tried installing smartmontools in the running container using EXEC, but just got other issues whereby the disks on the host are then not found when scanning.

To replicate: start the Docker container with the environment variables and mounts as per the FAQ. Then, using the Execute utility in Portainer, run the following:

apt update
apt install smartmontools

This all goes fine. But then if I run smartctl it cannot find the disks.

smartctl --scan

scan_smart_devices: glob(3) aborted matching pattern /dev/discs/disc*

Having looked up the error, it seems that smartctl cannot find any /dev/sd*[a-z]. If I try specifying the disk as it is mounted, still no joy:

smartctl /hostfs/dev/sda

smartctl 6.6 2016-05-31 r4324 [aarch64-linux-5.4.45-rockchip64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

/hostfs/dev/sda: Unable to detect device type
Please specify device type with the -d option.

Specifying the disk type gives yet another error

smartctl -d ata /hostfs/dev/sda

smartctl 6.6 2016-05-31 r4324 [aarch64-linux-5.4.45-rockchip64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /hostfs/dev/sda failed: Operation not permitted

Here is df from within the container

df -h

Filesystem      Size  Used Avail Use% Mounted on
overlay         293G  3.1G  275G   2% /
tmpfs            64M     0   64M   0% /dev
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
/dev/sdb1       293G  3.1G  275G   2% /hostfs
udev            1.9G     0  1.9G   0% /hostfs/dev
tmpfs           1.9G     0  1.9G   0% /hostfs/dev/shm
tmpfs           387M   21M  367M   6% /hostfs/run
tmpfs           5.0M  4.0K  5.0M   1% /hostfs/run/lock
tmpfs           1.9G     0  1.9G   0% /hostfs/sys/fs/cgroup
tmpfs           1.9G  4.0K  1.9G   1% /hostfs/tmp
/dev/sda1       3.6T   89M  3.6T   1% /hostfs/srv/dev-disk-by-label-storage1
/dev/mmcblk1p1  3.4G   60M  3.3G   2% /hostfs/media/mmcboot
overlay         293G  3.1G  275G   2% /hostfs/var/lib/docker/overlay2/1476fdd0a24ce239a79aeeef55a0ca724262c5eec442fe7ef922e091e3a96c39/merged
overlay         293G  3.1G  275G   2% /hostfs/var/lib/docker/overlay2/d2720f226447273a9d519c1067265add25f0b7b7a2d6665a4fbba448f657e64c/merged
overlay         293G  3.1G  275G   2% /hostfs/var/lib/docker/overlay2/a4e1d0ef6f236150786ab92bd7c1b694fef3f098065855dfb5088c219f3435b5/merged
tmpfs           1.9G     0  1.9G   0% /proc/asound
tmpfs           1.9G     0  1.9G   0% /sys/firmware
tmpfs           387M     0  387M   0% /hostfs/var/lib/docker/overlay2/fa49a208d66a0cd0e63cfb9a6baaa39a481fe6b9a659742e60e34b310664c3e0/merged/hostfs/run/user/1000

Despite /dev/sda1 being mounted, ls on /dev in the container does not list the disks:

ls -l

total 0
lrwxrwxrwx 1 root root   13 Jun 26 10:16 fd -> /proc/self/fd
crw-rw-rw- 1 root root 1, 7 Jun 26 10:16 full
drwxrwxrwt 2 root root   40 Jun 26 10:16 mqueue
crw-rw-rw- 1 root root 1, 3 Jun 26 10:16 null
lrwxrwxrwx 1 root root    8 Jun 26 10:16 ptmx -> pts/ptmx
drwxr-xr-x 2 root root    0 Jun 26 10:16 pts
crw-rw-rw- 1 root root 1, 8 Jun 26 10:16 random
drwxrwxrwt 2 root root   40 Jun 26 10:16 shm
lrwxrwxrwx 1 root root   15 Jun 26 10:16 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root   15 Jun 26 10:16 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root   15 Jun 26 10:16 stdout -> /proc/self/fd/1
crw-rw-rw- 1 root root 5, 0 Jun 26 10:16 tty
crw-rw-rw- 1 root root 1, 9 Jun 26 10:16 urandom
crw-rw-rw- 1 root root 1, 5 Jun 26 10:16 zero

Thanks

danielnelson commented 4 years ago

Maybe it will help if you start the container with --privileged?

no-clu commented 4 years ago

Thanks @danielnelson

Okay, progress. Running with --privileged now allows me to run smartctl -a /dev/sd[a-z]. I had tried this before, but with so many options tried it must not have been with the right configuration, as it hadn't worked.

Anyway, after starting the container with --privileged I did apt update and apt install smartmontools from EXEC in the container. I can then run smartctl -a /dev/sd[a-z], which shows me the information expected, good news. Stopping and then starting the container (or restarting) sometimes retains smartctl (smartmontools) and sometimes it doesn't; this confuses me somewhat.
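For anyone who would rather not run the whole container with --privileged, a narrower grant may be enough: pass only the disks through with `--device` and add the `SYS_RAWIO` capability. This is a sketch under assumptions, not something tested in this thread. The helper name `smart_docker_args` is made up, and whether `SYS_RAWIO` is sufficient depends on the drive type (NVMe devices may need more).

```shell
#!/bin/sh
# smart_docker_args DEV...: print `docker run` flags that expose only the
# named block devices plus the SYS_RAWIO capability, as an alternative to
# --privileged. Assumption: SYS_RAWIO is enough for the drives in question.
smart_docker_args() {
    printf '%s' '--cap-add SYS_RAWIO'
    for dev in "$@"; do
        printf ' --device %s' "$dev"
    done
    printf '\n'
}
```

The output can be spliced into a run command, for example `docker run $(smart_docker_args /dev/sda /dev/sdb) telegraf`. Inside the container the disks then appear under /dev rather than /hostfs/dev.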

However I still have problems. If I check the telegraf database USE telegraf and then SELECT * FROM smart_device limit 25 the entries have only timestamp, device, exit status and host. Nothing else at all.

All entries in the database have exit status = 2, for which smartctl manual says this:

Bit 2: Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure (see '-b' option above).

The log currently gives me no errors, anyone know how to show more info/debug?
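A note on reading that value: smartctl's exit status is a bitmask with bits counted from 0, so a stored value of 2 has only bit 1 set (bit 2 would be the value 4). Going by the return-value list in the smartctl man page, bit 1 means the device could not be opened, which would fit a permissions or device-path problem rather than a failed SMART command. The sketch below decodes a status value. The function name `decode_smartctl_status` is made up for illustration, and the bit descriptions are paraphrased from the man page.

```shell
#!/bin/sh
# decode_smartctl_status STATUS: print one line per bit set in a smartctl
# exit status. Descriptions are paraphrased from the smartctl man page.
decode_smartctl_status() {
    status=$1
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed, no IDENTIFY DEVICE data, or device in low-power mode" \
        "a SMART or other ATA command failed, or checksum error in a SMART data structure" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes found at or below threshold" \
        "DISK OK, but some attributes were at or below threshold in the past" \
        "device error log contains error records" \
        "device self-test log contains error records"
    do
        # Test the current bit of the status value.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}
```

`decode_smartctl_status 2` prints only the bit 1 line. As for more debug output, running `telegraf --config /etc/telegraf/telegraf.conf --input-filter smart --test --debug` inside the container should print what the plugin gathers without writing to InfluxDB.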

I am looking into this further but I'm not sure I'll get this working by myself so any pointers in the meantime would be great. Either way I will post back, success or not.


edit: removed the comment that a restart still allowed smartctl to run. It doesn't. I don't know what happened but I cannot replicate this behaviour.


edit 2: added behaviour back in, a stop and start or a restart sometimes seems to retain smartctl in the container. I cannot explain this!

no-clu commented 4 years ago

A little trial and error. If I stop and restart the container, the install of smartmontools persists. If I re-deploy (using Portainer), this deletes the container and creates a new one, and as such I have to reinstall smartmontools.

Now I've managed to get the smart plugin working, but I had to go through a number of steps to get there, which seems like a lot of configuring when I have the capability on the host to simply run smartctl. Anyway, these were the steps I needed to take.

I'm not happy with the solution overall and will keep looking for an alternative solution. Moreover this telegraf container seems to crash after an hour or so. Happy to post the log but not sure how to best do that?

Here is the start of the log. I cannot tell from it when the crash happened, but I can see from the status in Portainer that the container had been stopped for 9 hours when I checked, so it was only working for a few hours. After the SIGILL: illegal instruction there are lots of messages about github, Go sources and plugins, followed by the r codes. I've no idea about this. Any help appreciated. I'm going to disable the smart plugin for now to see how the container behaves and whether the crash is linked to the smart plugin.

2020-06-29T06:57:16Z I! Starting Telegraf 1.14.4
2020-06-29T06:57:16Z I! Using config file: /etc/telegraf/telegraf.conf
2020-06-29T06:57:16Z I! Loaded inputs: disk mem processes swap system smart temp cpu diskio kernel
2020-06-29T06:57:16Z I! Loaded aggregators:
2020-06-29T06:57:16Z I! Loaded processors:
2020-06-29T06:57:16Z I! Loaded outputs: influxdb
2020-06-29T06:57:16Z I! Tags enabled: host=566e379017cb
2020-06-29T06:57:16Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"566e379017cb", Flush Interval:10s
SIGILL: illegal instruction
PC=0x6db34 m=7 sigcode=1

r0 0x4000a0c0a0 0xffffffffffffffe0 r2 0x40003fdc00 r3 0x0 r4 0x0 [etc...]

danielnelson commented 4 years ago

Thanks for documenting your findings. This is with the linux/arm64/v8 container? To me, the SIGILL: illegal instruction error indicates one of the following: the telegraf binary you have is built for a different architecture (some other ARM flavor), we have done something unsafe, there is a hardware issue, or there is even a bug in the Go compiler. If you attach the full stack trace as a file it might have some clues. It might also be worth trying the other arm containers, or even the Telegraf 1.13.4 container (which was built with Go 1.13).

Can you also attach the output of cat /proc/cpuinfo?
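One quick way to rule the architecture question in or out is to compare what the kernel reports with what the image was built for. The sketch below maps a `uname -m` value to the architecture name Docker uses for images. The function name `docker_arch_for` is made up for illustration, and only the common cases are covered.

```shell
#!/bin/sh
# docker_arch_for MACHINE: map a `uname -m` value to the architecture name
# Docker reports for images. Unlisted values fall through to "unknown".
docker_arch_for() {
    case "$1" in
        x86_64)        echo amd64 ;;
        aarch64|arm64) echo arm64 ;;
        armv7l|armv6l) echo arm ;;
        *)             echo unknown ;;
    esac
}
```

On the Docker host, the result of `docker_arch_for "$(uname -m)"` can be compared with `docker image inspect --format '{{.Architecture}}' telegraf`. If the two differ, the image was pulled for another platform.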

no-clu commented 4 years ago

@danielnelson thanks for your reply. I've attached the output of cat /proc/cpuinfo as a .txt file. cpuinfo.txt

With regard to the container being linux/arm64/v8, I wish I could respond more confidently. All I can say is that I'm using OpenMediaVault with PluginExtras, which enables a simple "click a button" install of Docker and Portainer. I then used the CLI to get my containers up and running, and use Portainer to manage them and check logs. From Portainer I can see that the image for telegraf is as follows. Docker and containers were totally new to me 2 weeks ago, and I have only very minor experience with Linux to date, so please forgive.

telegraf:latest@sha256:e0add6e572b009eb3fa8cd9947ebdf62ab3fed81f306113704bc9b9a0cec89df telegraf version 1.14.4

docker --version

Docker version 19.03.12, build 48a6621

Inspecting the container from Portainer container inspect.txt

Hope that helps. I'll have a look at other arm containers (time to hit the search engines).

guice commented 3 years ago

For what it's worth, I was able to run this without errors today. All I had to do was overwrite the command: on launch:

    telegraf:
        image: telegraf
        privileged: true
        command:
          - /bin/bash
          - -c
          - |
            apt update
            apt install -y smartmontools
            telegraf

Started successfully:

2020-10-31T20:22:05Z I! Starting Telegraf 1.16.1
2020-10-31T20:22:05Z I! Using config file: /etc/telegraf/telegraf.conf
2020-10-31T20:22:05Z I! Loaded inputs: cpu disk diskio docker filecount httpjson influxdb kernel mem net netstat processes smart system
2020-10-31T20:22:05Z I! Loaded aggregators: 
2020-10-31T20:22:05Z I! Loaded processors: 
2020-10-31T20:22:05Z I! Loaded outputs: influxdb

This is running on a Synology DS218+ NAS.

jnaav commented 3 years ago

I tried that solution and it did not work. Please let me know if there is some other (and easier) way to use inputs.smart in docker; I would be very interested. Thanks

akrea commented 2 years ago

Hello

May I come back to this thread. It took me quite a while to get smartctl running inside docker-telegraf. In a nutshell: adding privileged: true to the telegraf service in my docker-compose file helped.

telegraf:
        image: telegraf
        container_name: telegraf
        hostname: telegraf
        networks:
            - default
        privileged: true #for smartctl
        environment:
            - TZ=${TZ}
            - HOST_VAR=/hostfs/var
            - HOST_PROC=/hostfs/proc
            - HOST_SYS=/hostfs/sys
            - HOST_MOUNT_PREFIX=/hostfs
            - HOST_ETC=/hostfs/etc
            - HOST_RUN=/hostfs/run
        links:
            - influxdb
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
            - $APPDATADIR/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
            - /:/hostfs:ro
        depends_on:
            - influxdb
        restart: always

In the container CLI I can query the drives with /hostfs/usr/sbin/smartctl -a /dev/sdX. My problem is that influxdb only shows "exit_status" with the value "2". No other data are fed into influxdb.

My telegraf.conf file looks like:

[[inputs.smart]]
    path_smartctl = "/hostfs/usr/sbin/smartctl"
    read_method = "sequential"

telegraf-tiger[bot] commented 2 years ago

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!

EcceGratum commented 1 year ago

I have been facing the same problem today and got the same errors as in some of the comments but it's now working.

The steps are:

I kept "--privileged" from a previous test; I'm not sure it's necessary, but it doesn't hurt. Docker is running on Lubuntu 22.04 LTS. Data is scraped by Prometheus.

ykrasik commented 1 year ago

Here is how I got this to work:

1. Create a Dockerfile, we will be building a custom image:

    FROM telegraf

    # Update and install smartmontools
    RUN apt-get update && apt-get install -y sudo smartmontools nvme-cli

    # Modify the sudoers file to allow the telegraf user to run smartctl and nvme without a password
    RUN echo 'Cmnd_Alias SMARTCTL = /usr/sbin/smartctl' >> /etc/sudoers && \
        echo 'Cmnd_Alias NVME = /usr/sbin/nvme' >> /etc/sudoers && \
        echo 'telegraf ALL=(ALL) NOPASSWD: SMARTCTL, NVME' >> /etc/sudoers && \
        echo 'Defaults!SMARTCTL !logfile, !syslog, !pam_session' >> /etc/sudoers && \
        echo 'Defaults!NVME !logfile, !syslog, !pam_session' >> /etc/sudoers

2. In `telegraf.conf`:

    [[inputs.smart]]
      ## On most platforms used cli utilities requires root access.
      ## Setting 'use_sudo' to true will make use of sudo to run smartctl or nvme-cli.
      ## Sudo must be configured to allow the telegraf user to run smartctl or nvme-cli
      ## without a password.
      use_sudo = true

      ## Gather all returned S.M.A.R.T. attribute metrics and the detailed
      ## information from each drive into the 'smart_attribute' measurement.
      attributes = true

`attributes = true` will enable Telegraf to read and store SMART attributes, which I think is useful, but not required for the setup to work as a whole.

3. Build and run, I prefer docker-compose:

    version: '3.7'
    services:
      telegraf:
        container_name: telegraf
        build:
          context: /path/to/your/Dockerfile-dir
        restart: unless-stopped
        privileged: true
        volumes:

davidnewhall commented 3 months ago

There's a container built for you already. https://github.com/golift/telegraf-docker Because of this: https://github.com/influxdata/influxdata-docker/issues/563