influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Cannot get `smart` plugin to work with passwordless `sudo` #8690

Closed courtarro closed 3 years ago

courtarro commented 3 years ago

I'm unable to get the smart plugin to work with a locally-built and installed version of smartmontools using sudo. My telegraf runs as its own user (telegraf), and I've got a sudoers clause set up to enable passwordless execution of /usr/local/sbin/smartctl by Telegraf, yet I get an error.

The log entries are visible below. I think the key message is "sudo: unable to change to root gid: Operation not permitted". I don't understand what this means or why it's appearing. When I manually run sudo -u telegraf to impersonate Telegraf, I'm able to run sudo -n /usr/local/sbin/smartctl --scan just fine, no password needed. Any idea what might be wrong with my configuration?

Relevant telegraf.conf:

[[inputs.smart]]
  path = "/usr/local/sbin/smartctl"
  use_sudo = true

Relevant sudoers entries:

Cmnd_Alias SMARTCTL = /usr/local/sbin/smartctl
telegraf  ALL=(ALL) NOPASSWD: SMARTCTL
Defaults!SMARTCTL !logfile, !syslog, !pam_session

System info:

Running Telegraf version 1.13.0-1 for Ubuntu Bionic (18.04)

Expected behavior:

Telegraf runs smartctl and gathers the relevant metrics.

Actual behavior:

Telegraf fails and the following log entries appear in its systemd log:

Jan 13 16:37:30 prismo telegraf[2893]: 2021-01-13T21:37:30Z E! [inputs.smart] Error in plugin: failed to run command '/usr/local/sbin/smartctl --scan': exit status 1 - sudo: unable to change to root gid: Operation not permitted
Jan 13 16:37:30 prismo telegraf[2893]: sudo: unable to initialize policy plugin

p-zak commented 3 years ago

@courtarro can you try with Telegraf 1.17.0?

courtarro commented 3 years ago

@p-zak I just upgraded via the InfluxDB PPA to 1.17.0 and the result is the same, unfortunately.

courtarro commented 3 years ago

For comparison, here's what it looks like when I test from the command line:

root@prismo:/usr/local/sbin# sudo -u telegraf -s
telegraf@prismo:/usr/local/sbin$ whoami
telegraf
telegraf@prismo:/usr/local/sbin$ id
uid=999(telegraf) gid=998(telegraf) groups=998(telegraf)
telegraf@prismo:/usr/local/sbin$ ./smartctl --all /dev/sda
smartctl 7.0 2018-12-30 r5164 [x86_64-linux-4.15.0-129-generic] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sda failed: Permission denied
telegraf@prismo:/usr/local/sbin$ sudo ./smartctl --all /dev/sda
(WORKS)

KubaTrojan commented 3 years ago

I have tried to reproduce this behaviour and here is what I got:

System info: Ubuntu 18.04 (bionic), bare-metal
Telegraf version: 1.17.0

  1. Create a new telegraf_test user: sudo adduser telegraf_test
  2. Add the entry below to the bottom of the sudoers file: sudo visudo
    Cmnd_Alias SMARTCTL = /usr/sbin/smartctl
    telegraf_test  ALL=(ALL) NOPASSWD: SMARTCTL
    Defaults!SMARTCTL !logfile, !syslog, !pam_session
  3. Switch to the new user: sudo su telegraf_test
  4. Run the id command and observe the output: id
    uid=1003(telegraf_test) gid=1003(telegraf_test) groups=1003(telegraf_test)
  5. Run smartctl as the telegraf_test user and observe the output:
    telegraf_test@XXX:/home/XXX/telegraf$ /usr/sbin/smartctl --scan
    /dev/sda -d scsi # /dev/sda, SCSI device
    telegraf_test@XXX:/home/XXX/telegraf$ sudo /usr/sbin/smartctl --scan
    /dev/sda -d scsi # /dev/sda, SCSI device
  6. Configure the smart plugin in the Telegraf configuration:
    [[inputs.smart]]
    path = "/usr/sbin/smartctl"
    use_sudo = true
  7. Run Telegraf as the telegraf_test user and observe the output:
    telegraf_test@XXX:/home/XXX/telegraf$ ./telegraf --config=telegraf.conf --test
    2021-01-18T11:57:18Z I! Starting Telegraf 
    2021-01-18T11:57:18Z D! [agent] Initializing plugins
    2021-01-18T11:57:18Z D! [agent] Starting service inputs
    boot_time=1607077935i,context_switches=8447125900i,entropy_avail=3027i,interrupts=2862299797i,processes_forked=1921046i 1610971038000000000
    2021-01-18T11:57:18Z D! [agent] Stopping service inputs
    2021-01-18T11:57:18Z D! [agent] Input channel closed
    2021-01-18T11:57:18Z D! [agent] Stopped Successfully
    > smart_device,capacity=512110190592,device=sda,enabled=Enabled,host=XXX,model=XXX,serial_no=XXX,wwn=XXX exit_status=0i,health_ok=true,temp_c=22i,udma_crc_errors=0i 1610971038000000000
  8. Remove the entries added in step 2.
  9. Run smartctl as the telegraf_test user and observe the output:
    telegraf_test@XXX:/home/XXX/telegraf$ /usr/sbin/smartctl --scan
    /dev/sda -d scsi # /dev/sda, SCSI device
    telegraf_test@XXX:/home/XXX/telegraf$ sudo /usr/sbin/smartctl --scan
    [sudo] password for telegraf_test: 
    telegraf_test is not in the sudoers file.  This incident will be reported.
  10. Run Telegraf as the telegraf_test user and observe the output:
    telegraf_test@XXX:/home/XXX/telegraf$ ./telegraf --config=telegraf.conf --test
    2021-01-18T13:45:37Z I! Starting Telegraf 
    2021-01-18T13:45:37Z D! [agent] Initializing plugins
    2021-01-18T13:45:37Z D! [agent] Starting service inputs
    boot_time=1607077935i,context_switches=8454885756i,entropy_avail=2214i,interrupts=2865442312i,processes_forked=1924482i 1610977537000000000
    2021-01-18T13:45:37Z E! [inputs.smart] Error in plugin: failed to run command '/usr/sbin/smartctl [--scan]': exit status 1 - sudo: a password is required
    2021-01-18T13:45:37Z D! [agent] Stopping service inputs
    2021-01-18T13:45:37Z D! [agent] Input channel closed
    2021-01-18T13:45:37Z D! [agent] Stopped Successfully
    2021-01-18T13:45:37Z E! [telegraf] Error running agent: input plugins recorded 1 errors

    As you can see, I could not reproduce the permission problem. Could you follow the same steps and post your inputs and outputs? Also, let me know whether your sudoers file contains any entries for the telegraf group.

p-zak commented 3 years ago

@courtarro Did you have time to check this?

courtarro commented 3 years ago

@p-zak Okay, I finally nailed this down. It's an esoteric issue related to a change made per the ping input readme. I'm also using that input plugin, and to do so with the "native" option, I added the suggested lines to a systemd override file:

[Service]
CapabilityBoundingSet=CAP_NET_RAW
AmbientCapabilities=CAP_NET_RAW

If I understand correctly, this suggestion is not ideal. CapabilityBoundingSet applies to the Telegraf service and to every process it spawns, so with only CAP_NET_RAW in the set, sudo cannot acquire the capabilities it needs to switch to root (hence "unable to change to root gid"). As a result, Telegraf cannot use sudo to run the smartctl command.

Instead, I changed the override file for ping to:

[Service]
AmbientCapabilities=CAP_NET_RAW

This enables the ping input plugin to do its job, while not limiting the sudo command used by the smart plugin. So now everything works. I suggest updating the ping readme to use this particular configuration instead of the current one (remove the CapabilityBoundingSet clause).
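To check whether a bounding set is what blocks sudo, the CapBnd mask of the running Telegraf process can be inspected. The following is a minimal sketch, not something from the Telegraf docs: the function names are hypothetical, and it assumes the standard Linux capability numbers (CAP_SETGID = 6, CAP_SETUID = 7, CAP_NET_RAW = 13) and that sudo needs the first two to switch to root.

```shell
# cap_in_mask MASK BIT
# Print "yes" if capability number BIT is set in the hex MASK, else "no".
cap_in_mask() {
    if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

# bounding_set_of PID
# Print the CapBnd hex mask from /proc/PID/status.
bounding_set_of() {
    awk '/^CapBnd:/ { print $2 }' "/proc/$1/status"
}

# check_sudo_caps MASK
# Report whether the capabilities discussed in this thread are in MASK.
check_sudo_caps() {
    echo "CAP_SETGID (6): $(cap_in_mask "$1" 6)"
    echo "CAP_SETUID (7): $(cap_in_mask "$1" 7)"
    echo "CAP_NET_RAW (13): $(cap_in_mask "$1" 13)"
}
```

A possible invocation, assuming a single Telegraf process, is `check_sudo_caps "$(bounding_set_of "$(pidof -s telegraf)")"`. With CapabilityBoundingSet=CAP_NET_RAW the mask should contain only bit 13, so the first two lines would report "no". If libcap's capsh is installed, `capsh --decode=MASK` prints the same mask as capability names.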

zeus86 commented 5 months ago

For future readers: other systemd limits can interfere with your config as well. In my case it was the DynamicUser=yes statement, which enables some sandboxing, including NoNewPrivileges=true. Overriding this via systemctl edit won't actually do anything; you need to use systemctl edit --full so that systemd places a new copy of the unit in /etc.
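A quick way to see whether these settings are active on a unit is to filter the output of `systemctl show`. This is a rough sketch with a hypothetical helper name. It assumes the unit is called telegraf.service and that systemctl prints these boolean properties as yes or no.

```shell
# risky_sandbox_settings
# Read KEY=VALUE lines (as printed by `systemctl show`) on stdin and
# print a warning for each sandbox setting from this thread that is on.
risky_sandbox_settings() {
    while IFS='=' read -r key value; do
        case "$key=$value" in
            NoNewPrivileges=yes|DynamicUser=yes)
                echo "$key is enabled; sudo is unlikely to gain privileges" ;;
        esac
    done
}
```

For example: `systemctl show -p NoNewPrivileges -p DynamicUser telegraf.service | risky_sandbox_settings`. No output means neither setting is enabled on the unit as systemd currently sees it.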