prometheus-community / ansible

Ansible Collection for Prometheus
https://prometheus-community.github.io/ansible/
Apache License 2.0

node_exporter systemd unit file incorrectly formatted when using sysctl.include collector #341

Open 0xdeadbeefJERKY opened 4 months ago

0xdeadbeefJERKY commented 4 months ago

Bug Summary

Installing node exporter on an EC2 instance configured with the Amazon Linux 2 AMI (systemd 219) fails:

TASK [prometheus.prometheus.node_exporter : Ensure Node Exporter is enabled on boot] ***
fatal: [default]: FAILED! => {"changed": false, "msg": "Error loading unit file 'node_exporter': org.freedesktop.DBus.Error.InvalidArgs \"Invalid argument\""}

Here's the playbook being used:

- hosts: 127.0.0.1
  vars:
    node_exporter_enabled_collectors:
      - sysctl:
          include:
            vm:
              - overcommit_memory
              - overcommit_ratio
              - dirty_background_bytes
              - dirty_background_bytes
              - dirty_background_ratio
              - dirty_bytes
              - dirty_expire_centisecs
              - dirty_ratio
              - swappiness
  roles:
    - prometheus.prometheus.node_exporter
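To see why this playbook produces a problematic flag, here is a minimal Python sketch (not the role's actual Jinja2 code) of how a nested collector option ends up on the ExecStart line: interpolating a dict into a string yields its Python repr, which contains single quotes.

```python
# Hypothetical rendering of the collector list above; the role's real
# template logic lives in node_exporter.service.j2, this only mirrors
# the observed output.
collectors = [{"sysctl": {"include": {"vm": ["overcommit_memory", "swappiness"]}}}]

flags = []
for collector in collectors:
    for name, opts in collector.items():
        for key, value in opts.items():
            # str() of a dict is its Python repr, full of single quotes
            flags.append(f"--collector.{name}.{key}={value}")

print(flags[0])
# --collector.sysctl.include={'vm': ['overcommit_memory', 'swappiness']}
```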

Upon further investigation, the systemd unit file is malformed because the template wraps the sysctl.include collector argument in single quotes while the rendered value itself contains single quotes:

#
# Ansible managed
#

[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
Type=simple
User=node-exp
Group=node-exp
ExecStart=/usr/local/bin/node_exporter \
    '--collector.sysctl.include={'vm': ['overcommit_memory', 'overcommit_ratio', 'dirty_background_bytes', 'dirty_background_bytes', 'dirty_background_ratio', 'dirty_bytes', 'dirty_expire_centisecs', 'dirty_ratio', 'swappiness']}' 

SyslogIdentifier=node_exporter
Restart=always
RestartSec=1
StartLimitInterval=0

ProtectHome=yes
NoNewPrivileges=yes

ProtectSystem=full

[Install]
WantedBy=multi-user.target

On line 14 of the unit file, the single-quoted argument is terminated prematurely at the first embedded single quote (before 'vm'). More details can be found by checking the status of the service or using journalctl:

$ sudo systemctl status node_exporter
● node_exporter.service - Prometheus Node Exporter
   Loaded: error (Reason: Invalid argument)
   Active: failed (Result: resources) since Wed 2024-04-24 15:37:55 UTC; 20s ago
 Main PID: 2443 (code=killed, signal=KILL)

Apr 24 15:37:54 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service: main process exited, code=killed, status=9/KILL
Apr 24 15:37:54 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: Unit node_exporter.service entered failed state.
Apr 24 15:37:54 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service failed.
Apr 24 15:37:55 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service holdoff time over, scheduling restart.
Apr 24 15:37:55 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service failed to schedule restart job: Unit is not loaded properly: Invalid argument.
Apr 24 15:37:55 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: Unit node_exporter.service entered failed state.
Apr 24 15:37:55 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service failed.
Apr 24 15:38:12 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: [/etc/systemd/system/node_exporter.service:13] Trailing garbage, ignoring.
Apr 24 15:38:12 ip-10-66-137-116.us-west-2.compute.internal systemd[1]: node_exporter.service lacks both ExecStart= and ExecStop= setting. Refusing.

Proposed Solution

This can be fixed by wrapping each collector argument passed to node_exporter in double quotes in the node_exporter.service.j2 template, since the rendered values contain single quotes but no double quotes.

#
# Ansible managed
#

[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
Type=simple
User=node-exp
Group=node-exp
ExecStart=/usr/local/bin/node_exporter \
    "--collector.sysctl.include={'vm': ['overcommit_memory', 'overcommit_ratio', 'dirty_background_bytes', 'dirty_background_bytes', 'dirty_background_ratio', 'dirty_bytes', 'dirty_expire_centisecs', 'dirty_ratio', 'swappiness']}" \

SyslogIdentifier=node_exporter
Restart=always
RestartSec=1
StartLimitInterval=0

ProtectHome=yes
NoNewPrivileges=yes

ProtectSystem=full

[Install]
WantedBy=multi-user.target
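A quick sanity check of the double-quoted form, again using Python's shlex as a rough approximation of systemd's quote handling: inside double quotes the inner single quotes are literal, so the whole flag survives as a single argument.

```python
import shlex

value = "--collector.sysctl.include={'vm': ['overcommit_memory', 'swappiness']}"

# Double quotes keep the embedded single quotes literal,
# so the flag parses as exactly one argument.
args = shlex.split(f'"{value}"')
print(args == [value])  # True
```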
VermiumSifell commented 3 weeks ago

I'm affected too. Did you manage to solve it without manual intervention?