prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0
11.18k stars, 2.36k forks

expose scheduled shutdown times #3110

Open anarcat opened 1 month ago

anarcat commented 1 month ago

I've been struggling with porting a monitoring check from Nagios to Prometheus. What it does is raise a flag if there's a shutdown scheduled on a server. It does this through this horrendous NRPE check:

command[dsa2_shutdown]=if /usr/lib/nagios/plugins/check_procs -w 1: -u root -C shutdown > /dev/null || /usr/lib/nagios/plugins/check_procs -w 1: -u root -a /lib/systemd/systemd-shutdownd > /dev/null || ( busctl get-property org.freedesktop.login1 /org/freedesktop/login1 org.freedesktop.login1.Manager ScheduledShutdown 2> /dev/null | sed 's/[^"]*"//;s/".*//' | grep -v dry- | grep . ); then echo 'system-in-shutdown'; else echo 'no shutdown running' ; fi

i hope you can unsee this one day.

we can probably get rid of all the check_procs stuff there and assume systemd (at least that's what we're asserting), which turns this into something like:

busctl get-property org.freedesktop.login1 /org/freedesktop/login1 org.freedesktop.login1.Manager ScheduledShutdown

and in fact, I wrote a Python script that would extract a metric out of that nicely:

#!/usr/bin/python3

import logging
import shlex
from subprocess import CalledProcessError, PIPE, run

def test_parse_dbus():
    no_sched = '(st) "" 18446744073709551615'
    assert parse_dbus(no_sched) == ("", 0)
    sched_reboot = '(st) "reboot" 1725477267406843'
    assert parse_dbus(sched_reboot) == ("reboot", 1725477267.406843)
    sched_reboot_round = '(st) "reboot" 1725477267506843'
    assert parse_dbus(sched_reboot_round) == ("reboot", 1725477267.506843)
    # theoretical: i've seen the metric "0" with the label "suspend"
    # before adding this test. i couldn't reproduce by suspending my
    # laptop, so i'm not sure wtf happened there.
    sched_suspend = '(st) "suspend" 0'
    assert parse_dbus(sched_suspend) == ("", 0)
    garbage = '(st) "reboot" 1725477267506843 jfdklafjds'
    assert parse_dbus(garbage) == ("", 0)
    assert parse_dbus("(st) ...") == ("", 0)
    assert parse_dbus("") == ("", 0)

def parse_dbus(output: str) -> tuple[str, float]:
    logging.debug("parsing DBus output: %s", output)
    try:
        _, kind, timestamp_str = output.split(maxsplit=2)
    except ValueError as exc:
        logging.warning("could not parse DBus output: %r (%s)", output, exc)
        return "", 0
    kind = kind.replace('"', "")
    try:
        timestamp = int(timestamp_str) / 1000000
    except ValueError as exc:
        logging.warning(
            "could not parse DBus timestamp: %r (%s)",
            timestamp_str,
            exc,
        )
        return "", 0
    logging.debug("found kind %r, timestamp %r", kind, timestamp)
    if kind and timestamp:
        return kind, timestamp
    else:
        return "", 0

def main():
    cmd = shlex.split(
        "busctl get-property org.freedesktop.login1 /org/freedesktop/login1 org.freedesktop.login1.Manager ScheduledShutdown"  # noqa: E501
    )
    try:
        proc = run(cmd, check=True, stdout=PIPE, encoding="ascii")
    except (CalledProcessError, OSError) as exc:  # OSError: busctl missing
        logging.warning("could not call command %r: %s", shlex.join(cmd), exc)
        kind, timestamp = "", 0
    else:
        kind, timestamp = parse_dbus(proc.stdout)
    print("# HELP node_shutdown_scheduled_timestamp_seconds time of the next scheduled reboot, or zero")
    print("# TYPE node_shutdown_scheduled_timestamp_seconds gauge")
    if timestamp:
        print(
            'node_shutdown_scheduled_timestamp_seconds{kind="%s"} %s' % (kind, timestamp)
        )
    else:
        print("node_shutdown_scheduled_timestamp_seconds 0")

if __name__ == "__main__":
    main()
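
one detail worth spelling out: the Prometheus text exposition format wants label values double-quoted, with backslash, double quote and newline escaped. a minimal sketch of a helper for rendering one sample line follows; the names escape_label_value and format_sample are hypothetical, not part of the script above or of any library:

```python
from typing import Optional


def escape_label_value(value: str) -> str:
    # Label values must escape backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, value: float, labels: Optional[dict] = None) -> str:
    # Without labels the line is just "<name> <value>".
    if not labels:
        return f"{name} {value}"
    # With labels, render them as key="value" pairs in a stable order.
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


if __name__ == "__main__":
    print(format_sample("node_shutdown_scheduled_timestamp_seconds", 0))
    print(
        format_sample(
            "node_shutdown_scheduled_timestamp_seconds", 1.5, {"kind": "reboot"}
        )
    )
```

the "kind" values seen from logind so far are plain words like "reboot", so the escaping is mostly defensive.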

the problem is there's nowhere to call this thing from: shutdown(8) doesn't have any post hooks, and i don't think systemd will fire any specific service when a shutdown is scheduled... there is some state exposed over dbus though, namely the ScheduledShutdown property, which we can get with:

busctl get-property org.freedesktop.login1 /org/freedesktop/login1 org.freedesktop.login1.Manager ScheduledShutdown

... which is essentially what we're doing above.
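
for reference, the "(st)" in that output is the D-Bus type signature: a struct holding a string (the kind of shutdown) and an unsigned 64-bit integer (the time, in microseconds since the epoch, going by the division in the script). the 18446744073709551615 in the test cases is 2**64 - 1, which shows up when nothing is scheduled. a minimal sketch of turning the pair into something readable, assuming those semantics; the describe function and the USEC_INFINITY name are made up for illustration:

```python
from datetime import datetime, timezone

# Value observed when no shutdown is scheduled (2**64 - 1).
USEC_INFINITY = 2**64 - 1


def describe(kind: str, usec: int) -> str:
    # Treat an empty kind, a zero timestamp, or the "infinity" sentinel
    # as "nothing scheduled", like parse_dbus() does.
    if not kind or usec in (0, USEC_INFINITY):
        return "no shutdown scheduled"
    # The timestamp is assumed to be microseconds since the Unix epoch.
    when = datetime.fromtimestamp(usec / 1_000_000, tz=timezone.utc)
    return f"{kind} scheduled for {when.isoformat()}"


if __name__ == "__main__":
    print(describe("", USEC_INFINITY))
    print(describe("reboot", 1725477267406843))
```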

But i figured a better place to do this would be in the node exporter itself, since it's already a daemon just sitting there.

SuperQ commented 1 month ago

Getting that property should be reasonably easy to do in the systemd collector.

SuperQ commented 1 month ago

I created https://github.com/prometheus/node_exporter/pull/3111 as a draft. It doesn't work. I don't think the dbus API we have supports that generic call.

anarcat commented 1 month ago

awesome work, thanks! i've followed up there.