prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

[node_exporter][metric] node_systemd_unit_state with labels: "high-level unit activation state" and "low-level unit activation state" #1440

Open fchiorascu opened 5 years ago

fchiorascu commented 5 years ago

Host operating system: output of uname -a

Linux server01 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.18.1 (branch: HEAD, revision: 3db77732e925c08f675d7404a8c46466b2ece83e)
  build user:       root@b50852a1acba
  build date:       20190604-16:41:18
  go version:       go1.12.5

node_exporter command line flags

usage: node_exporter [<flags>]
Flags:
  -h, --help                    Show context-sensitive help (also try --help-long and --help-man).
      --collector.diskstats.ignored-devices="^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p)\\d+$"
                                Regexp of devices to ignore for diskstats.
      --collector.filesystem.ignored-mount-points="^/(dev|proc|sys|var/lib/docker/.+)($|/)"
                                Regexp of mount points to ignore for filesystem collector.
      --collector.filesystem.ignored-fs-types="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
                                Regexp of filesystem types to ignore for filesystem collector.
      --collector.netclass.ignored-devices="^$"
                                Regexp of net devices to ignore for netclass collector.
      --collector.netdev.ignored-devices="^$"
                                Regexp of net devices to ignore for netdev collector.
      --collector.netstat.fields="^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(Listen.*|Syncookies.*|TCPSynRetrans)|Tcp_(ActiveOpens|InSegs|OutSegs|PassiveOpens|RetransSegs|CurrEstab)|Udp6?_(InDatagrams|OutDatagrams|NoPorts))$"
                                Regexp of fields to return for netstat collector.
      --collector.ntp.server="127.0.0.1"
                                NTP server to use for ntp collector
      --collector.ntp.protocol-version=4
                                NTP protocol version
      --collector.ntp.server-is-local
                                Certify that collector.ntp.server address is the same local host as this collector.
      --collector.ntp.ip-ttl=1  IP TTL to use while sending NTP query
      --collector.ntp.max-distance=3.46608s
                                Max accumulated distance to the root
      --collector.ntp.local-offset-tolerance=1ms
                                Offset between local clock and local ntpd time to tolerate
      --path.procfs="/proc"     procfs mountpoint.
      --path.sysfs="/sys"       sysfs mountpoint.
      --path.rootfs="/"         rootfs mountpoint.
      --collector.qdisc.fixtures=""
                                test fixtures to use for qdisc collector end-to-end testing
      --collector.runit.servicedir="/etc/service"
                                Path to runit service directory.
      --collector.supervisord.url="http://localhost:9001/RPC2"
                                XML RPC endpoint.
      --collector.systemd.unit-whitelist=".+"
                                Regexp of systemd units to whitelist. Units must both match whitelist and not match blacklist to be included.
      --collector.systemd.unit-blacklist=".+\\.(automount|device|mount|scope|slice)"
                                Regexp of systemd units to blacklist. Units must both match whitelist and not match blacklist to be included.
      --collector.systemd.private
                                Establish a private, direct connection to systemd without dbus.
      --collector.systemd.enable-task-metrics
                                Enables service unit tasks metrics unit_tasks_current and unit_tasks_max
      --collector.systemd.enable-restarts-metrics
                                Enables service unit metric service_restart_total
      --collector.systemd.enable-start-time-metrics
                                Enables service unit metric unit_start_time_seconds
      --collector.textfile.directory=""
                                Directory to read text files with metrics from.
      --collector.vmstat.fields="^(oom_kill|pgpg|pswp|pg.*fault).*"
                                Regexp of fields to return for vmstat collector.
      --collector.wifi.fixtures=""
                                test fixtures to use for wifi collector metrics
      --collector.arp           Enable the arp collector (default: enabled).
      --collector.bcache        Enable the bcache collector (default: enabled).
      --collector.bonding       Enable the bonding collector (default: enabled).
      --collector.buddyinfo     Enable the buddyinfo collector (default: disabled).
      --collector.conntrack     Enable the conntrack collector (default: enabled).
      --collector.cpu           Enable the cpu collector (default: enabled).
      --collector.cpufreq       Enable the cpufreq collector (default: enabled).
      --collector.diskstats     Enable the diskstats collector (default: enabled).
      --collector.drbd          Enable the drbd collector (default: disabled).
      --collector.edac          Enable the edac collector (default: enabled).
      --collector.entropy       Enable the entropy collector (default: enabled).
      --collector.filefd        Enable the filefd collector (default: enabled).
      --collector.filesystem    Enable the filesystem collector (default: enabled).
      --collector.hwmon         Enable the hwmon collector (default: enabled).
      --collector.infiniband    Enable the infiniband collector (default: enabled).
      --collector.interrupts    Enable the interrupts collector (default: disabled).
      --collector.ipvs          Enable the ipvs collector (default: enabled).
      --collector.ksmd          Enable the ksmd collector (default: disabled).
      --collector.loadavg       Enable the loadavg collector (default: enabled).
      --collector.logind        Enable the logind collector (default: disabled).
      --collector.mdadm         Enable the mdadm collector (default: enabled).
      --collector.meminfo       Enable the meminfo collector (default: enabled).
      --collector.meminfo_numa  Enable the meminfo_numa collector (default: disabled).
      --collector.mountstats    Enable the mountstats collector (default: disabled).
      --collector.netclass      Enable the netclass collector (default: enabled).
      --collector.netdev        Enable the netdev collector (default: enabled).
      --collector.netstat       Enable the netstat collector (default: enabled).
      --collector.nfs           Enable the nfs collector (default: enabled).
      --collector.nfsd          Enable the nfsd collector (default: enabled).
      --collector.ntp           Enable the ntp collector (default: disabled).
      --collector.perf          Enable the perf collector (default: disabled).
      --collector.pressure      Enable the pressure collector (default: enabled).
      --collector.processes     Enable the processes collector (default: disabled).
      --collector.qdisc         Enable the qdisc collector (default: disabled).
      --collector.runit         Enable the runit collector (default: disabled).
      --collector.sockstat      Enable the sockstat collector (default: enabled).
      --collector.stat          Enable the stat collector (default: enabled).
      --collector.supervisord   Enable the supervisord collector (default: disabled).
      --collector.systemd       Enable the systemd collector (default: disabled).
      --collector.tcpstat       Enable the tcpstat collector (default: disabled).
      --collector.textfile      Enable the textfile collector (default: enabled).
      --collector.time          Enable the time collector (default: enabled).
      --collector.timex         Enable the timex collector (default: enabled).
      --collector.uname         Enable the uname collector (default: enabled).
      --collector.vmstat        Enable the vmstat collector (default: enabled).
      --collector.wifi          Enable the wifi collector (default: disabled).
      --collector.xfs           Enable the xfs collector (default: enabled).
      --collector.zfs           Enable the zfs collector (default: enabled).
      --web.listen-address=":9100"
                                Address on which to expose metrics and web interface.
      --web.telemetry-path="/metrics"
                                Path under which to expose metrics.
      --web.disable-exporter-metrics
                                Exclude metrics about the exporter itself (promhttp_*, process_*, go_*).
      --web.max-requests=40     Maximum number of parallel scrape requests. Use 0 to disable.
      --log.level="info"        Only log messages with the given severity or above. Valid levels: [debug, info, warn, error, fatal]
      --log.format="logger:stderr"
                                Set the log target and format. Example: "logger:syslog?appname=bob&local=7" or "logger:stdout?json=true"
      --version                 Show application version.

Are you running node_exporter in Docker?

N/A

What did you do that produced an error?

It would be great to have the SUB value (low-level unit activation state) exposed by node_exporter as a label on the node_systemd_unit_state metric (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sect-managing_services_with_systemd-services).

$ systemctl list-units --type service 
UNIT                    LOAD    ACTIVE  SUB      DESCRIPTION
node_exporter.service   loaded  active  running   Prometheus Node Exporter

$ systemctl status node_exporter -l
● node_exporter.service - Prometheus Node Exporter
   Loaded: loaded (/usr/lib/systemd/system/node_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-07-01 10:39:36 CEST; 1 day 1h ago
 Main PID: 13320 (node_exporter)

What did you expect to see?

It would be great to see both the "high-level unit activation state" (ACTIVE) and the "low-level unit activation state" (SUB) as labels on the node_systemd_unit_state metric. At the moment only the state is exposed, without the substate. Below I've added the substate label:

node_systemd_unit_state{alias="server01",env="int",instance="192.168.11.11:9100",job="server01",name="node_exporter.service",state="active",substate="running",type="simple"}

What did you see?

node_systemd_unit_state{alias="server01",env="int",instance="192.168.11.11:9100",job="server01",name="node_exporter.service",state="active",type="simple"}

SuperQ commented 5 years ago

While this may seem trivial at first glance, it's a lot more complicated. The combination of unit states and substates is quite a long list.

From systemctl --state=help:

Available unit load states:
stub
loaded
not-found
error
merged
masked

Available unit active states:
active
reloading
inactive
failed
activating
deactivating

Available automount unit substates:
dead
waiting
running
failed

Available device unit substates:
dead
tentative
plugged

Available mount unit substates:
dead
mounting
mounting-done
mounted
remounting
unmounting
remounting-sigterm
remounting-sigkill
unmounting-sigterm
unmounting-sigkill
failed

Available path unit substates:
dead
waiting
running
failed

Available scope unit substates:
dead
running
abandoned
stop-sigterm
stop-sigkill
failed

Available service unit substates:
dead
start-pre
start
start-post
running
exited
reload
stop
stop-sigabrt
stop-sigterm
stop-sigkill
stop-post
final-sigterm
final-sigkill
failed
auto-restart

Available slice unit substates:
dead
active

Available socket unit substates:
dead
start-pre
start-chown
start-post
listening
running
stop-pre
stop-pre-sigterm
stop-pre-sigkill
stop-post
final-sigterm
final-sigkill
failed

Available swap unit substates:
dead
activating
activating-done
active
deactivating
deactivating-sigterm
deactivating-sigkill
failed

Available target unit substates:
dead
active

Available timer unit substates:
dead
waiting
running
elapsed
failed

To do this correctly, we have to expand the current state bitmask into the full combination of sub-states. Even with this help info, the valid state + sub-state combinations aren't mapped. For example, is failed + running a valid combination?

We also need to detect which type of unit each one is and only expose the sub-states that are valid for that type.

This might make a better separate metric, node_systemd_unit_substate. This would simplify dealing with the valid combinations.
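A minimal sketch of what such a separate metric could look like, written in Python for illustration only (node_exporter itself is written in Go). It emits one series per substate that is valid for the unit's type, with value 1 for the current substate and 0 otherwise. The helper names, the `unit_type` label, and deriving the unit type from the name suffix are assumptions made for this sketch, not existing node_exporter behavior. Only a few unit types from the `systemctl --state=help` output above are included.

```python
# Substates per unit type, copied from the `systemctl --state=help` output above.
# Only a subset of unit types is listed to keep the sketch short.
UNIT_SUBSTATES = {
    "service": [
        "dead", "start-pre", "start", "start-post", "running", "exited",
        "reload", "stop", "stop-sigabrt", "stop-sigterm", "stop-sigkill",
        "stop-post", "final-sigterm", "final-sigkill", "failed", "auto-restart",
    ],
    "target": ["dead", "active"],
    "timer": ["dead", "waiting", "running", "elapsed", "failed"],
}


def unit_type(name):
    """Derive the unit type from the name suffix (assumption of this sketch)."""
    return name.rsplit(".", 1)[-1]


def substate_samples(name, current_substate):
    """Return (labels, value) pairs, one per substate valid for the unit type."""
    utype = unit_type(name)
    valid = UNIT_SUBSTATES.get(utype)
    if valid is None:
        raise ValueError(f"unit type {utype!r} is not covered by this sketch")
    if current_substate not in valid:
        raise ValueError(f"{current_substate!r} is not a valid {utype} substate")
    return [
        ({"name": name, "unit_type": utype, "substate": s},
         1 if s == current_substate else 0)
        for s in valid
    ]


def render(samples):
    """Render samples as text-format lines for the hypothetical metric."""
    lines = []
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines.append(f"node_systemd_unit_substate{{{label_str}}} {value}")
    return lines


if __name__ == "__main__":
    for line in render(substate_samples("node_exporter.service", "running")):
        print(line)
```

Keeping the substate in its own metric means each unit only gets the series that make sense for its type, and the state + sub-state combinations never have to be enumerated.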

fchiorascu commented 5 years ago

Sounds great, thank you for the detailed explanation. At the beginning, I was wondering whether it would be possible to have only two scenarios, like:

node_systemd_unit_state{alias="server01",env="int",instance="192.168.11.11:9100",job="server01",name="node_exporter.service",state="active",substate="running",type="simple"}

and

a second series where every substate other than "running" is reported as substate="failed":

node_systemd_unit_state{alias="server01",env="int",instance="192.168.11.11:9100",job="server01",name="node_exporter.service",state="active",substate="failed",type="simple"}

But I think what you described makes more sense.

fchiorascu commented 2 years ago

Hi @discordianfish any news? :)

discordianfish commented 2 years ago

Not that I'm aware of. We're open to submissions that implement this, but I don't think anyone has done anything to address it yet.

fchiorascu commented 1 year ago

Hi @discordianfish, @SuperQ, maybe a future release of node_exporter will have this?

discordianfish commented 1 year ago

I think we're open to including this, so if you want to implement it, we'll consider it.

jmnote commented 1 year ago

I am interested in discussing this issue. The status of my system is as follows.

[root@localhost ~]# systemctl is-enabled node_exporter
disabled
[root@localhost ~]# systemctl list-units --type service
  UNIT                   LOAD   ACTIVE SUB     DESCRIPTION
● node_exporter.service  loaded failed failed  Prometheus Node Exporter
[root@localhost ~]# systemctl status node_exporter
● node_exporter.service - Prometheus Node Exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2023-03-21 19:28:28 KST; 1 day 15h ago
 Main PID: 8572 (code=exited, status=1/FAILURE)
...

In this state, an alert is triggered by the following rule, which we do not want.

node_systemd_unit_state{state="failed",type!="oneshot"} == 1

It would be good if we could prevent the alert using expressions like:

node_systemd_unit_state{state!="disabled",substate="failed",type!="oneshot"} == 1
node_systemd_unit_state{state!="inactive",substate="failed",type!="oneshot"} == 1
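To reason about how such selectors would behave, here is a small Python sketch that mimics equality and inequality label matching against a handful of sample series. Everything in it is an assumption for illustration: the substate label does not exist today, the sample series are made up to mirror the unit shown above (ACTIVE failed, SUB failed), type="simple" is a guess, and one series per state with value 1 for the current state is assumed, as the `== 1` in the rule suggests.

```python
def matches(labels, matchers):
    """Check labels against (label, op, value) matchers, op being '=' or '!='."""
    for label, op, value in matchers:
        # Prometheus treats a missing label like an empty label value.
        actual = labels.get(label, "")
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
    return True


# Hypothetical series for node_exporter.service in the situation shown above.
STATES = ["active", "activating", "deactivating", "failed", "inactive"]
series = [
    ({"name": "node_exporter.service", "state": s,
      "substate": "failed", "type": "simple"},
     1 if s == "failed" else 0)
    for s in STATES
]

# Second expression from above: state!="inactive", substate="failed", type!="oneshot"
matchers = [("state", "!=", "inactive"),
            ("substate", "=", "failed"),
            ("type", "!=", "oneshot")]

firing = [labels for labels, value in series
          if matches(labels, matchers) and value == 1]

if __name__ == "__main__":
    for labels in firing:
        print(labels)
```

Under these assumptions the series with state="failed" still passes state!="inactive", so the selector keeps matching this unit. The "disabled" value reported by systemctl is-enabled is the unit file state rather than an activation state, so excluding such units would probably need that information exposed separately.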