prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Systemd service metrics missing when loaded but disabled #1082

Open hueyg opened 5 years ago

hueyg commented 5 years ago

Host operating system: output of uname -a

3.10.0-862.11.6.el7.x86_64 Red Hat Enterprise Linux Server release 7.5 (Maipo)

node_exporter version: output of node_exporter --version

node_exporter, version 0.16.0 (branch: HEAD, revision: d42bd70f4363dced6b77d8fc311ea57b63387e4f)
  build user:       root@a67a9bc13a69
  build date:       20180515-15:52:42
  go version:       go1.9.6

node_exporter command line flags

ExecStart=/home/prometheus/node_exporter/node_exporter --collector.systemd

Are you running node_exporter in Docker?

No

What did you do that produced an error?

I have a custom systemd service defined in /etc/systemd/system for the Keepalived daemon. Running the following query returns the expected results with all five defined states: node_systemd_unit_state{instance="x.x.x.x:9100",name="keepalived.service"}

node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="activating"} 0
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="active"} 1
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="deactivating"} 0
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="failed"} 0
node_systemd_unit_state{instance="192.245.221.215:9100",job="node2",name="keepalived.service",state="inactive"} 0

Once I issue sudo systemctl stop keepalived.service and run the query again, Prometheus returns nothing. It is as if the service was never defined. If I run the query without filtering on the unit name, every other service is returned. Once I start the service, the metrics return again.

What did you expect to see?

I expected to continue to see the states returned, but with active=0 and inactive=1.

What did you see instead?

No metrics for the service were returned, period. Nothing, blank screen. I have a secondary server which I thought was an exact mirror image of the server exhibiting the issue, and it does not experience this problem. Thank you for everyone's time.

hueyg commented 5 years ago

Just to add more information: this appears to be an issue with user-defined services only, as the default system services do not disappear after stopping.

SuperQ commented 5 years ago

That's very strange. The exporter is making fairly simple requests for data over dbus. Perhaps it's a systemd bug?

Maybe it's a bug with the ListUnitsFiltered dbus request. We could try going back to ListUnits and filtering out the non-loaded units in the exporter instead of trusting systemd.
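
A minimal sketch of that exporter-side filtering, assuming the go-systemd dbus package the collector is built on (not the actual collector code):

package main

import (
	"fmt"

	"github.com/coreos/go-systemd/dbus"
)

func main() {
	conn, err := dbus.New() // privileged connection to the system bus
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// ListUnits returns every unit systemd currently has in memory.
	units, err := conn.ListUnits()
	if err != nil {
		panic(err)
	}

	// Filter in the exporter instead of trusting systemd's server-side filter.
	for _, u := range units {
		if u.LoadState != "loaded" {
			continue
		}
		fmt.Printf("%s active=%s\n", u.Name, u.ActiveState)
	}
}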

SuperQ commented 5 years ago

@hueyg Can you verify https://github.com/prometheus/node_exporter/pull/1083 fixes the issue?

hueyg commented 5 years ago

@SuperQ Hey Ben, I apologize, but I am a pretty ignorant GitHub user. I can see that you have updated the Go code to perform some more checks on the state of the units defined in systemd. Does this mean you want me to recompile a new version of node_exporter with this updated code and try again?

SuperQ commented 5 years ago

@hueyg Yes, if you can check out the code and build it, that would help. Otherwise I can post a binary if you trust me. :grin:

You can follow the standard build instructions but run git checkout superq/systemd_filter before you run make build.

SuperQ commented 5 years ago

In my test GCE instance of CentOS 7.5, I see this difference in metrics:

Before:

node_systemd_units{state="active"} 101
node_systemd_units{state="inactive"} 58

After:

node_systemd_units{state="active"} 101
node_systemd_units{state="inactive"} 75

But I don't see a difference in the number of unique units in node_systemd_unit_state; I only see 160 of them. Very strange.

hueyg commented 5 years ago

@SuperQ I have no problem if you want to post a compiled binary, but I will work on it in the meantime. I am really pushing to get this resolved because it is a show stopper for the project. It definitely seems related to custom-defined services. What has me totally confused is that this problem is not exhibited on what I am pretty sure is an identical secondary server. This is a group of two HAProxy servers with custom units defined for HAProxy and Keepalived.

hueyg commented 5 years ago

@SuperQ I think I found the issue Ben. Give me ten more minutes.

hueyg commented 5 years ago

@SuperQ First, let me apologize if this is a known issue/requirement, but the difference is that the custom-defined unit was specifically "enabled" on the working server and "disabled" by default on the non-working server. My limited understanding of systemd is that this simply means whether the unit is set to start at boot or not. So the state of the unit file itself plays a role in what node_exporter can see. Once I enable the custom service with sudo systemctl enable haproxy.service, the service is still returned by node_exporter after being issued a stop command.

SuperQ commented 5 years ago

Interesting, I thought for sure that even a disabled but running service would show up.

SuperQ commented 5 years ago

@lucab Do you have any idea why a service like this wouldn't show up in ListUnits():

# systemctl status chronyd.service
● chronyd.service - NTP client/server
   Loaded: loaded (/etc/systemd/system/chronyd.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

SuperQ commented 5 years ago

@hueyg I did some additional testing; it doesn't seem to matter whether the stopped/disabled unit is in /etc/systemd/system or /usr/lib/systemd/system. It fails to show up when stopped/disabled.

lucab commented 5 years ago

@SuperQ I think that's because the unit is inactive and disabled. Additionally, I fear that at some point the ListUnits DBus method may have changed semantics, as its documentation mentions "loaded units" everywhere (so either the doc or the code is wrong). My suggestion would be to try using ListUnitsFiltered instead.
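
In go-systemd that is a one-line change from the ListUnits sketch above (a fragment, reusing the same conn; the states slice is passed straight through to the ListUnitsFiltered DBus call):

// Ask systemd to filter server-side for units whose load state is "loaded".
units, err := conn.ListUnitsFiltered([]string{"loaded"})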

SuperQ commented 5 years ago

@lucab It seems like ListUnitsFiltered() has the same problem. systemctl status says the unit is loaded, but when we ask for the "loaded" list, inactive/disabled units are missing.

SuperQ commented 5 years ago

I double-checked: same problem with ListUnitsFiltered([]string{}). Units that are "loaded" but disabled are not returned.

lucab commented 5 years ago

@SuperQ which OS and systemd version are you seeing this on (OP was on RHEL7.5)? I can carry this over to a go-systemd ticket and have a look as soon as I have time.

SuperQ commented 5 years ago

@lucab I was testing with CentOS 7.5 and Ubuntu 18.04 (systemd 237).

Thanks, feel free to ping me on the go-systemd upstream.

lucab commented 5 years ago

Getting back to this: it looks like systemctl status is tricking us by ephemerally loading the observed unit, which is in fact not loaded right before or after the observation (being disabled and inactive).

It looks like there are DBus methods to get the "enabled state" for unit files, and to get the "active state" for loaded units, but I don't think there is a single method to get the union of those (and the primary keys are different object types).

In the end, I think this boils down to semantics. This collector is in fact reporting the activation state of loaded units, but that set is dynamic and also influenced by observers.
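
For the "enabled state" side, the Manager interface has a GetUnitFileState method that works on unit files rather than loaded units. A rough, untested sketch of calling it with github.com/godbus/dbus (the bus and interface names are from the systemd DBus documentation):

package main

import (
	"fmt"

	"github.com/godbus/dbus"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}

	mgr := conn.Object("org.freedesktop.systemd1", "/org/freedesktop/systemd1")

	// GetUnitFileState answers from the unit *file*, so it works even
	// when the unit is not currently loaded by the manager.
	var state string
	err = mgr.Call("org.freedesktop.systemd1.Manager.GetUnitFileState", 0,
		"keepalived.service").Store(&state)
	if err != nil {
		panic(err)
	}
	fmt.Println(state) // "enabled", "disabled", "static", ...
}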

SuperQ commented 5 years ago

Thanks. One option is to use the ListUnitFiles function.

miono commented 5 years ago

@SuperQ I'm planning to do a PoC for this later tonight (using ListUnitFiles as the base instead of ListUnits), if you're not already working on it?

SuperQ commented 5 years ago

@miono No, I haven't started on it. Looking forward to the PoC. Thanks!

miono commented 5 years ago

So my initial thought was to keep the call to ListUnits and add another call to ListUnitFiles, then diff the loaded units against the unit files.

By adding a bool called "enabled" or similar to the unit struct, we could add the disabled unit files as unit structs with 0s in (activating|active|deactivating|failed|inactive), and also populate this field for the loaded units with the data we get from ListUnitFiles.
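
A rough sketch of that merge, with hypothetical names (unitInfo, mergedUnits) and using go-systemd; the real PoC would live inside the collector:

package main

import (
	"fmt"
	"path/filepath"

	"github.com/coreos/go-systemd/dbus"
)

// unitInfo is a hypothetical merged view: activation state for loaded
// units, plus placeholder entries for unit files that are not loaded.
type unitInfo struct {
	Name        string
	ActiveState string
	Enabled     bool
}

func mergedUnits(conn *dbus.Conn) ([]unitInfo, error) {
	units, err := conn.ListUnits()
	if err != nil {
		return nil, err
	}
	files, err := conn.ListUnitFiles()
	if err != nil {
		return nil, err
	}

	enabled := map[string]bool{}
	for _, f := range files {
		enabled[filepath.Base(f.Path)] = f.Type == "enabled"
	}

	loaded := map[string]bool{}
	out := []unitInfo{}
	for _, u := range units {
		loaded[u.Name] = true
		out = append(out, unitInfo{u.Name, u.ActiveState, enabled[u.Name]})
	}
	// Unit files with no loaded unit behind them: report as inactive so
	// their series don't disappear when the unit is stopped and disabled.
	for name, en := range enabled {
		if !loaded[name] {
			out = append(out, unitInfo{name, "inactive", en})
		}
	}
	return out, nil
}

func main() {
	conn, err := dbus.New()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	units, err := mergedUnits(conn)
	if err != nil {
		panic(err)
	}
	for _, u := range units {
		fmt.Printf("%s active=%s enabled=%v\n", u.Name, u.ActiveState, u.Enabled)
	}
}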

However, that behaviour would be more desirable for us at my workplace, since we're using a whitelist parameter. But of course not everyone is us.

What is the desired behaviour? My two cents: metrics that suddenly disappear are confusing, and they can cause real problems if you're alerting on active == 1, because when a mistakenly disabled service stops running there is no such metric and hence no alert.

saniatk1985 commented 5 years ago

I have the same issue when running the latest version of node_exporter in a container: nginx and postgresql service statuses are not returned at all when those services are stopped, but when you start them, node_exporter shows all possible statuses with 1 on the active status. The mysqld service, however, is shown correctly: when it's stopped, the statuses are returned with 1 on inactive.

mlushpenko commented 5 years ago

In my case I will work around this for now by using the blackbox exporter to query an endpoint, but that's not exactly the same as checking whether the process is up, and in many cases a process may not have any endpoints. I hope this will be fixed some time soon.

SuperQ commented 5 years ago

@mlushpenko The best option is to have a Prometheus /metrics endpoint on the service itself. This provides both the blackbox check and the service status, eliminating the need to watch systemd at all. :smile:
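
For a Go service, that can be as small as this sketch with github.com/prometheus/client_golang (the port and path are just examples):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Exposes the default Go runtime/process metrics. A successful scrape
	// doubles as the blackbox "is it up" check (Prometheus's up metric).
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}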

zhanglijingisme commented 5 years ago

Hi @SuperQ, is there any fix for this? I notice that this issue still exists in node_exporter v0.17.0. Sincere thanks...

SuperQ commented 5 years ago

There is no current fix, because systemd does not provide the required information over dbus.

mbigras commented 4 years ago

I asked how to keep a unit loaded even when stopped here: https://github.com/systemd/systemd/issues/5063#issuecomment-518456418

This is the response I got:

https://github.com/systemd/systemd/issues/5063#issuecomment-518553166

Use RefUnit() via the bus to continuously reference a unit. In that case it stays loaded until you call UnrefUnit(), or disconnect from the bus, and no other reason is in place to keep it loaded. RefUnit() is available to privileged clients only and since v232 (i.e. ~2016)

I looked at the code

https://github.com/systemd/systemd/blob/f3d3a9ca0734c298cc3bf08f8c4907dd19ee9939/src/core/dbus-manager.c#L2488

https://github.com/systemd/systemd/blob/f3d3a9ca0734c298cc3bf08f8c4907dd19ee9939/src/core/dbus-manager.c#L654

https://github.com/systemd/systemd/blob/f3d3a9ca0734c298cc3bf08f8c4907dd19ee9939/src/core/dbus-manager.c#L567

But I'm still not sure how to use RefUnit in a systemd unit file.

Are these links helpful?
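
For anyone who wants to experiment: RefUnit is a bus call, not a unit-file directive, so it has to be issued by a long-lived privileged client. A rough, untested sketch with github.com/godbus/dbus (the unit name is just an example):

package main

import "github.com/godbus/dbus"

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}

	mgr := conn.Object("org.freedesktop.systemd1", "/org/freedesktop/systemd1")

	// Pin the unit: systemd keeps it loaded until UnrefUnit is called or
	// this bus connection goes away. Privileged clients only, systemd >= 232.
	if call := mgr.Call("org.freedesktop.systemd1.Manager.RefUnit", 0,
		"app3.service"); call.Err != nil {
		panic(call.Err)
	}

	select {} // keep the connection open, otherwise the reference is dropped
}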

My workaround for getting on/off/failed state information from my process manager into my metrics system is to use supervisor instead.

vagrant@srv0:~$ sudo systemctl start app{1,2,3}
vagrant@srv0:~$ sudo supervisorctl start app{1,2,3}
app1: started
app2: started
app3: started
vagrant@srv0:~$ sudo systemctl stop app3
vagrant@srv0:~$ sudo supervisorctl stop app3
app3: stopped
vagrant@srv0:~$ curl -s localhost:9100/metrics | grep 'app[123]' | grep state
node_supervisord_state{group="app1",name="app1"} 20
node_supervisord_state{group="app2",name="app2"} 20
node_supervisord_state{group="app3",name="app3"} 0
node_systemd_unit_state{name="app1.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="app1.service",state="active",type="simple"} 1
node_systemd_unit_state{name="app1.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="app1.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="app1.service",state="inactive",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="active",type="simple"} 1
node_systemd_unit_state{name="app2.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="app2.service",state="inactive",type="simple"} 0

mbigras commented 4 years ago

Looking more closely at another comment

https://github.com/systemd/systemd/issues/5063#issuecomment-518524231

A service will stay loaded if it is wanted/required/etc by something...

I was able to keep a reference to an inactive unit by creating a webapp.service and a webapp.target that wants webapp.service.

This looks like it works!

systemctl cat webapp.{service,target}
# /etc/systemd/system/webapp.service
[Service]
User=webapp
ExecStart=/etc/systemd/system/webapp
Restart=on-failure
RemainAfterExit=true

# /etc/systemd/system/webapp.target
[Unit]
Wants=webapp.service

vagrant@srv0:~/pystemd$ curl -s localhost:9100/metrics | grep webapp
node_supervisord_exit_status{group="webapp",name="webapp"} 0
node_supervisord_state{group="webapp",name="webapp"} 0
node_supervisord_up{group="webapp",name="webapp"} 0
node_systemd_unit_state{name="webapp.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="active",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="failed",type="simple"} 0
node_systemd_unit_state{name="webapp.service",state="inactive",type="simple"} 1
node_systemd_unit_state{name="webapp.target",state="activating",type=""} 0
node_systemd_unit_state{name="webapp.target",state="active",type=""} 1
node_systemd_unit_state{name="webapp.target",state="deactivating",type=""} 0
node_systemd_unit_state{name="webapp.target",state="failed",type=""} 0
node_systemd_unit_state{name="webapp.target",state="inactive",type=""} 0

B-Lukas commented 4 years ago

I am having the same problem. It is quite difficult to write a proper target unit without side effects.

mbigras commented 4 years ago

Yes, I haven't found a way to get systemd to "keep a reference to the unit" without enabling foo.service or creating a target that wants foo.service; in both cases foo.service gets started when I don't want it to.

NikNikM commented 3 years ago

I used this workaround to configure alerting when a desired service is not running on one or more instances:

expr: sum(node_systemd_unit_state{name="my_service.service", state="active"}) < sum(node_uname_info)

hoffie commented 3 years ago

expr: sum(node_systemd_unit_state{name="my_service.service", state="active"}) < sum(node_uname_info)

I suspect that this will not fire if you are hitting the behavior described in this GitHub issue: sum(non_existing) does not return anything (not even 0).

NikNikM commented 3 years ago

You're right, thanks. I didn't consider that, but we can change the expression:

(sum(node_systemd_unit_state{name="my_service.service", state="active"}) or vector(0)) < sum(node_uname_info)

ky-tt commented 3 years ago

I wanted to keep the instance label so I used

count by(instance) (node_systemd_version) unless count by(instance) (node_systemd_unit_state{name="my_service.service",state="active"} == 1)

For me the metrics luckily did not show up in the first place, since the service was disabled and is only started on demand. I would have had a hard time debugging alerts when the metrics disappear. I see the problems with adding them, though.

erolg commented 3 years ago

I think it would be very useful if we had a "reloading" state label. We are going to use the textfile collector to track how many reloads occurred over time.

cray2015 commented 2 years ago

This is the exact case: the service needs to be enabled in order to be reported, because the intention was that the service be kept alive. I tested with the nginx service: it was disabled, so I was not getting alerts for it, but I also did not get a success alert when the service was brought up again. Weird.

Celso-19 commented 1 year ago

I had the same problem, and I am sharing what I did for anyone who may face this in the future. In my case I just want to know when some systemd service stops working, so as a workaround I created this expression in Prometheus:

max_over_time(node_systemd_unit_state{name="your_service_name.service",state="active"}[6h]) unless node_systemd_unit_state{name="your_service_name.service",state="active"}

Basically it compares the services now with the services that existed in the last 6h, and raises an alert if any disappear.

malcolm77 commented 1 year ago

@SuperQ First, let me apologize if this is a known issue/requirement, but the difference is that the custom-defined unit was specifically "enabled" on the working server and "disabled" by default on the non-working server. My limited understanding of systemd is that this simply means whether the unit is set to start at boot or not. So the state of the unit file itself plays a role in what node_exporter can see. Once I enable the custom service with sudo systemctl enable haproxy.service, the service is still returned by node_exporter after being issued a stop command.

This worked for me: once I enabled the 'custom' service, it appeared in the metrics list.

kazz-s commented 12 months ago

5 years later and still open :/

Based on all the explanations here, I'd say this is counterintuitive, but still expected behavior.

In our project, we encountered this problem when monitoring third-party services that are managed by init.d scripts and were not marked as enabled, for whatever reason.

If you can't or won't alter the service configuration, and you want a rule for "the service is down", you can define an expression like this:

node_systemd_unit_state{name="keepalived.service", state=~"inactive|failed"} == 1
or
absent(node_systemd_unit_state{name="keepalived.service"})