theforeman / smart_proxy_monitoring

Smart proxy plugin for monitoring system integration
GNU General Public License v3.0

How is the "Monitoring Status" calculated? #24

Open kristofver opened 5 years ago

kristofver commented 5 years ago

Although all services are shown as OK in the host's "Monitoring" tab, the "Monitoring Status" shows Critical. In Icinga2, everything is green. I'm using FQDNs in both Icinga and The Foreman. I can see two things that might influence this:

Any ideas on this? kind regards, Kristof

timogoebel commented 5 years ago

The monitoring status is actually calculated in the foreman, see https://github.com/theforeman/foreman_monitoring/blob/ef894d9c11bdaa4dd4693a758aedca6005f3066f/app/models/host_status/monitoring_status.rb#L12-L22.

kristofver commented 5 years ago

Given the to_status function, can you see a reason why a host with the monitoring results below could still have a Monitoring Status of Critical?

foreman=# select * from monitoring_results where host_id = 15;
 id  | host_id |                service                 | result | downtime | acknowledged |         timestamp
-----+---------+----------------------------------------+--------+----------+--------------+----------------------------
 293 |      15 | mem                                    |      0 | f        | f            | 2019-01-15 08:59:38.675023
 176 |      15 | Host Check                             |      0 | f        | f            | 2019-01-15 09:00:28.794356
 190 |      15 | **redacted**                           |      0 | f        | f            | 2019-01-15 09:00:11.487482
 195 |      15 | **redacted**                           |      0 | f        | f            | 2019-01-15 09:00:26.921551
 221 |      15 | ping4                                  |      0 | f        | f            | 2019-01-15 08:59:56.651465
 232 |      15 | ntp                                    |      0 | f        | f            | 2019-01-15 09:00:28.403995
 234 |      15 | disk                                   |      0 | f        | f            | 2019-01-15 09:00:06.08392
 263 |      15 | ssh                                    |      0 | f        | f            | 2019-01-15 08:59:57.859941
 272 |      15 | load                                   |      0 | f        | f            | 2019-01-15 09:00:04.407032
(9 rows)
timogoebel commented 5 years ago

Can you use foreman-rake console and try the following?

Host::Managed.find_by(name: 'my-super.host.com').get_status(HostStatus::MonitoringStatus).to_status
kristofver commented 5 years ago

That returns 0.

I've put some debug statements in the to_status function, but it doesn't get called when refreshing the host properties page in Foreman. The to_label function does get called and returns Critical.

I did some more troubleshooting. The HostStatus::MonitoringStatus record is a lot older than the monitoring_results records.

foreman=# select * from monitoring_results where host_id = 26;
 id  | host_id |                       service                        | result | downtime | acknowledged |         timestamp
-----+---------+------------------------------------------------------+--------+----------+--------------+----------------------------
 286 |      26 | load                                                 |      0 | f        | f            | 2019-01-15 08:59:49.232184
 298 |      26 | disk                                                 |      0 | f        | f            | 2019-01-15 08:59:33.737705
 175 |      26 | Host Check                                           |      0 | f        | f            | 2019-01-15 09:00:24.100271
 192 |      26 | ***                                                  |      0 | f        | f            | 2019-01-15 09:00:17.077651
 196 |      26 | ***                                                  |      0 | f        | f            | 2019-01-15 08:59:59.033319
 215 |      26 | ntp                                                  |      0 | f        | f            | 2019-01-15 09:00:28.406188
 264 |      26 | ping4                                                |      0 | f        | f            | 2019-01-15 08:59:54.60494
 265 |      26 | mem                                                  |      0 | f        | f            | 2019-01-15 09:00:15.267184
 274 |      26 | ssh                                                  |      0 | f        | f            | 2019-01-15 08:59:45.435173
(9 rows)

foreman=# select * from host_status where host_id = 26;
 id  |               type                | status | host_id |        reported_at
-----+-----------------------------------+--------+---------+----------------------------
 365 | ForemanOpenscap::ComplianceStatus |      0 |      26 | 2019-01-15 09:01:12.734863
 359 | Katello::PurposeUsageStatus       |      0 |      26 | 2019-01-15 09:01:12.851821
 345 | HostStatus::BuildStatus           |      0 |      26 | 2019-01-15 09:01:12.729576
 373 | HostStatus::ExecutionStatus       |      0 |      26 | 2019-01-15 09:01:12.755233
 358 | Katello::PurposeRoleStatus        |      0 |      26 | 2019-01-15 09:01:12.82413
 363 | HostStatus::ConfigurationStatus   |      0 |      26 | 2019-01-16 18:33:42
 455 | HostStatus::MonitoringStatus      |      2 |      26 | 2019-01-13 11:12:31.693512
 356 | Katello::ErrataStatus             |      0 |      26 | 2019-01-15 09:01:12.759535
 355 | Katello::SubscriptionStatus       |      0 |      26 | 2019-01-15 09:01:12.767729
 357 | Katello::PurposeSlaStatus         |      0 |      26 | 2019-01-15 09:01:12.795652
 360 | Katello::PurposeAddonsStatus      |      0 |      26 | 2019-01-15 09:01:12.879249
 361 | Katello::PurposeStatus            |      0 |      26 | 2019-01-15 09:01:12.906951
 456 | Katello::TraceStatus              |      0 |      26 | 2019-01-15 09:01:12.911751
(13 rows)

I then rebooted the monitored server. The monitoring_results records were updated, but the HostStatus::MonitoringStatus record is still the old one from 2019-01-13.

foreman=# select * from monitoring_results where host_id = 26;
 id  | host_id |                       service                        | result | downtime | acknowledged |         timestamp
-----+---------+------------------------------------------------------+--------+----------+--------------+----------------------------
 286 |      26 | load                                                 |      0 | f        | f            | 2019-01-16 19:02:44.485407
 298 |      26 | disk                                                 |      0 | f        | f            | 2019-01-16 19:02:44.50284
 175 |      26 | Host Check                                           |      0 | f        | f            | 2019-01-16 19:02:32.77024
 196 |      26 | ***                                                  |      0 | f        | f            | 2019-01-16 19:02:42.91751
 264 |      26 | ***                                                  |      0 | f        | f            | 2019-01-16 19:02:33.475878
 265 |      26 | mem                                                  |      0 | f        | f            | 2019-01-16 19:02:43.258151
 274 |      26 | ssh                                                  |      0 | f        | f            | 2019-01-16 19:02:28.217909
 192 |      26 | ***                                                  |      0 | f        | f            | 2019-01-15 09:00:17.077651
 215 |      26 | ntp                                                  |      0 | f        | f            | 2019-01-15 09:00:28.406188
(9 rows)

foreman=# select * from host_status where host_id = 26;
 id  |               type                | status | host_id |        reported_at
-----+-----------------------------------+--------+---------+----------------------------
 365 | ForemanOpenscap::ComplianceStatus |      0 |      26 | 2019-01-15 09:01:12.734863
 359 | Katello::PurposeUsageStatus       |      0 |      26 | 2019-01-15 09:01:12.851821
 345 | HostStatus::BuildStatus           |      0 |      26 | 2019-01-16 18:23:30.864883
 373 | HostStatus::ExecutionStatus       |      0 |      26 | 2019-01-15 09:01:12.755233
 358 | Katello::PurposeRoleStatus        |      0 |      26 | 2019-01-15 09:01:12.82413
 363 | HostStatus::ConfigurationStatus   |      2 |      26 | 2019-01-16 19:02:18
 455 | HostStatus::MonitoringStatus      |      2 |      26 | 2019-01-13 11:12:31.693512
 356 | Katello::ErrataStatus             |      0 |      26 | 2019-01-15 09:01:12.759535
 355 | Katello::SubscriptionStatus       |      0 |      26 | 2019-01-15 09:01:12.767729
 357 | Katello::PurposeSlaStatus         |      0 |      26 | 2019-01-15 09:01:12.795652
 360 | Katello::PurposeAddonsStatus      |      0 |      26 | 2019-01-15 09:01:12.879249
 361 | Katello::PurposeStatus            |      0 |      26 | 2019-01-15 09:01:12.906951
 456 | Katello::TraceStatus              |      0 |      26 | 2019-01-15 09:01:12.911751
(13 rows)

Any clues? When would the host_status record normally be updated? Is this something the monitoring smart proxy does?

BTW, thanks for your support!

timogoebel commented 5 years ago

I suspect that the status is not being refreshed properly. Can you try the following and see if it helps?

Host::Managed.find_by(name: 'my-super.host.com').get_status(HostStatus::MonitoringStatus).refresh!
timogoebel commented 5 years ago

BTW: a result value of 0 means OK, see https://github.com/theforeman/foreman_monitoring/blob/9cecbc007bef2bf6a8f839d1cda2cc017c20ad45/app/models/host_status/monitoring_status.rb#L3-L6.