osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

Provide health check for all containers #433

Closed nerdicbynature closed 3 months ago

nerdicbynature commented 1 year ago

Hi,

Is there any chance of adding health checks to all containers deployed by OSISM? It would be especially helpful if the health status also included the state of the RabbitMQ connection of the service running inside the container.

Background: If 1 out of 3 controller nodes is rebooted, 1/3 of all RabbitMQ containers get restarted as well, and a lot of services do not recover their broken connections to RabbitMQ. As a result, many problems go unnoticed and are detected late.

Kind regards, André.

berendt commented 1 year ago

The healthchecks for the OpenStack services come from the Kolla-Ansible project. It is necessary to create a list of which health checks are currently missing and which need to be added.

nerdicbynature commented 1 year ago

In my opinion all containers should have a working health check. Many already have, but not all of them.

Also, a huge improvement would be to factor the RabbitMQ connection state into the health status. Currently only the logs show that there is a problem after one of the RabbitMQ instances has been restarted.

fkr commented 1 year ago

In my opinion all containers should have a working health check. Many already have, but not all of them.

Do you have a list from your environment? That would come in handy.

nerdicbynature commented 1 year ago

As for scs1, these are the containers on one of the controllers:

4317ed27d592   quay.io/osism/redis-sentinel:5.0.7.20230125                                    "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             redis_sentinel
867515d7ac42   quay.io/osism/redis:5.0.7.20230125                                             "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             redis
3be7fec27b42   quay.io/osism/ovn-controller:22.03.0.20230125                                  "dumb-init --single-…"   2 days ago    Up 2 days                       ovn_controller
f371798105ca   quay.io/osism/openvswitch-vswitchd:2.17.3.20230125                             "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             openvswitch_vswitchd
d5ed087280f3   quay.io/osism/openvswitch-db-server:2.17.3.20230125                            "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             openvswitch_db
e0c7cefdc59c   quay.io/osism/prometheus-cadvisor:0.38.7.20230125                              "dumb-init --single-…"   2 days ago    Up 2 days                       prometheus_cadvisor
a7059f741468   quay.io/osism/prometheus-memcached-exporter:0.6.0.20230125                     "dumb-init --single-…"   2 days ago    Up 2 days                       prometheus_memcached_exporter
be3ae6541f88   quay.io/osism/prometheus-haproxy-exporter:0.10.0.20230125                      "dumb-init --single-…"   2 days ago    Up 2 days                       prometheus_haproxy_exporter
44bf99f4d613   quay.io/osism/prometheus-mysqld-exporter:0.12.1.20230125                       "dumb-init --single-…"   2 days ago    Up 2 days                       prometheus_mysqld_exporter
bc1d827a3985   quay.io/osism/prometheus-node-exporter:0.18.1.20230125                         "dumb-init --single-…"   2 days ago    Up 2 days                       prometheus_node_exporter
1b3a9bc1b026   quay.io/osism/grafana:9.3.4.20230125                                           "dumb-init --single-…"   2 days ago    Up 2 days                       grafana
c23dfac163e8   quay.io/osism/octavia-worker:10.0.0.20230125                                   "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             octavia_worker
c839e66c91a9   quay.io/osism/octavia-housekeeping:10.0.0.20230125                             "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             octavia_housekeeping
bf483ca8e93f   quay.io/osism/octavia-health-manager:10.0.0.20230125                           "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             octavia_health_manager
0a5b5bfdfe2a   quay.io/osism/octavia-driver-agent:10.0.0.20230125                             "dumb-init --single-…"   2 days ago    Up 2 days                       octavia_driver_agent
5eff2456adfd   quay.io/osism/octavia-api:10.0.0.20230125                                      "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             octavia_api
3d1017b382bd   quay.io/osism/designate-sink:14.0.1.20230125                                   "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             designate_sink
b72f3d90f8fc   quay.io/osism/designate-worker:14.0.1.20230125                                 "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             designate_worker
c141b3e8bf7e   quay.io/osism/designate-mdns:14.0.1.20230125                                   "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             designate_mdns
c262371dd337   quay.io/osism/designate-producer:14.0.1.20230125                               "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             designate_producer
c7bb2e96ffff   quay.io/osism/designate-central:14.0.1.20230125                                "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             designate_central
dbf7d0f5f190   quay.io/osism/designate-api:14.0.1.20230125                                    "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             designate_api
b3b534515901   quay.io/osism/designate-backend-bind9:14.0.1.20230125                          "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             designate_backend_bind9
ba9953cb9e1d   quay.io/osism/nova-novncproxy:25.0.1.20230125                                  "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             nova_novncproxy
2b5ea5a70dd6   quay.io/osism/nova-conductor:25.0.1.20230125                                   "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             nova_conductor
a5444ef048cf   quay.io/osism/nova-api:25.0.1.20230125                                         "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             nova_api
5af13b1dce99   quay.io/osism/nova-scheduler:25.0.1.20230125                                   "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             nova_scheduler
f5c12ee3ef3f   quay.io/osism/barbican-worker:14.0.2.20230125                                  "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             barbican_worker
a4d824326da9   quay.io/osism/barbican-keystone-listener:14.0.2.20230125                       "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             barbican_keystone_listener
018522b15721   quay.io/osism/barbican-api:14.0.2.20230125                                     "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             barbican_api
6c0896060107   quay.io/osism/placement-api:7.0.0.20230125                                     "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             placement_api
02cf27898c78   quay.io/osism/heat-engine:18.0.0.20230125                                      "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             heat_engine
b9bbca17edba   quay.io/osism/heat-api-cfn:18.0.0.20230125                                     "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             heat_api_cfn
74b99fe0b3d0   quay.io/osism/heat-api:18.0.0.20230125                                         "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             heat_api
cb111a87a8ac   quay.io/osism/neutron-server:20.2.0.20230125                                   "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             neutron_server
ac4818919ea9   quay.io/osism/ovn-northd:22.03.0.20230125                                      "dumb-init --single-…"   2 days ago    Up 2 days                       ovn_northd
f7c8aff9bf98   quay.io/osism/ovn-sb-db-server:22.03.0.20230125                                "dumb-init --single-…"   2 days ago    Up 2 days                       ovn_sb_db
8941f7698ac4   quay.io/osism/ovn-nb-db-server:22.03.0.20230125                                "dumb-init --single-…"   2 days ago    Up 2 days                       ovn_nb_db
6d3b8231cc7c   quay.io/osism/cinder-backup:20.0.1.20230125                                    "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             cinder_backup
e40197dd60a4   quay.io/osism/cinder-volume:20.0.1.20230125                                    "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             cinder_volume
340fa7e1e863   quay.io/osism/cinder-scheduler:20.0.1.20230125                                 "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             cinder_scheduler
c1353756d586   quay.io/osism/cinder-api:20.0.1.20230125                                       "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             cinder_api
ce258e0443fa   quay.io/osism/glance-api:24.1.0.20230125                                       "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             glance_api
2d05a8a8b30e   quay.io/osism/keystone:21.0.0.20230125                                         "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             keystone
399e6cbe9ac4   quay.io/osism/keystone-fernet:21.0.0.20230125                                  "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             keystone_fernet
5d034d58b0c0   quay.io/osism/keystone-ssh:21.0.0.20230125                                     "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             keystone_ssh
1607370baf8e   quay.io/osism/rabbitmq:3.10.14.20230125                                        "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             rabbitmq
aee92ace3fc2   quay.io/osism/mariadb-clustercheck:10.6.11.20230125                            "dumb-init --single-…"   2 days ago    Up 2 days                       mariadb_clustercheck
19476c337374   quay.io/osism/mariadb-server:10.6.11.20230125                                  "dumb-init -- kolla_…"   2 days ago    Up 2 days                       mariadb
50c1faaa2672   quay.io/osism/memcached:1.5.22.20230125                                        "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             memcached
28e05b147632   quay.io/osism/keepalived:2.0.19.20230125                                       "dumb-init --single-…"   2 days ago    Up 2 days                       keepalived
05d0a54a9088   quay.io/osism/haproxy:2.2.26.20230125                                          "dumb-init --single-…"   2 days ago    Up 2 days (healthy)             haproxy
317a841f1fdb   quay.io/osism/cron:3.0pl1.20230125                                             "dumb-init --single-…"   3 days ago    Up 3 days                       cron
6eaf9c3f55e0   quay.io/osism/kolla-toolbox:14.8.1.20230125                                    "dumb-init --single-…"   3 days ago    Up 3 days                       kolla_toolbox
22b5b116bd3f   quay.io/osism/fluentd:4.4.2.20230125                                           "dumb-init --single-…"   3 days ago    Up 3 days                       fluentd
9b07242e40e8   quay.io/osism/ceph-daemon:pacific                                              "/usr/bin/ceph-crash"    8 weeks ago   Up 8 weeks                      ceph-crash-control1-scs1-az0
086e68c102d7   quay.io/osism/ceph-daemon:pacific                                              "/opt/ceph-container…"   8 weeks ago   Up 8 weeks                      ceph-mgr-control1-scs1-az0
36bee1de162f   quay.io/osism/ceph-daemon:pacific                                              "/opt/ceph-container…"   8 weeks ago   Up 8 weeks                      ceph-rgw-control1-scs1-az0-rgw0
74e2c81828c9   quay.io/osism/ceph-daemon:pacific                                              "/opt/ceph-container…"   8 weeks ago   Up 8 weeks                      ceph-mon-control1-scs1-az0

List from one compute node:

CONTAINER ID   IMAGE                                                      COMMAND                  CREATED        STATUS                PORTS     NAMES
a6f388f4f7b6   quay.io/osism/prometheus-libvirt-exporter:4.2.0.20230125   "dumb-init --single-…"   2 days ago     Up 2 days                       prometheus_libvirt_exporter
353c051bfd22   quay.io/osism/prometheus-cadvisor:0.38.7.20230125          "dumb-init --single-…"   2 days ago     Up 2 days                       prometheus_cadvisor
cbac13fa3d26   quay.io/osism/prometheus-node-exporter:0.18.1.20230125     "dumb-init --single-…"   2 days ago     Up 2 days                       prometheus_node_exporter
4cde69fa591d   quay.io/osism/nova-compute:25.0.1.20230125                 "dumb-init --single-…"   2 days ago     Up 2 days (healthy)             nova_compute
f338d93870f7   quay.io/osism/nova-libvirt:8.0.0.20230125                  "dumb-init --single-…"   2 days ago     Up 2 days (healthy)             nova_libvirt
5ad985c08aa5   quay.io/osism/nova-ssh:25.0.1.20230125                     "dumb-init --single-…"   2 days ago     Up 2 days (healthy)             nova_ssh
cd7adc3620e5   quay.io/osism/neutron-metadata-agent:20.2.0.20230125       "dumb-init --single-…"   2 days ago     Up 2 days (healthy)             neutron_ovn_metadata_agent
92d30d879aff   quay.io/osism/ovn-controller:22.03.0.20230125              "dumb-init --single-…"   2 days ago     Up 2 days                       ovn_controller
de0fc1aad5b8   quay.io/osism/openvswitch-vswitchd:2.17.3.20230125         "dumb-init --single-…"   2 days ago     Up 2 days (healthy)             openvswitch_vswitchd
b3c45871158a   quay.io/osism/openvswitch-db-server:2.17.3.20230125        "dumb-init --single-…"   2 days ago     Up 2 days (healthy)             openvswitch_db
97e9e3b5fde3   quay.io/osism/cron:3.0pl1.20230125                         "dumb-init --single-…"   3 days ago     Up 3 days                       cron
b450097ac211   quay.io/osism/kolla-toolbox:14.8.1.20230125                "dumb-init --single-…"   3 days ago     Up 3 days                       kolla_toolbox
b3f59a1115a9   quay.io/osism/fluentd:4.4.2.20230125                       "dumb-init --single-…"   3 days ago     Up 3 days                       fluentd
15cd03aadd72   quay.io/osism/ceph-daemon:pacific                          "/usr/bin/ceph-crash"    2 months ago   Up 2 months                     ceph-crash-compute1-scs1-az1
c6e8154dac6f   quay.io/osism/ceph-daemon:pacific                          "/opt/ceph-container…"   2 months ago   Up 2 months                     ceph-osd-8
1bf2bea7ec66   quay.io/osism/ceph-daemon:pacific                          "/opt/ceph-container…"   2 months ago   Up 2 months                     ceph-osd-4
d3aa3c33fd0a   quay.io/osism/ceph-daemon:pacific                          "/opt/ceph-container…"   2 months ago   Up 2 months                     ceph-osd-25
7d2539fc27c5   quay.io/osism/ceph-daemon:pacific                          "/opt/ceph-container…"   2 months ago   Up 2 months                     ceph-osd-20
1eadc5815311   quay.io/osism/ceph-daemon:pacific                          "/opt/ceph-container…"   2 months ago   Up 2 months                     ceph-osd-15
e685666838c7   quay.io/osism/ceph-daemon:pacific                          "/opt/ceph-container…"   2 months ago   Up 2 months                     ceph-osd-10
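
For anyone who wants to produce the "missing health check" list from their own environment, here is a minimal sketch that filters `docker ps` output like the listings above. It assumes the default table layout (NAMES as the last column) and relies on Docker appending `(healthy)`, `(unhealthy)` or `(health: starting)` to the STATUS column only when a health check is configured. The function name is made up for this example.

```python
import re

# Docker adds one of these suffixes to STATUS only if a health check exists.
HEALTH_RE = re.compile(r"\((healthy|unhealthy|health: starting)\)")


def containers_without_healthcheck(ps_output: str) -> list[str]:
    """Return the names of containers whose STATUS shows no health state."""
    missing = []
    for line in ps_output.splitlines():
        line = line.rstrip()
        # Skip blank lines and the header row of `docker ps`.
        if not line or line.startswith("CONTAINER ID"):
            continue
        if HEALTH_RE.search(line):
            continue
        # Assumption: NAMES is the last whitespace-separated field.
        missing.append(line.split()[-1])
    return missing
```

Feeding it the controller listing would report entries such as `ovn_controller`, `grafana` and the `prometheus_*` exporters, while `nova_api` or `rabbitmq` would be skipped. Note that it only tells you whether a health check exists, not whether that check is meaningful.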

However, "healthy" is misleading in many cases. For example, "nova-compute" is broken after a RabbitMQ restart but still reports "healthy", because the health check only checks for an open port, not whether the process is actually working.

To test this: restart all RabbitMQ instances, watch the health status of the nova* containers, and check the logs in /var/log/kolla/nova or try to schedule a VM on that hypervisor.
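
While running that test, one rough way to see whether any connection to RabbitMQ is still up is to count established TCP connections to the AMQP port. The sketch below parses text in the format of `/proc/net/tcp`; it assumes the default AMQP port 5672 and IPv4 (IPv6 sockets live in `/proc/net/tcp6`), and the function name is made up. It is not what the existing health checks do, and the file is per network namespace rather than per process, so with host networking it counts connections from every service on the node.

```python
AMQP_PORT = 5672        # assumption: RabbitMQ listens on the default AMQP port
TCP_ESTABLISHED = "01"  # state code for ESTABLISHED in /proc/net/tcp


def count_established(proc_net_tcp: str, remote_port: int = AMQP_PORT) -> int:
    """Count ESTABLISHED sockets whose remote port equals remote_port."""
    count = 0
    for line in proc_net_tcp.splitlines():
        fields = line.split()
        # Data rows start with an index such as "0:"; this skips the header.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        remote, state = fields[2], fields[3]
        # Addresses look like "0B00000A:1628": hex IP, colon, hex port.
        port = int(remote.rsplit(":", 1)[1], 16)
        if state == TCP_ESTABLISHED and port == remote_port:
            count += 1
    return count


if __name__ == "__main__":
    with open("/proc/net/tcp") as handle:
        print(count_established(handle.read()))
```

An established socket still does not prove that the service is consuming messages, so this is a hint at best. Narrowing it down to a single process would need extra work, for example matching socket inodes against `/proc/<pid>/fd`.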

mauhau commented 4 months ago

Currently the health check shows only basic information about the Python process:

neutron@neutron-api-b456cdbf8-2b7jn:/$ curl -X GET -i -H "Accept: application/json" http://localhost:8080/healthcheck ; echo
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 62
Date: Thu, 22 Feb 2024 10:17:23 GMT

{
    "detailed": false,
    "reasons": [
        "OK"
    ]
}

It is probably possible to add to or extend these checks via a middleware plugin: https://opendev.org/openstack/oslo.middleware/src/branch/master/oslo_middleware/healthcheck
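
To illustrate the idea of an extended check, here is a standalone sketch that combines several checks into one response shaped like the JSON above. It deliberately does not use the oslo.middleware plugin API. `run_checks` and the check callables are hypothetical names for this example, and returning 503 when any check fails is an assumption about the desired behaviour.

```python
import json
from typing import Callable, Iterable, Tuple

# A check returns (available, reason), e.g. (False, "no AMQP connection").
Check = Callable[[], Tuple[bool, str]]


def run_checks(checks: Iterable[Check]) -> Tuple[int, str]:
    """Run all checks and build an HTTP status plus a JSON body."""
    reasons = []
    all_ok = True
    for check in checks:
        try:
            ok, reason = check()
        except Exception as exc:  # a crashing check counts as a failure
            ok, reason = False, f"check raised {exc!r}"
        all_ok = all_ok and ok
        reasons.append(reason)
    status = 200 if all_ok else 503
    body = json.dumps({"detailed": False, "reasons": reasons}, indent=4)
    return status, body
```

With a single check returning `(True, "OK")` this reproduces the body shown above. A RabbitMQ connection check would simply be one more callable in the list, and a failing one would turn the overall status into 503.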

berendt commented 3 months ago

Closing this. There are now health checks for most of the Kolla containers. We'll work on improving the health checks themselves in the linked issue.