osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

Make OpenStack health checks more useful #1046

Open berendt opened 2 months ago

berendt commented 2 months ago
Currently the health check shows only basic information about the Python process:
neutron@neutron-api-b456cdbf8-2b7jn:/$ curl -X GET -i -H "Accept: application/json" http://localhost:8080/healthcheck ; echo
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 62
Date: Thu, 22 Feb 2024 10:17:23 GMT

{
    "detailed": false,
    "reasons": [
        "OK"
    ]
}

It is probably possible to add or extend these checks via a middleware plugin: https://opendev.org/openstack/oslo.middleware/src/branch/master/oslo_middleware/healthcheck

Originally posted by @mauhau in https://github.com/osism/issues/issues/433#issuecomment-2084729990
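
For reference, a healthcheck backend in oslo.middleware is a stevedore-loaded class that subclasses HealthcheckBaseExtension and implements healthcheck(); a minimal sketch (class name and reported reason are placeholders, not an existing plugin):

# Minimal sketch of a custom healthcheck backend for oslo.middleware.
# Class name and reason text are placeholders, not an existing plugin.
from oslo_middleware.healthcheck import pluginbase

class CheckRabbitMQ(pluginbase.HealthcheckBaseExtension):
    def healthcheck(self, server_port):
        # self.conf holds the [app:healthcheck] options from api-paste.ini,
        # self.oslo_conf the host service's oslo.config object.
        return pluginbase.HealthcheckResult(
            available=True, reason="RabbitMQ reachable")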

varkeen commented 1 month ago

A local OpenStack environment has been deployed to speed up researching healthcheck implementation.

Now evaluating how to plug into https://github.com/openstack/oslo.middleware to add specific checks, e.g. connectivity to RabbitMQ.
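
A pure connectivity check does not even need an AMQP client; a hedged sketch of the kind of probe such a backend could call (host/port handling is illustrative only):

# Illustrative only: verifies that a TCP connection to the broker can be opened.
# It does not check AMQP-level functionality, just reachability.
import socket

def rabbitmq_reachable(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False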

varkeen commented 1 month ago

A first RabbitMQ check has been implemented successfully (only locally for now):

stack@elmore:~/neutron/etc$ curl -X GET -i -H "Accept: application/json" http://localhost:9696/networking/healthcheck
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 103
Date: Tue, 09 Jul 2024 09:49:17 GMT

{
    "detailed": false,
    "reasons": [
        "Connection to RabbitMQ@localhost:5672 is ok"
    ]
}
varkeen commented 1 month ago

The check has evolved from a hacky prototype into a "real" plugin.

It now also tests RabbitMQ functionality (not only the connection):

stack@elmore:~$ curl -X GET -i -H "Accept: application/json" http://localhost:9696/networking/healthcheck
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 269
Date: Fri, 12 Jul 2024 08:13:48 GMT

{
    "detailed": false,
    "reasons": [
        [
            "RabbitMQ instance localhost:5672 is connectable",
            "RabbitMQ instance localhost:5672 can receive messages",
            "RabbitMQ instance localhost:5672 can deliver messages"
        ]
    ]
}
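
How the "can receive/deliver messages" part is implemented in the plugin is not shown here; one way to verify it independently of oslo would be a publish/consume round-trip, e.g. with kombu (library choice, queue name and credentials are assumptions):

# Assumption: kombu and guest credentials; the actual plugin may use a
# different client. Publishes a probe message and reads it back.
from kombu import Connection

def rabbitmq_roundtrip(url="amqp://guest:guest@localhost:5672//"):
    try:
        with Connection(url, connect_timeout=2) as conn:
            queue = conn.SimpleQueue("healthcheck-probe")
            queue.put({"ping": "healthcheck"})
            message = queue.get(block=True, timeout=5)
            message.ack()
            queue.close()
        return True
    except Exception:
        return False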

This is what the config looks like:

stack@elmore:~$ cat /etc/neutron/api-paste.ini 

...
[app:healthcheck]
paste.app_factory = oslo_middleware:Healthcheck.app_factory
backends = check_rabbit_mq
check_rabbit_mq_instances = localhost:5672
detailed = False
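
For backends = check_rabbit_mq to be resolvable, the plugin presumably also has to be registered under the oslo.middleware.healthcheck entry-point namespace in the plugin package, e.g. in its setup.cfg (module path below is a placeholder):

[entry_points]
oslo.middleware.healthcheck =
    check_rabbit_mq = healthcheck_plugins.check_rabbit_mq:CheckRabbitMQ
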
varkeen commented 1 month ago

Ideas for further improvements:

berendt commented 1 month ago

@varkeen Is this usable only for API services like nova-api or glance-api, or is it also possible to use it e.g. in the nova-scheduler service or neutron-metadata-agent service?

varkeen commented 1 month ago

@berendt I am going to look into this.

Also, @mauhau gave me a hint to check whether I am using the correct way to perform e.g. the RabbitMQ check.

Looking into this, too.

varkeen commented 1 month ago

As for the integration into neutron-metadata-agent/nova-scheduler:

I was able to add the healthcheck to the metadata-agent here (this is the output from within the VM):

$ curl -X GET -i http://169.254.169.254/healthcheck ; echo
HTTP/1.1 200 OK
content-type: text/plain; charset=UTF-8
content-length: 165
date: Thu, 18 Jul 2024 07:43:09 GMT

['RabbitMQ instance localhost:5672 is connectable', 'RabbitMQ instance localhost:5672 can receive messages', 'RabbitMQ instance localhost:5672 can deliver messages']

(Note that the RabbitMQ output is currently "faked", as I am still looking into the oslo_messaging package to find out how to connect to RabbitMQ the "OpenStack way".)

The healthcheck here was added via nova's api-paste.ini config (as the metadata-agent is run by nova)

As for the nova-scheduler: as this is part of nova, I do not see why you shouldn't be able to attach a healthcheck to that, too.

But I am still looking into how I can access OpenStack components within the healthcheck plugin itself (e.g. using oslo_messaging, as noted above).
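
One way to get at the messaging layer from inside the plugin is to build the transport from the service's already-loaded oslo.config object, so no extra credentials are needed; a sketch (whether the PoC does exactly this is not shown):

# Sketch: emit a test notification using the host service's own configuration.
# `conf` would be the service's oslo.config object handed to the plugin.
import oslo_messaging

def send_test_notification(conf):
    transport = oslo_messaging.get_notification_transport(conf)
    notifier = oslo_messaging.Notifier(
        transport, publisher_id="healthcheck",
        driver="messaging", topics=["notifications"])
    notifier.info({}, "healthcheck.ping", {"source": "healthcheck"})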

berendt commented 1 month ago

I was able to add the healthcheck to the metadata-agent here (this is the output from within the VM):

$ curl -X GET -i http://169.254.169.254/healthcheck ; echo
HTTP/1.1 200 OK
content-type: text/plain; charset=UTF-8
content-length: 165
date: Thu, 18 Jul 2024 07:43:09 GMT

['RabbitMQ instance localhost:5672 is connectable', 'RabbitMQ instance localhost:5672 can receive messages', 'RabbitMQ instance localhost:5672 can deliver messages']

This SHOULD NOT be possible. This leaks details about the internal status of the cluster to customers.

berendt commented 1 month ago

As for the nova-scheduler: as this is part of nova, I do not see why you shouldn't be able to attach a healthcheck to that, too.

Normally the nova-scheduler has no port bound; I think only the API services do. So at the moment I'm not sure how to reach the healthcheck URL on the nova-scheduler.

varkeen commented 1 month ago

This SHOULD NOT be possible. This leaks details about the internal status of the cluster to customers.

Sorry - I merely wanted to show that /healthcheck is generally possible at neutron-metadata-agent.

So the dummy output of RabbitMQ was just an example, albeit a bad one.

berendt commented 1 month ago

This SHOULD NOT be possible. This leaks details about the internal status of the cluster to customers.

Sorry - I merely wanted to show that /healthcheck is generally possible at neutron-metadata-agent.

So the dummy output of RabbitMQ was just an example, albeit a bad one.

Yes. But it should not be possible to reach the /healthcheck URL via the 169.254.169.254 address from inside a running VM.

varkeen commented 1 month ago

Yes. But it should not be possible to reach the /healthcheck URL via the 169.254.169.254 address from inside a running VM.

Okay - it's probably because in this case it is configured in api-paste.ini without any restriction filters:

############
# Metadata #
############
[composite:metadata]
use = egg:Paste#urlmap
/: meta
/healthcheck: healthcheck

[pipeline:meta]
pipeline = cors http_proxy_to_wsgi metaapp

[app:metaapp]
paste.app_factory = nova.api.metadata.handler:MetadataRequestHandler.factory

[app:healthcheck]
paste.app_factory = oslo_middleware:Healthcheck.app_factory
backends = check_rabbit_mq
check_rabbit_mq_instances = localhost:5672
detailed = False

But that can probably be adjusted by wsgi filters.
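
For illustration, such a restriction could be a small paste filter that only admits requests from configured source addresses (entirely hypothetical, not part of the current setup):

# Hypothetical paste filter: rejects requests that do not originate from an
# allowed source address. Not part of the current PoC.
def filter_factory(global_conf, allowed_sources="127.0.0.1", **local_conf):
    allowed = {addr.strip() for addr in allowed_sources.split(",")}

    def _filter(app):
        def _wrapped(environ, start_response):
            if environ.get("REMOTE_ADDR") not in allowed:
                start_response("403 Forbidden",
                               [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)
        return _wrapped

    return _filter

It would then be declared in a [filter:...] section and placed in front of the healthcheck app in the paste pipeline.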

As I said, I just wanted to show that it's generally possible.

Did not mean to spook you :)

varkeen commented 3 weeks ago

Current status:

I was able to "talk" to the OpenStack messaging bus via oslo_messaging inside the healthcheck (in this example I am receiving notifications from other components on the message bus):

Jul 22 13:03:59 elmore neutron-server[3913055]: INFO oslo_middleware.healthcheck.check_rabbit_mq [-] notification received image.localhost:image.activate
Jul 22 13:03:59 elmore neutron-server[3913055]:     "payload": {
Jul 22 13:03:59 elmore neutron-server[3913055]:         "id": "10e76d2c-84a9-49d3-97d2-73eebe09712d",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "name": "cirros-0.6.2-x86_64-disk",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "status": "active",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "created_at": "2024-07-09T08:49:48Z",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "updated_at": "2024-07-09T08:49:49Z",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "min_disk": 0,
Jul 22 13:03:59 elmore neutron-server[3913055]:         "min_ram": 0,
Jul 22 13:03:59 elmore neutron-server[3913055]:         "protected": false,
Jul 22 13:03:59 elmore neutron-server[3913055]:         "checksum": "c8fc807773e5354afe61636071771906",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "owner": "8cee013a5b0c44cc8719ce40d1c4b3e0",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "disk_format": "qcow2",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "container_format": "bare",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "size": 21430272,
Jul 22 13:03:59 elmore neutron-server[3913055]:         "virtual_size": 117440512,
Jul 22 13:03:59 elmore neutron-server[3913055]:         "is_public": true,
Jul 22 13:03:59 elmore neutron-server[3913055]:         "visibility": "public",
Jul 22 13:03:59 elmore neutron-server[3913055]:         "properties": {
Jul 22 13:03:59 elmore neutron-server[3913055]:             "hw_rng_model": "virtio",
Jul 22 13:03:59 elmore neutron-server[3913055]:             "owner_specified.openstack.md5": "",
Jul 22 13:03:59 elmore neutron-server[3913055]:             "owner_specified.openstack.object": "images/cirros-0.6.2-x86_64-disk",
Jul 22 13:03:59 elmore neutron-server[3913055]:             "owner_specified.openstack.sha256": ""
Jul 22 13:03:59 elmore neutron-server[3913055]:         },
Jul 22 13:03:59 elmore neutron-server[3913055]:         "tags": [],
Jul 22 13:03:59 elmore neutron-server[3913055]:         "deleted": false,
Jul 22 13:03:59 elmore neutron-server[3913055]:         "deleted_at": null,
Jul 22 13:03:59 elmore neutron-server[3913055]:         "os_glance_importing_to_stores": [],
Jul 22 13:03:59 elmore neutron-server[3913055]:         "os_glance_failed_import": []
Jul 22 13:03:59 elmore neutron-server[3913055]:     },
Jul 22 13:03:59 elmore neutron-server[3913055]:     "publisher_id": "image.localhost",
Jul 22 13:03:59 elmore neutron-server[3913055]:     "event_type": "image.upload"
Jul 22 13:03:59 elmore neutron-server[3913055]: }

Now continuing to implement this in a more "useful" way.
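
For reference, receiving notifications like the ones above can be done with an oslo_messaging notification listener; a sketch (endpoint and topic choices are illustrative):

# Sketch: listen for notifications on the 'notifications' topic and log them.
# The endpoint method signature follows the oslo_messaging notification dispatch.
import oslo_messaging

class LoggingEndpoint(object):
    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        print("notification received %s:%s" % (publisher_id, event_type))

def start_listener(conf):
    transport = oslo_messaging.get_notification_transport(conf)
    targets = [oslo_messaging.Target(topic="notifications")]
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [LoggingEndpoint()], executor="threading")
    listener.start()
    return listener  # caller shuts it down via listener.stop(); listener.wait()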

varkeen commented 3 weeks ago

Update:

The plugin now uses oslo_messaging to send a test payload via its own messaging.Target. It also reads the same message from the message queue.

Everything is taken from the configuration context the healthcheck is running in, so no further MQ configuration is required for the plugin itself.

Currently I am cleaning up the still somewhat hacky code, while also decoupling the plugin from the oslo_messaging repository.

After testing the plugin in other environments, I will look into the next step, which is testing DB connections.

varkeen commented 1 week ago

Timeout handling was a bit tricky as starting the oslo_messaging server is blocking (with no obvious way to add a timeout to that call).
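
One pattern for bounding the blocking call is to run it in a worker thread and wait on an event with a timeout in the request path; a sketch (names are illustrative, not necessarily how the PoC does it):

# Sketch: bound a blocking start/receive with a timeout by running it in a
# worker thread and waiting on an Event in the healthcheck request path.
import threading

def wait_for_notification(start_and_receive, timeout=5):
    received = threading.Event()

    def _run():
        try:
            start_and_receive(received)  # blocks; sets the event on success
        except Exception:
            pass  # connection errors simply leave the event unset

    threading.Thread(target=_run, daemon=True).start()
    if received.wait(timeout):
        return "Notification successfully received via messaging"
    return "Notification timed out after %d seconds" % timeout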

But now it seems to work, e.g. if the MQ port is blocked via iptables:

stack@elmore:~/data/venv/lib/python3.12/site-packages$ curl -X GET -i -H "Accept: application/json" http://localhost:9696/networking/healthcheck
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 431
Date: Tue, 06 Aug 2024 09:18:43 GMT

{
    "detailed": false,
    "reasons": [
        [
            {
                "messaging": {
                    "message_sent": {
                        "notification_time": 1722935922436243588
                    },
                    "result": "Notification successfully received via messaging",
                    "transport_url": "rabbit://***:***@192.168.23.233:5672/"
                }
            }
        ]
    ]
}

stack@elmore:~/data/venv/lib/python3.12/site-packages$ sudo iptables -A INPUT -p tcp --destination-port 5672 -j DROP

stack@elmore:~/data/venv/lib/python3.12/site-packages$ curl -X GET -i -H "Accept: application/json" http://localhost:9696/networking/healthcheck
HTTP/1.1 503 Service Unavailable
Content-Type: application/json
Content-Length: 217
Date: Tue, 06 Aug 2024 09:19:04 GMT

{
    "detailed": false,
    "reasons": [
        [
            {
                "messaging": {
                    "result": "Notification timed out after 5 seconds"
                }
            }
        ]
    ]
}

artificial-intelligence commented 1 week ago

mhm, would be nice if we could have a PR linked here @varkeen

berendt commented 1 week ago

mhm, would be nice if we could have a PR linked here @varkeen

At the moment we only have some POC code and sample outputs.

varkeen commented 1 week ago

I am currently working on separating the code from the oslo.middleware repository. I am also testing the check in other environments I have access to.

I guess once that is done, a PR can be made in a repository "somewhere" in osism (probably a new one?).

I will get in touch with @berendt once I am ready to do this.

varkeen commented 4 days ago

The current state of the plugin can be viewed here:

https://github.com/osism/openstack-health-middleware/tree/feature/1046-more_useful_healthchecks

varkeen commented 4 days ago

Here is the PR: https://github.com/osism/openstack-health-middleware/pull/5

I believe some meta files are still missing. Also, I am not sure whether I am using setup.py correctly here (at least a pip install worked on my devstack).

So probably focus more on the plugin itself, thanks :]