Health Checks for OpenStack

akrzos commented 6 years ago

We should implement a few health checks on the OpenStack cluster before moving to the next step in the pipeline, to ensure we have a functional overcloud.

At a minimum, I would suggest we check:

Token issue - Get a valid token from OpenStack Keystone
Ceph status - HEALTH_OK
Upload Test Image into Glance
Create Network/Subnet/Floating IP/Security Group/Keypair
Boot instance, associate floating IP, ping instance, SSH into instance and execute true or uptime, etc.
Clean up above and then hand off
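The first two checks in the list above could be sketched roughly as follows. This is only a minimal sketch, assuming the `openstack` and `ceph` CLIs are available (with credentials already sourced) on whichever host runs the check. The helper names `run_cli`, `token_ok`, and `ceph_ok` are hypothetical and not part of the existing automation.

```python
import subprocess
from typing import Callable, List, Tuple

# A runner takes an argv list and returns (exit_code, stdout).
Runner = Callable[[List[str]], Tuple[int, str]]


def run_cli(cmd: List[str]) -> Tuple[int, str]:
    """Run a real CLI command; a missing binary or a timeout counts as failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return 1, ""
    return proc.returncode, proc.stdout


def token_ok(run: Runner = run_cli) -> bool:
    """Keystone check: `openstack token issue` must return a non-empty token id."""
    rc, out = run(["openstack", "token", "issue", "-f", "value", "-c", "id"])
    return rc == 0 and bool(out.strip())


def ceph_ok(run: Runner = run_cli) -> bool:
    """Ceph check: `ceph health` must report HEALTH_OK."""
    rc, out = run(["ceph", "health"])
    return rc == 0 and out.strip().startswith("HEALTH_OK")
```

Calling `token_ok()` and `ceph_ok()` with no arguments runs the real commands. Passing a fake runner makes it possible to exercise the logic without a cluster.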

Does anyone else have any thoughts or health check ideas?

mbruzek commented 6 years ago

I am +1 for some minimum checks. The first 2 points look like great things to add.

Upload Test Image into Glance

The first thing the OpenShift automation does is find the latest OCP image, copy it, and upload it to Glance.

Create Network/Subnet/Floating IP/Security Group/Keypair

The second thing the OpenShift automation does is these same steps to create the "ansible-host" VM.

We should focus on adding a verification role after the install step has completed. Then we could put in tasks like this. It would be wasteful to repeat tasks that would "fail fast" in the next step of the automation. I have not seen a lot of problems getting images to upload or creating the first VM (not saying it couldn't happen). The trouble seems to be in getting to that point, such as OpenStack install problems, or in OpenShift install problems (since we are automating against a moving codebase).

jeremyeder commented 6 years ago

Is there already an OSP QE test suite that we could take whole-sale rather than writing our own?

akrzos commented 6 years ago

Is there already an OSP QE test suite that we could take whole-sale rather than writing our own?

Yes, the one I am aware of is called Tempest, but I am not sure whether it can be run against a large cluster or is only meant to run against something like DevStack. I think that would be overkill, but I would have to investigate further to come to that conclusion.

The idea was that since we use official GA-ed puddles, we wouldn't necessarily need to run a full QE test suite, but rather just a sanity/health check with a few OpenStack commands that validates instance booting and connectivity. This is mostly because deploying OpenStack largely comes down to setting every configuration item correctly, and it is easy to make mistakes there.

Most of the commands are actually already run at some point in the automation. Using a very small image, we should be able to accomplish this with only about 2-5 minutes added to this job. I figured it would be easier to have a stop-gap prior to the next job in the pipeline, so the build can be failed and get attention before it gets any further.
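A stop-gap like this could look roughly like the sketch below: run each check in order, stop at the first failure, and clean up whatever was created, in reverse order, before handing off. The resource names (`hc-net`, `hc-key`) and the helpers `run_cli` and `run_gate` are hypothetical, and only a few of the proposed checks are shown.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# (name, check command, cleanup command or None)
Step = Tuple[str, List[str], Optional[List[str]]]

# Hypothetical resource names; the commands are standard `openstack` CLI calls.
STEPS: List[Step] = [
    ("token", ["openstack", "token", "issue"], None),
    ("network", ["openstack", "network", "create", "hc-net"],
     ["openstack", "network", "delete", "hc-net"]),
    ("keypair", ["openstack", "keypair", "create", "hc-key"],
     ["openstack", "keypair", "delete", "hc-key"]),
]


def run_cli(cmd: List[str]) -> int:
    """Run a command and return its exit code; missing binary or timeout is a failure."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=300).returncode
    except (OSError, subprocess.TimeoutExpired):
        return 1


def run_gate(steps: List[Step],
             run: Callable[[List[str]], int] = run_cli) -> Tuple[bool, List[str]]:
    """Run steps in order, stop at the first failure, then clean up in reverse."""
    log: List[str] = []
    cleanups: List[Tuple[str, List[str]]] = []
    passed = True
    for name, cmd, cleanup in steps:
        if run(cmd) != 0:
            log.append(f"FAIL {name}")
            passed = False
            break
        log.append(f"PASS {name}")
        if cleanup is not None:
            cleanups.append((name, cleanup))
    # Only resources whose creation succeeded are cleaned up.
    for name, cleanup in reversed(cleanups):
        status = "CLEANED" if run(cleanup) == 0 else "CLEANUP-FAILED"
        log.append(f"{status} {name}")
    return passed, log
```

A pipeline job could call `run_gate(STEPS)` and exit non-zero when the first element of the result is `False`. In this sketch a failed cleanup is only logged and does not fail the gate, which is a design choice that may not suit every pipeline.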

redhat-performance / scale-ci-tripleo

Health Checks for OpenStack #39