metal3-io / baremetal-operator

Bare metal host provisioning integration for Kubernetes
Apache License 2.0

Ironic health checks do not check against useful requests #1528

Open dankingtech opened 7 months ago

dankingtech commented 7 months ago

What steps did you take and what happened:

By any method, cause Ironic's connection to its database to fail. Even though any request that does actual work will fail, the deployment is still reported as live and ready, because the health checks only verify that the base URL responds, which it can do even when the internal connections are down. For instance, a request to http://127.0.0.1:6385/v1/nodes/ or another such endpoint may return an error, but Kubernetes never learns about it.
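For reference, the probe shape being described is roughly the following (a sketch of the pattern, not the repository's actual manifests; the port is taken from the URL above):

```yaml
livenessProbe:
  httpGet:
    path: /        # only confirms that the API process answers HTTP
    port: 6385
readinessProbe:
  httpGet:
    path: /        # still succeeds when the database is unreachable
    port: 6385
```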

What did you expect to happen:

Ideally, the liveness probe should check that at least one endpoint that depends on the database connection responds successfully, and report the Ironic instance as unhealthy when it does not.

Anything else you would like to add:

Unfortunately, I have noticed various occasions when Ironic, for various reasons, fails to connect to the database. In the past I have seen this caused by the database itself having issues, as well as by issues with the running Ironic API instance itself; in most cases, simply restarting Ironic resolved the problem. Regardless, if the backend is unavailable, Ironic serves little utility. Therefore, I recommend changing the livenessProbe to check /v1/nodes/ rather than just /, as sketched below. The same may be true of the Inspector as well, by probing /v1/rules.
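For illustration, the change could look roughly like this (container names, the Inspector port, and the probe timings are assumptions for the sketch, not values taken from the actual manifests):

```yaml
containers:
- name: ironic-api              # name is illustrative
  livenessProbe:
    httpGet:
      path: /v1/nodes/          # fails when the database is unreachable
      port: 6385
    periodSeconds: 10
    failureThreshold: 3
- name: ironic-inspector        # name is illustrative
  livenessProbe:
    httpGet:
      path: /v1/rules           # exercises the Inspector's own database connection
      port: 5050                # assumed Inspector port
    periodSeconds: 10
    failureThreshold: 3
```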

Environment:

/kind bug

dankingtech commented 7 months ago

Actually, it looks like /v1/conductors/ would be a better endpoint than /v1/nodes/, as it usually requires less work than the latter; see the adjusted sketch below.
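Relative to the sketch above, only the probe path changes (same assumptions as before):

```yaml
livenessProbe:
  httpGet:
    path: /v1/conductors/   # typically cheaper than listing nodes
    port: 6385
```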

dtantsur commented 7 months ago

To repeat what I mentioned on IRC: any meaningful endpoint will require authentication, so we need to make sure the healthcheck script can use it. One possible shape is sketched below.
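Purely as a sketch, an exec probe could read credentials from a mounted secret; the mount path /etc/ironic/healthcheck-auth and the use of HTTP basic auth are assumptions for illustration, not the project's actual layout:

```yaml
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # -f makes curl exit non-zero on HTTP errors, which fails the probe;
    # the credentials file ("user:password") comes from a mounted secret.
    - curl -sSf -u "$(cat /etc/ironic/healthcheck-auth)" http://127.0.0.1:6385/v1/conductors/
  periodSeconds: 10
  timeoutSeconds: 5
```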

metal3-io-bot commented 4 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

dtantsur commented 3 months ago

/remove-lifecycle stale
/triage accepted

With https://github.com/metal3-io-bot/ironic-image/commit/e44c4f731bdfb37e7cd3e2c239c57f43d58c51fb, we now have a path forward. We also need to finish the discussion around https://github.com/metal3-io/baremetal-operator/pull/1685 since it affects how we get the credentials.

dtantsur commented 3 months ago

/kind feature
/help

metal3-io-bot commented 3 months ago

@dtantsur: This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/metal3-io/baremetal-operator/issues/1528):

> /kind feature
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

metal3-io-bot commented 4 weeks ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

tuminoid commented 4 weeks ago

/remove-lifecycle stale