siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

cluster health check: Ceph/Rook health #5557

Closed: smira closed this 2 months ago

smira commented 2 years ago

Look for Rook cluster CRDs (we can use the Kubernetes dynamic client), check the .status field for readiness, and display the failure reason, if any.
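
A minimal sketch of the status check described above, not the actual Talos implementation. It assumes the objects were already listed with the Kubernetes dynamic client and are passed in their unstructured form (`map[string]interface{}`), so only the standard library is needed. The field names `.status.phase`, `.status.message` and `.status.ceph.health`, the values `Ready` and `HEALTH_OK`, and the helper names `checkCephCluster` and `checkCephClusters` are assumptions made for illustration; verify them against the Rook version in use.

```go
package main

import "fmt"

// checkCephCluster inspects one cluster object in unstructured form.
// Assumed layout (not confirmed by this issue): status.phase,
// status.message, status.ceph.health.
func checkCephCluster(obj map[string]interface{}) error {
	status, ok := obj["status"].(map[string]interface{})
	if !ok {
		return fmt.Errorf("status is not reported yet")
	}

	phase, _ := status["phase"].(string)
	message, _ := status["message"].(string)
	if phase != "Ready" {
		// Surface the failure reason, if the operator reported one.
		return fmt.Errorf("phase is %q: %s", phase, message)
	}

	// Reading from a nil map is safe in Go, so a missing "ceph" block
	// simply yields an empty health string.
	ceph, _ := status["ceph"].(map[string]interface{})
	health, _ := ceph["health"].(string)
	if health != "HEALTH_OK" {
		return fmt.Errorf("ceph health is %q", health)
	}

	return nil
}

// checkCephClusters succeeds on an empty list, so a cluster without
// Rook/Ceph passes the check.
func checkCephClusters(items []map[string]interface{}) error {
	for _, item := range items {
		metadata, _ := item["metadata"].(map[string]interface{})
		name, _ := metadata["name"].(string)
		if err := checkCephCluster(item); err != nil {
			return fmt.Errorf("ceph cluster %q: %w", name, err)
		}
	}
	return nil
}

func main() {
	// Hypothetical object, shaped like what a dynamic client list might return.
	clusters := []map[string]interface{}{
		{
			"metadata": map[string]interface{}{"name": "rook-ceph"},
			"status": map[string]interface{}{
				"phase": "Ready",
				"ceph":  map[string]interface{}{"health": "HEALTH_WARN"},
			},
		},
	}

	if err := checkCephClusters(clusters); err != nil {
		fmt.Println("unhealthy:", err)
		return
	}
	fmt.Println("healthy")
}
```

Whether a warning state should fail the check is a design choice; this sketch takes the strict option and accepts only `HEALTH_OK`.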

smira commented 2 years ago

See also https://github.com/siderolabs/capi-utils/blob/master/pkg/capi/check.go#L32

smira commented 2 years ago

(Test it by deploying a Rook cluster to QEMU with --extra-disks.)

flokli commented 2 years ago

As much as this might be useful, Talos doesn't come with rook-ceph pre-installed, so I'm not sure adding something like that would be a good idea.

smira commented 2 years ago

If Rook/Ceph is not installed, the check will succeed.

flokli commented 2 years ago

Yeah, but where do you draw the line? #5556 and #5555 are for things that are managed by Talos itself (lifecycle.talos.dev/healthCheck), but this is an entirely separate piece of software. Do you want to include custom healthchecks for all sorts of other software? For example, checking whether the Ingress controller of choice, cert-manager, and external-dns are properly configured, too?

I personally think the healthcheck should only cover things managed by Talos itself, and "determining the state of the rest of the applications deployed in the cluster" is a task that should be left to things like Monitoring/Alerting.

See also this set of alerting rules. There are some alerts for rook-ceph too: https://monitoring.mixins.dev/ceph/

In the field of "ad hoc healthchecks", https://github.com/emirozer/kubectl-doctor is also worth mentioning.

smira commented 2 years ago

Rook/Ceph is the recommended storage solution for Talos, and it is hard to manage during cluster upgrades, so this makes sense. Other storage solutions might use lifecycle labels or other custom health checks.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 2 months ago

This issue was closed because it has been stalled for 7 days with no activity.