openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

Prometheus Exporter: Provide metrics for Mayastor node status (online, cordoned, etc) #1698

Open adamcharnock opened 1 month ago

adamcharnock commented 1 month ago

Is your feature request related to a problem? Please describe.

I just spent quite a long time trying to debug an issue, only to find that the cause was that I had left two Mayastor nodes cordoned (oops).

Describe the solution you'd like

It would be great to expose this as a metric so I could alert against it. For example, maybe mayastor_node_status, similar to disk_pool_status.

Describe alternatives you've considered

Tattooing "don't leave Mayastor nodes cordoned" on my forehead

Additional context

There are three entities that need to be happy for a replica to be scheduled on a pool:

  1. The k8s node should be ready and not cordoned
  2. The mayastor node should be ready and not cordoned
  3. The disk pool should be online

Currently I can alert against 1 & 3, but not 2.
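As a rough illustration of the three conditions above, here is a minimal Python sketch that reports which of them would block a replica from being scheduled on a pool. The `PoolContext` type, its field names, and `blocking_reasons` are all hypothetical names made up for this example; they are not part of Mayastor or its exporter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolContext:
    """Hypothetical snapshot of the three entities behind one disk pool."""
    k8s_node_ready: bool
    k8s_node_cordoned: bool
    mayastor_node_ready: bool
    mayastor_node_cordoned: bool
    pool_online: bool


def blocking_reasons(ctx: PoolContext) -> List[str]:
    """Return the conditions that would stop a replica landing on this pool."""
    reasons = []
    # 1. The k8s node should be ready and not cordoned.
    if not ctx.k8s_node_ready or ctx.k8s_node_cordoned:
        reasons.append("k8s node not ready or cordoned")
    # 2. The mayastor node should be ready and not cordoned.
    if not ctx.mayastor_node_ready or ctx.mayastor_node_cordoned:
        reasons.append("mayastor node not ready or cordoned")
    # 3. The disk pool should be online.
    if not ctx.pool_online:
        reasons.append("disk pool not online")
    return reasons


if __name__ == "__main__":
    # The situation from this issue: everything healthy except a cordoned mayastor node.
    ctx = PoolContext(True, False, True, True, True)
    print(blocking_reasons(ctx))
```

An alert built on a node status metric would cover the second check, which is the one that cannot be observed today.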

tiagolobocastro commented 1 month ago

This seems a very reasonable ask. I'd probably add to this, and say we probably can also expose upgrade status? So you'd be able to tell via prometheus if upgrade was complete or not. (Assuming you are upgrading with the plugin's help?)

Other than this, could there be anything else we can do for folks who are not using prometheus?

adamcharnock commented 1 month ago

I'd probably add to this, and say we probably can also expose upgrade status? So you'd be able to tell via prometheus if upgrade was complete or not.

Certainly fine by me, although I already have some alerting for that via monitoring of DaemonSet rollout status. So no strong views from me on this.

Other than this, could there be anything else we can do for folks who are not using prometheus?

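I guess they'll either be using some other kind of metrics gathering system, or using the CLI. If we are talking about the latter then I would suggest:

* `kubectl mayastor get pools` – Could indicate if the pool's node has been cordoned (even if the pool still shows as `Online`)
* `kubectl mayastor get volumes` – I'd love to see more health visibility here. Perhaps the `Replicas` column could show `{total_online}/{total_desired}`. (I can also see that `volume-replica-topologies` provides more details here which is great)
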
Additional: It would be really nice if the volume/replica metrics included labels for the PV's spec.claimRef.name and spec.claimRef.namespace fields (maybe pvc_name, pvc_namespace). Having the PVC name and namespace available would make it much easier to recognise particular volumes/replicas in dashboards & alerts, and also make per-namespace reporting possible. Happy to open another issue for this if you like.

tiagolobocastro commented 1 month ago

* `kubectl mayastor get pools` – Could indicate if the pool's node has been cordoned (even if the pool still shows as `Online`)

Another great suggestion, thanks!

* `kubectl mayastor get volumes` – I'd love to see more health visibility here. Perhaps the `Replicas` column could show `{total_online}/{total_desired}`. (I can also see that `volume-replica-topologies` provides more details here which is great)

And another one, indeed this would be neat :)

Additional: It would be really nice if the volume/replica metrics included labels for the PV's spec.claimRef.name and spec.claimRef.namespace fields (maybe pvc_name, pvc_namespace). Having the PVC name and namespace available would make it much easier to recognise particular volumes/replicas in dashboards & alerts, and also make per-namespace reporting possible.

Ah, this one wouldn't be straightforward tbh. Today we don't store any PVC information at all. Also, the export of IO metrics is done from the data-plane itself, which would have no knowledge of PVC information either. Not to say this can't be done, but it would be a much larger change. If we were to export PVC and mayastor volume "linkage information", would it be possible to somehow stitch this up to existing metrics?
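To make the "stitching" idea concrete, here is a minimal Python sketch of joining separately exported linkage information onto existing IO samples by a shared volume identifier. Everything in it is an assumption for illustration: the `volume_id` label name, the sample layout, and the `attach_pvc_labels` helper are hypothetical and do not describe what the exporter emits today.

```python
from typing import Dict, List, Tuple

# A sample is (labels, value); linkage maps a volume id to (pvc_name, pvc_namespace).
Sample = Tuple[Dict[str, str], float]
Linkage = Dict[str, Tuple[str, str]]


def attach_pvc_labels(io_samples: List[Sample], linkage: Linkage) -> List[Sample]:
    """Copy PVC labels onto each IO sample whose volume id has a linkage entry."""
    stitched = []
    for labels, value in io_samples:
        merged = dict(labels)  # copy so the input samples are left untouched
        pvc = linkage.get(labels.get("volume_id", ""))
        if pvc is not None:
            merged["pvc_name"], merged["pvc_namespace"] = pvc
        stitched.append((merged, value))
    return stitched


if __name__ == "__main__":
    io_samples = [({"volume_id": "vol-1"}, 120.0), ({"volume_id": "vol-2"}, 7.0)]
    linkage = {"vol-1": ("data-postgres-0", "databases")}
    for labels, value in attach_pvc_labels(io_samples, linkage):
        print(labels, value)
```

On the Prometheus side the same join is usually expressed at query time against an info-style metric, so the data-plane would not need to know about PVCs; samples without a linkage entry simply keep their original labels.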

Happy to open another issue for this if you like.

That would be great, thanks again

adamcharnock commented 1 month ago

Great! And I've opened #1702 - "Prometheus Exporter: Include labels for PVC name and namespace in exported metrics"