Open adamcharnock opened 1 month ago
This seems a very reasonable ask. I'd probably add to this, and say we probably can also expose upgrade status? So you'd be able to tell via prometheus if upgrade was complete or not. (Assuming you are upgrading with the plugin's help?)
Other than this, could there be anything else we can do for folks who are not using prometheus?
> I'd probably add to this, and say we probably can also expose upgrade status? So you'd be able to tell via prometheus if upgrade was complete or not.
Certainly fine by me, although I already have some alerting for that via monitoring of DaemonSet rollout status. So no strong views from me on this.
> Other than this, could there be anything else we can do for folks who are not using prometheus?
I guess they'll either be using some other kind of metrics gathering system, or using the CLI. If we are talking about the latter then I would suggest:
* `kubectl mayastor get pools` – Could indicate if the pool's node has been cordoned (even if the pool still shows as `Online`)
* `kubectl mayastor get volumes` – I'd love to see more health visibility here. Perhaps the `Replicas` column could show `{total_online}/{total_desired}`. (I can also see that `volume-replica-topologies` provides more details here, which is great)

Additional: It would be really nice if the volume/replica metrics included labels for the PV's `spec.claimRef.name` and `spec.claimRef.namespace` fields (maybe `pvc_name`, `pvc_namespace`). Having the PVC name and namespace available would make it much easier to recognise particular volumes/replicas in dashboards & alerts, and would also make per-namespace reporting possible. Happy to open another issue for this if you like.
> * `kubectl mayastor get pools` – Could indicate if the pool's node has been cordoned (even if the pool still shows as `Online`)
Another great suggestion, thanks!
> * `kubectl mayastor get volumes` – I'd love to see more health visibility here. Perhaps the `Replicas` column could show `{total_online}/{total_desired}`. (I can also see that `volume-replica-topologies` provides more details here which is great)
And another one, indeed this would be neat :)
> Additional: It would be really nice if the volume/replica metrics included labels for the PV's `spec.claimRef.name` and `spec.claimRef.namespace` fields (maybe `pvc_name`, `pvc_namespace`). Having the PVC name and namespace available would make it much easier to recognise particular volumes/replicas in dashboards & alerts, and would also make per-namespace reporting possible.
Ah, this one wouldn't be straightforward tbh. Today we don't store any PVC information at all. Also, the export of I/O metrics is done from the data-plane itself, which would have no knowledge of PVC information either. Not to say this can't be done, but it would be a much larger change. If we were to export PVC and mayastor volume "linkage information", would it be possible to somehow stitch this up to existing metrics?
> Happy to open another issue for this if you like.
That would be great, thanks again
Great! And I've opened #1702 - "Prometheus Exporter: Include labels for PVC name and namespace in exported metrics"
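
On the question above about stitching exported "linkage information" onto existing metrics: in Prometheus this kind of enrichment is usually done at query time, by joining an info-style series onto the data series with `group_left`. The following is only a rough Python sketch of that join logic, not anything Mayastor does today. The `pv_name` join label, the sample shape, and the `attach_pvc_labels` helper are all assumptions made for illustration; `pvc_name` and `pvc_namespace` are the label names suggested earlier in this thread.

```python
from typing import Dict, List, Tuple

# A sample is (labels, value). This shape is an assumption for illustration only.
Sample = Tuple[Dict[str, str], float]


def attach_pvc_labels(
    samples: List[Sample],
    linkage: Dict[str, Tuple[str, str]],
    join_label: str = "pv_name",  # hypothetical label shared by both sides
) -> List[Sample]:
    """Copy PVC name/namespace onto each sample whose join label has a linkage entry."""
    enriched: List[Sample] = []
    for labels, value in samples:
        merged = dict(labels)  # do not mutate the caller's label set
        claim = linkage.get(labels.get(join_label, ""))
        if claim is not None:
            merged["pvc_name"], merged["pvc_namespace"] = claim
        enriched.append((merged, value))
    return enriched


# Hypothetical data: one volume with a known claim, one without.
volume_samples: List[Sample] = [
    ({"pv_name": "pvc-1111"}, 42.0),
    ({"pv_name": "pvc-2222"}, 7.0),
]
linkage_info = {"pvc-1111": ("postgres-data", "databases")}

enriched_samples = attach_pvc_labels(volume_samples, linkage_info)
for labels, value in enriched_samples:
    print(labels, value)
```

Samples without a linkage entry pass through unchanged, so the data-plane exporter would not need to know anything about PVCs; only the component that exports the linkage series would.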
Is your feature request related to a problem? Please describe.
I just spent quite a long time trying to debug an issue, only to find that the cause was that I had left two Mayastor nodes cordoned (oops).
Describe the solution you'd like
It would be great to expose this as a metric so I could alert against it. For example, maybe `mayastor_node_status`, similar to `disk_pool_status`.
Describe alternatives you've considered
Tattooing "don't leave Mayastor nodes cordoned" on my forehead
Additional context
There are three entities that need to be happy for a replica to be scheduled on a pool:
Currently I can alert against 1 & 3, but not 2.
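
As a stopgap until something like this exists, the cordon state can be derived outside Mayastor, because `kubectl cordon` sets `spec.unschedulable: true` on the Node object. Below is a minimal sketch, assuming the input is the `items` list from `kubectl get nodes -o json` and that storage nodes carry the `openebs.io/engine=mayastor` label (adjust the selector if your cluster differs). The `node_status_lines` helper and the `1 = schedulable, 0 = cordoned` encoding are hypothetical; `mayastor_node_status` is just the name floated above, not an existing metric.

```python
from typing import Dict, Iterable, List


def node_status_lines(
    nodes: Iterable[Dict],
    label_key: str = "openebs.io/engine",   # assumed storage-node label key
    label_value: str = "mayastor",          # assumed storage-node label value
    metric: str = "mayastor_node_status",   # name suggested in this issue, not a real metric
) -> List[str]:
    """Render one gauge sample per matching node: 1 = schedulable, 0 = cordoned."""
    lines: List[str] = []
    for node in nodes:
        meta = node.get("metadata", {})
        node_labels = meta.get("labels") or {}
        if node_labels.get(label_key) != label_value:
            continue  # skip nodes outside the assumed selector
        # The field is simply absent on nodes that have never been cordoned.
        cordoned = bool(node.get("spec", {}).get("unschedulable", False))
        name = meta.get("name", "")
        value = 0 if cordoned else 1
        lines.append(f'{metric}{{node="{name}"}} {value}')
    return lines


# Hypothetical node objects, trimmed to the fields the sketch reads.
example_nodes = [
    {"metadata": {"name": "node-a", "labels": {"openebs.io/engine": "mayastor"}},
     "spec": {"unschedulable": True}},
    {"metadata": {"name": "node-b", "labels": {"openebs.io/engine": "mayastor"}},
     "spec": {}},
    {"metadata": {"name": "node-c", "labels": {}}, "spec": {}},
]

print("\n".join(node_status_lines(example_nodes)))
```

For the sample data this emits a `0` sample for the cordoned `node-a`, a `1` sample for `node-b`, and nothing for the unlabelled `node-c`. The output could be fed to anything that accepts the Prometheus text format, and an alert on the value being `0` for some duration would cover the gap described above.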