Closed piontec closed 4 months ago
If it helps ...
https://zotregistry.dev/latest/articles/monitoring/#metrics
https://grafana.com/grafana/dashboards/20501-zot/ ^ a user/community contribution
After this one is merged, we can start monitoring zot's behaviour, and then come up with some alerts: https://github.com/giantswarm/giantswarm-management-clusters/pull/587
Grafana dashboards can already be found on each cluster, for example on snail: https://grafana.snail.gaws.gigantic.io/d/zot-container-registry/in-cluster-container-registry-zot
TODO:
Currently, it seems like zot's metrics can be rather informative. As I see it, there are two important metrics for us:
Storage usage: this can be gathered from the kubelet metrics, and since we operate at the PVC level, I guess we should take those into account.
Traffic: currently, I'm not sure.
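For the storage part, a rough sketch of how the PVC fill ratio could be queried from the kubelet volume metrics. It only builds the PromQL expression from `kubelet_volume_stats_used_bytes` and `kubelet_volume_stats_capacity_bytes`; the namespace and PVC name in the usage line are made-up placeholders, not the real ones from our clusters.

```python
def pvc_usage_ratio_query(namespace: str, pvc: str) -> str:
    """Build a PromQL expression for the used/capacity ratio of one PVC.

    Relies on the kubelet volume stats metrics and their standard
    `namespace` / `persistentvolumeclaim` labels.
    """
    selector = f'namespace="{namespace}", persistentvolumeclaim="{pvc}"'
    return (
        f"kubelet_volume_stats_used_bytes{{{selector}}}"
        f" / kubelet_volume_stats_capacity_bytes{{{selector}}}"
    )


# Hypothetical namespace and PVC name, for illustration only.
query = pvc_usage_ratio_query("zot", "zot-data")
print(query)
```

An alert could then fire when this ratio stays above some threshold, but which threshold makes sense is something we would still have to work out from real data.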
Since it's really hard to distinguish between use and abuse in our case (we don't know whether zot is used by our clusters, or whether it's used as expected), I think we should maybe get metrics for our current Azure registry usage (price for outgoing traffic), then get metrics for zot's outgoing traffic and compare the pricing. Then at least we will be able to see whether we spend more or not. (@piontec what do you think?)
Outgoing traffic taken directly from zot's metrics doesn't tell us much, because it can be (in the current case) AWS-internal traffic, which is free (if I'm not mistaken), or external traffic that we are charged for. So I guess, again, it's not about zot's metrics but rather about cloud provider metrics.
So now that zot seems to be running and pretty stable, we should be able to gather metrics and understand how to set up alerting.
Since zot being down is not a critical issue for us (it's just a cache; if it's not available, containerd will go to the upstream), I don't think we should schedule it for on-call at all. We should just be aware that it's broken; that would be enough.
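To make the comparison idea concrete, here is a minimal sketch of the arithmetic, assuming we can get outgoing byte counts for both registries and know what share of zot's traffic is actually billable. All names, byte counts and per-GiB prices below are hypothetical placeholders, not real Azure or AWS pricing.

```python
GIB = 1024 ** 3


def egress_cost(bytes_out: float, price_per_gib: float,
                billable_fraction: float = 1.0) -> float:
    """Estimate egress cost for a given amount of outgoing traffic.

    billable_fraction is the share of traffic that leaves the provider
    network and is charged; internal traffic is assumed to be free.
    """
    if not 0.0 <= billable_fraction <= 1.0:
        raise ValueError("billable_fraction must be between 0 and 1")
    return bytes_out / GIB * billable_fraction * price_per_gib


def zot_is_cheaper(upstream_cost: float, zot_cost: float) -> bool:
    """True when running through zot costs less than going upstream."""
    return zot_cost < upstream_cost


# Hypothetical numbers, for illustration only.
upstream = egress_cost(500 * GIB, price_per_gib=0.08)
zot = egress_cost(500 * GIB, price_per_gib=0.09, billable_fraction=0.1)
print(zot_is_cheaper(upstream, zot))
```

The billable fraction is the part we cannot read from zot's own metrics, which is why it would have to come from the cloud provider side.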
Today we verified that our dashboard for Zot on the management cluster shows valid data overall. As this is done, I'm unassigning myself.
The next step is to create alerts, once we are out of the testing phase.
Done via merge of ops recipe and alerts
We have to