Prepare monitoring and alerting for `zot`

piontec commented 11 months ago

We have to

[x] check which metrics are exposed, how they change, and whether we think they are sufficient
[ ] define alerting rules
[x] find/create grafana dashboards

rchincha commented 8 months ago

If it helps ...

https://zotregistry.dev/latest/articles/monitoring/#metrics

https://grafana.com/grafana/dashboards/20501-zot/ ^ a user/community contribution

allanger commented 6 months ago

After this one is merged, we can start monitoring zot's behaviour, and then come up with some alerts: https://github.com/giantswarm/giantswarm-management-clusters/pull/587

Grafana dashboards can already be found on each cluster, for example on snail: https://grafana.snail.gaws.gigantic.io/d/zot-container-registry/in-cluster-container-registry-zot

allanger commented 6 months ago

TODO:

[ ] Fix the storage panel in the grafana dashboard

allanger commented 6 months ago

Useful metrics

Currently, it seems like zot's metrics can be rather informative. As I see it, there are two important metrics for us:

Storage usage
Amount of outgoing traffic

Storage usage can be gathered from the kubelet metrics, and since we operate at the PVC level, I think we should take those into account.
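
A minimal sketch of what such a PVC-level check could look like, in Python. `kubelet_volume_stats_used_bytes` and `kubelet_volume_stats_capacity_bytes` are the standard kubelet volume metrics; the PVC name, the namespace, and the 80% threshold are assumptions for illustration and do not come from this thread.

```python
def pvc_usage_query(pvc: str, namespace: str) -> str:
    """Build a PromQL expression for the used fraction of one PVC."""
    selector = f'persistentvolumeclaim="{pvc}",namespace="{namespace}"'
    return (
        f"kubelet_volume_stats_used_bytes{{{selector}}}"
        f" / kubelet_volume_stats_capacity_bytes{{{selector}}}"
    )


def pvc_usage_ratio(used_bytes: float, capacity_bytes: float) -> float:
    """Return the used fraction of a volume (0.0 to 1.0)."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return used_bytes / capacity_bytes


def is_nearly_full(used_bytes: float, capacity_bytes: float,
                   threshold: float = 0.8) -> bool:
    """True when usage reaches the (assumed) threshold."""
    return pvc_usage_ratio(used_bytes, capacity_bytes) >= threshold


if __name__ == "__main__":
    # Hypothetical PVC name and namespace, for illustration only.
    print(pvc_usage_query("zot", "zot"))
    print(is_nearly_full(90 * 1024**3, 100 * 1024**3))
```

The query string could be pasted into Grafana or a Prometheus rule; the ratio helpers only show the arithmetic behind it.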

As for traffic, I'm currently not sure.

[ ] Figure out traffic metrics

Since it's really hard to distinguish between use and abuse in our case (we don't know whether zot is used by our clusters, or whether it's used as expected), I think we should get metrics for our current Azure registry usage (the price of outgoing traffic), then get metrics for zot's outgoing traffic, and compare the pricing. Then at least we will be able to see whether we are spending more or not. (@piontec what do you think?)

Outgoing traffic taken directly from zot's metrics doesn't mean a lot, because it can be (in the current case) AWS-internal traffic that is free (if I'm not mistaken), or external traffic that we are charged for. So I guess, again, it's not about zot's metrics, but rather about cloud provider metrics.
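
The comparison above could be sketched roughly like this in Python. All function names, volumes, and prices here are hypothetical placeholders; real numbers would have to come from the cloud providers' billing data, since zot's own metrics do not split traffic into internal and external.

```python
def billed_egress_cost(total_gb: float, internal_gb: float,
                       price_per_gb: float) -> float:
    """Cost of outgoing traffic, assuming provider-internal traffic is free."""
    if internal_gb < 0 or internal_gb > total_gb:
        raise ValueError("internal_gb must be between 0 and total_gb")
    external_gb = total_gb - internal_gb
    return external_gb * price_per_gb


def cost_delta(zot_cost: float, upstream_cost: float) -> float:
    """Positive result: zot costs more than the upstream registry."""
    return zot_cost - upstream_cost


if __name__ == "__main__":
    # Placeholder figures for illustration, not measured values.
    zot = billed_egress_cost(total_gb=100.0, internal_gb=75.0, price_per_gb=0.5)
    upstream = billed_egress_cost(total_gb=100.0, internal_gb=0.0, price_per_gb=0.5)
    print(cost_delta(zot, upstream))
```

The point of the sketch is only that the billed amount depends on the external share of the traffic, which is why the raw outgoing-traffic counter alone is not enough.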

Now that zot seems to be running and is pretty stable, we should be able to gather metrics and work out how to set up alerting.

Since zot being down is not a critical issue for us (it's just a cache; if it's not available, containerd will go to the upstream), I don't think we should page on-call for it at all. Being aware that it's broken would be enough.

marians commented 5 months ago

Today we verified that our dashboard for Zot on the management cluster shows valid data overall. As this is done, I'm unassigning myself.

The next step is to create alerts, once we are out of the testing phase.

mproffitt commented 4 months ago

Done via merge of ops recipe and alerts

giantswarm / roadmap

Prepare monitoring and alerting for `zot` #3068