python-discord / infra

Infrastructure for Python Discord
https://docs.pydis.wtf
MIT License

Automatic deployment of Prometheus alerts #266

Open jchristgit opened 5 months ago

jchristgit commented 5 months ago

Right now, changes to our Prometheus alerts need to be deployed manually.

We should add a deployment step for this to GitHub Actions on the main branch, so that any changes are rolled out automatically without anyone needing to know the local setup.

jchristgit commented 5 months ago

@jb3 Do you have an idea for how best to do this? Right now I'm not even sure how to deploy alerts to Prometheus in Kubernetes in the first place. I think for the documentation I will make a separate issue though.

jb3 commented 5 months ago

Note to self: we can set the ConfigMap preferences to always query the apiserver for the latest changes, which would eliminate the propagation delay of changes.

jb3 commented 5 months ago

I lied: this is a kubelet option, and we cannot set it per ConfigMap. We will have to do some smart in-pod detection at Prometheus to confirm that the reload has gone through.

There is, however, always a timestamp in the mounted directory; we just need to keep checking it (probably with a recurring kubectl exec).

jb3 commented 5 months ago

/prometheus $ ls -la /opt/pydis/prometheus/alerts.d/
total 12
drwxrwsrwx    3 root     2000          4096 Apr 30 19:24 .
drwxr-xr-x    3 root     root          4096 Apr 26 21:41 ..
drwxr-sr-x    2 root     2000          4096 Apr 30 19:24 ..2024_04_30_19_24_46.1524242850
lrwxrwxrwx    1 root     2000            32 Apr 30 19:24 ..data -> ..2024_04_30_19_24_46.1524242850
lrwxrwxrwx    1 root     2000            24 Apr 26 21:39 alertmanager.yaml -> ..data/alertmanager.yaml
lrwxrwxrwx    1 root     2000            24 Apr 26 21:39 certificates.yaml -> ..data/certificates.yaml
lrwxrwxrwx    1 root     2000            19 Apr 26 21:39 coredns.yaml -> ..data/coredns.yaml
lrwxrwxrwx    1 root     2000            15 Apr 26 21:39 cpu.yaml -> ..data/cpu.yaml
lrwxrwxrwx    1 root     2000            18 Apr 26 21:39 django.yaml -> ..data/django.yaml
lrwxrwxrwx    1 root     2000            16 Apr 26 21:39 etcd.yaml -> ..data/etcd.yaml
lrwxrwxrwx    1 root     2000            16 Apr 26 21:39 jobs.yaml -> ..data/jobs.yaml
lrwxrwxrwx    1 root     2000            18 Apr 26 21:39 memory.yaml -> ..data/memory.yaml
lrwxrwxrwx    1 root     2000            17 Apr 26 21:39 nginx.yaml -> ..data/nginx.yaml
lrwxrwxrwx    1 root     2000            17 Apr 26 21:39 nodes.yaml -> ..data/nodes.yaml
lrwxrwxrwx    1 root     2000            16 Apr 26 21:39 pods.yaml -> ..data/pods.yaml
lrwxrwxrwx    1 root     2000            20 Apr 26 21:39 postgres.yaml -> ..data/postgres.yaml
lrwxrwxrwx    1 root     2000            22 Apr 26 21:39 prometheus.yaml -> ..data/prometheus.yaml
lrwxrwxrwx    1 root     2000            17 Apr 26 21:39 redis.yaml -> ..data/redis.yaml
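
The directory that `..data` points at appears to encode the time of the last projection in its name (here `2024_04_30_19_24_46`), so a deploy job could compare it with the time at which it applied the ConfigMap. A minimal Python sketch, assuming the name format seen in the listing above holds in general; `parse_projection_time` and `projected_since` are hypothetical helper names, and treating the timestamp as UTC is an assumption about the node's clock:

```python
import re
from datetime import datetime, timezone

# Matches names such as "..2024_04_30_19_24_46.1524242850" from the listing above.
_NAME = re.compile(r"^\.\.(\d{4}_\d{2}_\d{2}_\d{2}_\d{2}_\d{2})\.\d+$")


def parse_projection_time(name: str) -> datetime:
    """Return the timestamp encoded in a projected-volume directory name."""
    match = _NAME.match(name.strip().rstrip("/"))
    if match is None:
        raise ValueError(f"unexpected directory name: {name!r}")
    parsed = datetime.strptime(match.group(1), "%Y_%m_%d_%H_%M_%S")
    # Assumption: the kubelet writes this name using a UTC clock.
    return parsed.replace(tzinfo=timezone.utc)


def projected_since(name: str, applied_at: datetime) -> bool:
    """True if the volume was re-projected at or after `applied_at`."""
    return parse_projection_time(name) >= applied_at
```

The name itself would come from something like a `readlink` on `..data`, run in the pod through `kubectl exec`. The check only has one-second resolution, so a job would probably want to poll until `projected_since` returns true rather than test once.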

jb3 commented 5 months ago

Another related issue, for a potential future feature: kubernetes/kubernetes#22368 (open for 7 years though, yikes!)

shtlrs commented 3 months ago

Can't we check the git diffs when the CI runs and, if we find ConfigMap files (which we would identify by some rule/logic), apply them?
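
As a rough sketch of this idea in Python: list the files that changed between two commits and keep those that match a path rule. The `path/to/alerts.d/` prefix is a placeholder rather than this repository's actual layout, and `git_changed_paths` and `select_alert_files` are hypothetical helpers:

```python
import subprocess

# Hypothetical rule: alert definitions are YAML files under this directory.
ALERTS_PREFIX = "path/to/alerts.d/"


def git_changed_paths(base: str, head: str) -> list[str]:
    """Ask git which files differ between two commits."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base, head],
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def select_alert_files(paths: list[str], prefix: str = ALERTS_PREFIX) -> list[str]:
    """Keep only the changed paths that look like alert rule files."""
    return sorted(
        path for path in paths
        if path.startswith(prefix) and path.endswith((".yaml", ".yml"))
    )
```

A CI step would then apply the ConfigMap only when `select_alert_files(git_changed_paths(base, head))` is non-empty.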

jchristgit commented 3 months ago

Can't we check the git diffs when the CI runs and, if we find ConfigMap files (which we would identify by some rule/logic), apply them?

This is a good idea. But from my understanding, the issue was that we don't really know when Kubernetes has rolled out the configmaps.

We could simply sleep for 10 seconds and then apply it. If eventual consistency isn't consistent in 10 seconds, then I guess I'm done.

jb3 commented 3 months ago

Unfortunately, ConfigMap updates cannot be guaranteed to have settled on live pods within that window; from memory, it's a scheduled job on the kubelet.

The Kubernetes solution is to run a sidecar container with something like inotify (or whatever the modern equivalent is). On detecting a change, it can call out to Prometheus via the HTTP management API or (I think) send a signal to the process. I can't remember whether sidecars share the same process namespace.

I'll investigate this one later today.
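
The sidecar approach could look roughly like the Python sketch below. It swaps inotify for plain polling of the `..data` symlink, since the standard library has no inotify binding. It assumes the mount path from the listing earlier in this thread, Prometheus reachable on `localhost:9090`, and the reload endpoint enabled with `--web.enable-lifecycle`; all function names are hypothetical:

```python
import os
import time
import urllib.request

# Assumptions: mount path taken from the listing in this thread, and
# Prometheus listening on localhost:9090 with --web.enable-lifecycle set.
ALERTS_DIR = "/opt/pydis/prometheus/alerts.d"
RELOAD_URL = "http://localhost:9090/-/reload"


def current_revision(directory: str) -> str:
    """Return the target of the `..data` symlink, which changes on each update."""
    return os.readlink(os.path.join(directory, "..data"))


def request_reload(url: str = RELOAD_URL) -> int:
    """POST to the Prometheus reload endpoint and return the HTTP status."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def check_once(directory: str, seen: str, reload=request_reload) -> str:
    """Trigger a reload if the revision differs from `seen`; return the latest revision."""
    latest = current_revision(directory)
    if latest != seen:
        reload()
    return latest


def watch(directory: str = ALERTS_DIR, interval: float = 5.0) -> None:
    """Poll forever. A failed reload raises, so the container exits and is restarted."""
    seen = current_revision(directory)
    while True:
        time.sleep(interval)
        seen = check_once(directory, seen)
```

Calling `watch()` as the sidecar's entrypoint would be enough for a first version. Letting a failed reload crash the container is a deliberate simplification here, not a recommendation.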

jchristgit commented 3 months ago

This is a sound idea. inotifywait in a bash script should be sufficient.

I do think that containers in the same pod share the same process namespace; if not, maybe we can configure it. Failing that, we can use the HTTP management API, but we need to make sure it is locked down externally.
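
On the signal route: as far as I know, sharing the process namespace is opt-in through `shareProcessNamespace: true` in the pod spec, so it would need to be configured. With that in place, a sidecar could find the Prometheus process and send it SIGHUP, which Prometheus treats as a request to reload its configuration. A minimal Python sketch, assuming the process name is `prometheus`; `find_pids` and `signal_reload` are hypothetical helpers:

```python
import os
import signal


def find_pids(name: str, proc_root: str = "/proc") -> list[int]:
    """Return PIDs whose command name (from <proc_root>/<pid>/comm) equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                if handle.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited between listing and reading; skip it.
            continue
    return sorted(pids)


def signal_reload(name: str = "prometheus", proc_root: str = "/proc") -> int:
    """Send SIGHUP to every matching process and return how many were signalled."""
    pids = find_pids(name, proc_root)
    for pid in pids:
        os.kill(pid, signal.SIGHUP)
    return len(pids)
```

The sidecar would also need permission to signal the process, for example by running as the same user as Prometheus.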

However, with automated reloads like this we should ensure we have an alert in case of config reload failures. We do not have this yet, do we?

jb3 commented 3 months ago

However, with automated reloads like this we should ensure we have an alert in case of config reload failures. We do not have this yet, do we?

Yes, we should be able to add an alert for this; I'll include it when I PR this feature in. prometheus_config_last_reload_successful should handle it.
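
That metric is a gauge that reads 1 when the last reload succeeded and 0 otherwise, so an alert would fire on `prometheus_config_last_reload_successful == 0`. The same check can be sketched in Python against the Prometheus HTTP API, for instance as a final step in the deploy job. The base URL is an assumption, and `reload_succeeded` and `query_reload_status` are hypothetical helpers:

```python
import json
import urllib.parse
import urllib.request

METRIC = "prometheus_config_last_reload_successful"


def reload_succeeded(payload: dict) -> bool:
    """Interpret an instant-query response: every returned sample must equal 1."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload!r}")
    samples = payload["data"]["result"]
    # Each sample looks like {"metric": {...}, "value": [timestamp, "1"]}.
    return bool(samples) and all(float(s["value"][1]) == 1.0 for s in samples)


def query_reload_status(base_url: str = "http://localhost:9090") -> bool:
    """Fetch the metric from /api/v1/query and report whether the last reload worked."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": METRIC})
    with urllib.request.urlopen(url, timeout=10) as response:
        return reload_succeeded(json.load(response))
```

An empty result is treated as a failure here, on the assumption that a missing metric is more likely a problem than a success.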