stackhpc / ansible-slurm-appliance

A Slurm-based HPC workload management environment, driven by Ansible.

Add blackbox exporter to k3s monitoring stack with probe for OOD #466

Open wtripp180901 opened 3 weeks ago

wtripp180901 commented 3 weeks ago

Installs the blackbox exporter into the k3s cluster and adds a Prometheus scrape job with a probe for OOD, plus a Grafana dashboard. Also adds the blackbox alerting rules from https://github.com/azimuth-cloud/capi-helm-charts/blob/2355ba6151289f35bba9a9e8bd7c372930c323a4/charts/cluster-addons/templates/monitoring/blackbox-exporter.yaml#L98
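For readers unfamiliar with the pattern, a minimal sketch of a blackbox probe scrape job is shown here. It follows the standard relabelling approach from the blackbox exporter documentation and is not copied from this PR; the job name `blackbox-ood`, the target URL and the exporter address `blackbox-exporter:9115` are all assumptions for illustration.

```yaml
scrape_configs:
  - job_name: blackbox-ood            # hypothetical job name
    metrics_path: /probe              # blackbox exporter probe endpoint
    params:
      module: [http_2xx]              # default HTTP module shipped with the exporter
    static_configs:
      - targets:
          - https://ood.example.org   # hypothetical OOD URL to probe
    relabel_configs:
      # Pass the original target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label on the resulting metrics
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the exporter, not to the target
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumed in-cluster service address
```

With this layout Prometheus scrapes the exporter, which in turn probes the target, so `probe_success` and `probe_duration_seconds` are reported with the probed URL as `instance`.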

NB: probes are slower than is typical for probes in capi-helm-charts, presumably because of the indirection involved in accessing OOD from inside the k3s cluster. The average probe duration seems to be about 1.2s.
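If any duration-based alert is in use, its threshold would need to sit above this average to avoid firing constantly. A hedged sketch of such a rule follows; the rule name, the 2s threshold, the `for` duration and the `job` label value are assumptions, not taken from this PR or from the upstream chart.

```yaml
groups:
  - name: blackbox-ood-example        # hypothetical group name
    rules:
      - alert: OODProbeSlow           # hypothetical alert name
        # Average probe duration over 5 minutes; 2s is an assumed threshold
        # chosen to leave headroom above the ~1.2s average observed here
        expr: avg_over_time(probe_duration_seconds{job="blackbox-ood"}[5m]) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OOD probe is slow ({{ $value }}s average over 5m)"
```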

wtripp180901 commented 3 weeks ago

https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/11702928693