Open alanbchristie opened 2 months ago
@tdudgeon has implemented Prometheus on a dev cluster but there are some issues that need resolving. To be discussed at the next meeting.
Some metrics targets are broken when you do a vanilla install of Prometheus. I raised this issue to get some assistance with this: https://github.com/rancher/rancher/issues/45363#issuecomment-2096585286
One of the STFC users (James Adams) suggested using this to monitor network connectivity within the cluster nodes: https://oss.oetiker.ch/smokeping/doc/smokeping_master_slave.en.html
@tdudgeon thinks that SmokePing will aid in monitoring network connectivity, but needs to discuss with @alanbchristie whether we need it or whether Prometheus already covers this node-level connectivity monitoring. Likely 1-2 days' work to implement SmokePing if it's needed.
(Alan back on Monday)
@tdudgeon says it's firing false alerts - it is misconfigured out of the box.
Hoping for others to fix the bugs - we will decide at the next meeting what the actions are.
@tdudgeon says the false alerts are an artefact of being on such an old version of everything: Kubernetes, Rancher, Longhorn, etc.
Please scope out the work.
Immediate action: fire up a test cluster with all the upgraded components and see if the monitoring still has issues.
With Prometheus and Grafana installed we can deploy the stack (using deployment playbooks from the fragalysis-stack-kubernetes repository tagged 2024.12 or later). This deploys a ServiceMonitor definition (and an adjusted Service definition) to export the metrics to Prometheus.
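For orientation, a ServiceMonitor of the kind the playbooks deploy looks roughly like this. This is a minimal sketch only - the names, labels and port below are illustrative assumptions, not the actual values used in the fragalysis-stack-kubernetes repository:

```yaml
# Hypothetical ServiceMonitor sketch - names and labels are illustrative only.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fragalysis-stack        # assumed name
  labels:
    release: prometheus         # must match the Prometheus operator's selector
spec:
  selector:
    matchLabels:
      app: fragalysis-stack     # must match labels on the (adjusted) Service
  endpoints:
    - port: metrics             # assumed name of the port exposing /metrics
      path: /metrics
      interval: 30s
```

The key detail is that the `spec.selector` must match the labels on the adjusted Service, and the ServiceMonitor's own labels must match whatever selector the Prometheus operator is configured with, otherwise the target is silently ignored.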
Once done you can then add the generic Django dashboard to Grafana by navigating to its Dashboards -> Import page, choosing Import via grafana.com, and entering the dashboard ID 17658. Change the name and folder if you wish, then select the Prometheus instance and click Import. The dashboard should then be displayed.
Here are some overnight "out of the box" metrics from the latest staging stack: -
Interesting, this shows that there are a lot of very long response times and some very large response payloads. There are some clear endpoint culprits (response times). These all appear to take longer than 10 seconds, for example: -
And these appear to take more than 25 seconds: -
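Figures like these come from request-latency histograms: Prometheus stores cumulative bucket counts and Grafana estimates quantiles from them (the `histogram_quantile` function). As a rough illustration of how a "p95 above 10 seconds" figure is derived (a self-contained sketch, not tied to the real metric names in the dashboard), the estimate interpolates linearly within the first bucket whose cumulative count reaches the target rank:

```python
# Sketch of Prometheus-style histogram quantile estimation.
# Buckets are (upper_bound_seconds, cumulative_count) pairs, as exposed
# by a Prometheus histogram's "_bucket" series (the le="..." labels).

def histogram_quantile(q, buckets):
    """Estimate the q-th quantile by linear interpolation within
    the first bucket whose cumulative count covers rank q."""
    buckets = sorted(buckets)           # ascending by upper bound
    total = buckets[-1][1]              # count in the final bucket
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:     # empty bucket: nothing to interpolate
                return bound
            # position of the rank within this bucket, interpolated linearly
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Example: 100 requests; 70 finished within 5s, 90 within 10s, all within 30s.
buckets = [(5.0, 70), (10.0, 90), (30.0, 100)]
print(histogram_quantile(0.95, buckets))  # → 20.0 (p95 lands in the 10-30s bucket)
```

This is also why the dashboard's long-tail numbers are estimates: anything slower than the last finite bucket boundary is only known to be "somewhere in the top bucket".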
@tdudgeon says:
There is a lot of good output from the monitoring as implemented, e.g. as @alanbchristie has shown above. However, filesystem/volume mounting is very unstable: frequently (but inconsistently) read/write volumes are incorrectly re-mounted as read-only, potentially due to network connectivity issues.
We have now deployed the SmokePing utility (mentioned by STFC). This now runs on each node in the DEV cluster, generating ping performance figures for all the other nodes in the cluster (excluding etcd).
The playbooks and documentation for the utility can be found in our new Ansible repo that is used to deploy the container image and related material: -
This allows us to see metrics generated by each node: -
The expectation is that if networking is a cause of our resilience issues we might see something in the metrics being generated.
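SmokePing's per-node graphs essentially plot, for each probe round, the packet-loss fraction plus the median and spread of the round-trip times (the "smoke" band). A toy aggregation of one round of pings - pure Python for illustration, nothing to do with the actual SmokePing implementation - might look like:

```python
# Toy SmokePing-style summary of one probe round: each sample is a
# round-trip time in milliseconds, or None for a lost ping.

def summarise_round(samples):
    """Return (loss_fraction, median_rtt_ms, spread_ms) for one round."""
    replies = sorted(s for s in samples if s is not None)
    loss = 1.0 - len(replies) / len(samples)
    if not replies:
        return loss, None, None
    mid = len(replies) // 2
    if len(replies) % 2:
        median = replies[mid]
    else:
        median = (replies[mid - 1] + replies[mid]) / 2
    spread = replies[-1] - replies[0]   # max - min RTT (the "smoke" band)
    return loss, median, spread

# Example: 10 pings from one node to another, two lost, one slow outlier.
round1 = [0.42, 0.45, None, 0.44, 0.43, 5.10, 0.44, None, 0.46, 0.45]
loss, median, spread = summarise_round(round1)
print(loss, median, spread)
```

A healthy node pair shows near-zero loss and a tight band; intermittent loss or a wide spread between nodes would be the kind of signal that could correlate with the volume re-mounting problems.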
@alanbchristie says that SmokePing has already been useful to help diagnose issues this morning.
@tdudgeon says we now have performance monitoring, so this ticket is done.
Main purpose: we want to understand resilience issues.
Work to add monitoring (Prometheus, Grafana, Sentry) to the deployed stacks. Even minimal work should give us access to Kubernetes metrics and, by installing a package in the Fragalysis Django app, access to basic API performance.
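The Django-side package is presumably django-prometheus (an assumption - the actual package chosen for the stack may differ). Wiring it in is a small settings change, roughly:

```python
# Sketch of enabling django-prometheus in a Django project's settings.py
# (assumes the django-prometheus package; adapt names to the real project).

INSTALLED_APPS = [
    # ... existing apps would go here ...
    "django_prometheus",
]

MIDDLEWARE = [
    # Must be first, so request timing covers all other middleware.
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    # ... existing middleware would go here ...
    # Must be last.
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# And in urls.py, expose /metrics for Prometheus to scrape:
# urlpatterns += [path("", include("django_prometheus.urls"))]
```

With that in place the app exports per-view request counts and latency histograms, which is exactly the "basic API performance" data the generic Django dashboard (ID 17658) visualises.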
See the Google doc that describes the options: -
https://docs.google.com/document/d/1V3D0-dAMQvscYN3bUhcWvHpEWN7k7ZjkwlDAkDeKBqs/edit?usp=sharing