Open alanbchristie opened 2 months ago
@tdudgeon has implemented Prometheus on a dev cluster but there are some issues that need resolving. To be discussed at the next meeting.
Some metrics targets are broken when you do a vanilla install of Prometheus. I raised this issue to get some assistance with this: https://github.com/rancher/rancher/issues/45363#issuecomment-2096585286
One of the STFC users (James Adams) suggested using this to monitor network connectivity within the cluster nodes: https://oss.oetiker.ch/smokeping/doc/smokeping_master_slave.en.html
@tdudgeon thinks that SmokePing will aid in monitoring network connectivity, but needs to discuss with @alanbchristie whether we need it or whether Prometheus already covers this node-level connectivity monitoring. Likely 1-2 days' work to implement SmokePing if it's needed.
(Alan back on Monday)
@tdudgeon says it's firing false alerts - it is misconfigured out of the box.
Hoping for others to fix the bugs - we will decide at the next meeting what the actions are.
@tdudgeon says the false alerts are an artefact of being on such an old version of everything: Kubernetes, Rancher, Longhorn, etc.
Please scope out the work.
Immediate action: fire up a test cluster with all the upgraded components and see if the monitoring still has issues.
With Prometheus and Grafana installed we can deploy the stack (using deployment playbooks from the fragalysis-stack-kubernetes repository tagged 2024.12 or later). This deploys a ServiceMonitor definition (and an adjusted Service definition) to export the metrics to Prometheus.
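For orientation, a ServiceMonitor of the kind the playbooks deploy looks roughly like this. This is a minimal sketch only - the names, labels and port below are illustrative assumptions, not the actual values used in the fragalysis-stack-kubernetes repository:

```yaml
# Hypothetical ServiceMonitor sketch - names and labels are illustrative only.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fragalysis-stack        # assumed name
  labels:
    release: prometheus         # must match the Prometheus operator's selector
spec:
  selector:
    matchLabels:
      app: fragalysis-stack     # must match labels on the (adjusted) Service
  endpoints:
    - port: metrics             # assumed name of the port exposing /metrics
      path: /metrics
      interval: 30s
```

The key detail is that the `spec.selector` must match the labels on the adjusted Service, and the ServiceMonitor's own labels must match whatever selector the Prometheus operator is configured with, otherwise the target is silently ignored.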
Once done you can then add the generic Django dashboard to Grafana by navigating to its Dashboards -> Import page, choosing Import via grafana.com, and entering the dashboard ID 17658. Change the name and folder if you wish, then select the Prometheus instance and click Import. The dashboard should then be displayed.
Here are some overnight "out of the box" metrics from the latest staging stack: -
Interesting, this shows that there are a lot of very long response times and some very large response payloads. There are some clear endpoint culprits (response times). These all appear to take longer than 10 seconds, for example: -
And these appear to take more than 25 seconds: -
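Figures like these come from request-latency histograms: Prometheus stores cumulative bucket counts and Grafana estimates quantiles from them (the `histogram_quantile` function). As a rough illustration of how a "p95 above 10 seconds" figure is derived (a self-contained sketch, not tied to the real metric names in the dashboard), the estimate interpolates linearly within the first bucket whose cumulative count reaches the target rank:

```python
# Sketch of Prometheus-style histogram quantile estimation.
# Buckets are (upper_bound_seconds, cumulative_count) pairs, as exposed
# by a Prometheus histogram's "_bucket" series (the le="..." labels).

def histogram_quantile(q, buckets):
    """Estimate the q-th quantile by linear interpolation within
    the first bucket whose cumulative count covers rank q."""
    buckets = sorted(buckets)           # ascending by upper bound
    total = buckets[-1][1]              # count in the final bucket
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:     # empty bucket: nothing to interpolate
                return bound
            # position of the rank within this bucket, interpolated linearly
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Example: 100 requests; 70 finished within 5s, 90 within 10s, all within 30s.
buckets = [(5.0, 70), (10.0, 90), (30.0, 100)]
print(histogram_quantile(0.95, buckets))  # → 20.0 (p95 lands in the 10-30s bucket)
```

This is also why the dashboard's long-tail numbers are estimates: anything slower than the last finite bucket boundary is only known to be "somewhere in the top bucket".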
@tdudgeon says:
There is a lot of good output from the monitoring as implemented, e.g. as @alanbchristie has shown above. However, filesystem/volume mounting is very unstable: frequently (but inconsistently) read/write volumes are incorrectly re-mounted as read-only, potentially due to network connectivity issues.
We have now deployed the SmokePing utility (mentioned by STFC). This now runs on each node in the DEV cluster, generating ping performance figures for all the other nodes in the cluster (excluding etcd).
The playbooks and documentation for the utility can be found in our new Ansible repo that is used to deploy the container image and related material: -
This allows us to see metrics generated by each node: -
The expectation is that if networking is a cause of our resilience issues we might see something in the metrics being generated.
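SmokePing's per-node graphs essentially plot, for each probe round, the packet-loss fraction plus the median and spread of the round-trip times (the "smoke" band). A toy aggregation of one round of pings - pure Python for illustration, nothing to do with the actual SmokePing implementation - might look like:

```python
# Toy SmokePing-style summary of one probe round: each sample is a
# round-trip time in milliseconds, or None for a lost ping.

def summarise_round(samples):
    """Return (loss_fraction, median_rtt_ms, spread_ms) for one round."""
    replies = sorted(s for s in samples if s is not None)
    loss = 1.0 - len(replies) / len(samples)
    if not replies:
        return loss, None, None
    mid = len(replies) // 2
    if len(replies) % 2:
        median = replies[mid]
    else:
        median = (replies[mid - 1] + replies[mid]) / 2
    spread = replies[-1] - replies[0]   # max - min RTT (the "smoke" band)
    return loss, median, spread

# Example: 10 pings from one node to another, two lost, one slow outlier.
round1 = [0.42, 0.45, None, 0.44, 0.43, 5.10, 0.44, None, 0.46, 0.45]
loss, median, spread = summarise_round(round1)
print(loss, median, spread)
```

A healthy node pair shows near-zero loss and a tight band; intermittent loss or a wide spread between nodes would be the kind of signal that could correlate with the volume re-mounting problems.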
@alanbchristie says that SmokePing has already been useful to help diagnose issues this morning.
@tdudgeon says we now have performance monitoring, so this ticket is done.
Main purpose: we want to understand resilience issues.
Work to add monitoring (Prometheus, Grafana, Sentry) to the deployed stacks. Even minimal work should give us access to Kubernetes metrics and, by installing a package in the Fragalysis Django app, access to basic API performance.
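The Django-side package is presumably django-prometheus (an assumption - the actual package chosen for the stack may differ). Wiring it in is a small settings change, roughly:

```python
# Sketch of enabling django-prometheus in a Django project's settings.py
# (assumes the django-prometheus package; adapt names to the real project).

INSTALLED_APPS = [
    # ... existing apps would go here ...
    "django_prometheus",
]

MIDDLEWARE = [
    # Must be first, so request timing covers all other middleware.
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    # ... existing middleware would go here ...
    # Must be last.
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# And in urls.py, expose /metrics for Prometheus to scrape:
# urlpatterns += [path("", include("django_prometheus.urls"))]
```

With that in place the app exports per-view request counts and latency histograms, which is exactly the "basic API performance" data the generic Django dashboard (ID 17658) visualises.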
See the Google doc that describes the options: -
https://docs.google.com/document/d/1V3D0-dAMQvscYN3bUhcWvHpEWN7k7ZjkwlDAkDeKBqs/edit?usp=sharing