vexxhost / atmosphere

Simple & easy private cloud platform featuring VMs, Kubernetes & bare-metal

Create an alert monitoring logs for list failures #490

Closed (mnaser closed 3 months ago)

mnaser commented 1 year ago

We have seen a major issue where ports disappear from Magnum clusters when a Nova cell stops responding "for some reason".

We should use Loki's alert rules and issue an alert if we see something like this message:

2023-07-15 09:06:44.022 23 WARNING nova.compute.multi_cell_list [req-52e0b57e-7d9c-456c-98cd-69755cc2b15e 1b23ea79cd454800094ea06e0c319d2eb2585879745d43a2787a2f004bd86950 27f0c0244502435bbde078259cea6201 - b9b8e05b25c64b03abc10069c08b6217 b9b8e05b25c64b03abc10069c08b6217] Cell 1dd95335-7be1-47df-a076-a4fac1cba152 is not responding and hence is being omitted from the results
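
Before writing a Loki rule, it can help to pin down exactly which part of the warning above is stable. The following is a minimal Python sketch, assuming the log format matches the sample line. The names `CELL_OMITTED`, `SAMPLE`, and `find_omitted_cell` are hypothetical, and the request context in `SAMPLE` is shortened for readability. It is not the alert rule itself, only a way to check the match pattern.

```python
import re
from typing import Optional

# Matches the nova.compute.multi_cell_list warning and captures the cell UUID.
# Assumption: the message wording is the same as in the sample log line.
CELL_OMITTED = re.compile(
    r"WARNING nova\.compute\.multi_cell_list .*"
    r"Cell (?P<cell>[0-9a-f-]{36}) is not responding "
    r"and hence is being omitted from the results"
)

# Sample line with the request context shortened (illustrative only).
SAMPLE = (
    "2023-07-15 09:06:44.022 23 WARNING nova.compute.multi_cell_list "
    "[req-52e0b57e-7d9c-456c-98cd-69755cc2b15e - - - - -] "
    "Cell 1dd95335-7be1-47df-a076-a4fac1cba152 is not responding "
    "and hence is being omitted from the results"
)


def find_omitted_cell(line: str) -> Optional[str]:
    """Return the cell UUID if the line is a 'cell not responding' warning."""
    match = CELL_OMITTED.search(line)
    return match.group("cell") if match else None


if __name__ == "__main__":
    print(find_omitted_cell(SAMPLE))
```

The fixed phrase "is not responding and hence is being omitted from the results" is the part an alert would most likely key on, since the timestamp, request ID, and cell UUID change from line to line.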

ref: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1612

satrik commented 9 months ago

@mnaser This sounds like what we discussed yesterday. It would be nice to have a way to create custom alerts based on Loki logs, not only for this specific case.

mnaser commented 7 months ago

@satrik Indeed. There are some challenges with this solution; I'll leave commentary on the PR.