scylladb / scylla-monitoring

Simple monitoring of Scylla with Grafana
https://scylladb.github.io/scylla-monitoring/
Apache License 2.0

Repeating Alerts for a task failing on Scylla Manager #2307

Open noellymedina opened 4 weeks ago

noellymedina commented 4 weeks ago

When a backup or repair task on Scylla Manager fails, it automatically retries right away, and the retry usually succeeds. The problem is that every failed run triggers its own alert. Is it possible to configure the alert to be sent only once, when the task has failed definitively and there are no more attempts left?

@amnonh

mykaul commented 3 weeks ago

Does it make sense, though? Do you really not want to be aware of a failed repair task? (Perhaps it should be an option.)

noellymedina commented 3 weeks ago

The point is that every retry spams an alert. I was thinking of an alternative that raises a single alert instead of several for the same task.
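A rough sketch of the requested behavior, in plain Python rather than actual Prometheus or Alertmanager configuration. The `TaskRun` fields (`task_id`, `attempt`, `max_attempts`, `ok`) and the `final_failures` function are hypothetical names made up for illustration; they are not Scylla Manager's data model or metrics.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TaskRun:
    # Hypothetical record of one task attempt; not Scylla Manager's schema.
    task_id: str
    attempt: int       # 1-based attempt number
    max_attempts: int  # total attempts allowed, retries included
    ok: bool


def final_failures(runs: Iterable[TaskRun]) -> List[str]:
    """Return the task IDs that should alert: at most one entry per task,
    and only when the last allowed attempt has failed."""
    alerts: List[str] = []
    for run in runs:
        exhausted = run.attempt >= run.max_attempts
        if not run.ok and exhausted and run.task_id not in alerts:
            alerts.append(run.task_id)
    return alerts


runs = [
    TaskRun("backup/1", 1, 3, ok=False),  # failed once, then recovered
    TaskRun("backup/1", 2, 3, ok=True),
    TaskRun("repair/7", 1, 3, ok=False),  # failed on every attempt
    TaskRun("repair/7", 2, 3, ok=False),
    TaskRun("repair/7", 3, 3, ok=False),
]
print(final_failures(runs))  # prints ['repair/7']
```

Here the backup task that recovered on its retry produces no alert, and the repair task that used up all its attempts produces exactly one.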

noellymedina commented 3 weeks ago

As an example: [image] Every one of these attempts would spam an alert, while a single alert for this task would be enough to let us know that it requires investigation.

mykaul commented 3 weeks ago

We should fix the root cause. I don't think the alert can know it's due to the same issue or a different issue.

amnonh commented 3 weeks ago

Let's get the manager team involved. I think this is part of the bigger issue of what the manager reports: on the one hand it exposes too many metrics, and on the other it's hard to get significant information out of them.

amnonh commented 1 week ago

@mykaul, can we assign someone from the manager team to get their input?

mykaul commented 1 week ago

@karol-kokoszka - can you look into this?

karol-kokoszka commented 6 days ago

Outside of the scope of alerts in this issue... The error that @noellymedina attached comes from the ScyllaDB back-end used by Scylla Manager. A simple INSERT operation failed. Something is wrong with this Scylla instance. Maybe there is not enough memory? It's worth checking the logs from the Scylla instance running on the same machine as the manager.

Can I see the full error message (not truncated)? We have a known issue in SM, https://github.com/scylladb/scylla-manager/issues/3884, but I'm not sure if it's connected.