Open noellymedina opened 4 weeks ago
Does it make sense though? Do you really not want to be aware of a failed repair task? (Perhaps it should be an option.)
The point is that every retry will spam an alert. I was thinking of an alternative that raises a single alert instead of several for the same task.
as an example:
Each of these attempts would trigger an alert, while a single alert for this task would be enough to let us know that it requires investigation.
We should fix the root cause. I don't think the alert can know it's due to the same issue or a different issue.
Let's get the manager team involved. I think this is part of the bigger issue of what the manager reports: on one hand they have too many metrics, and on the other it's hard to get significant information from them.
@mykaul, can we assign someone from the manager team to get their input?
@karol-kokoszka - can you look into this?
Outside of the scope of alerts in this issue... The error that @noellymedina attached comes from the ScyllaDB back-end used by Scylla Manager. A simple INSERT operation failed. Something is wrong with this Scylla instance. Maybe there is not enough memory? It's worth checking the logs from the Scylla instance running on the same machine as the manager.
Can I see the full error message (not cut)? We have a known issue in SM, https://github.com/scylladb/scylla-manager/issues/3884, but I'm not sure if it's connected.
When a backup or repair task on Scylla Manager fails, it automatically retries right after, and usually the retry succeeds. The problem is that every failed run triggers an alert. Is it possible to configure the alert so that it is sent only once, when the task has failed definitively and there are no more attempts left?
@amnonh
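To make the request above concrete, here is a minimal sketch of the "alert once, only after the last retry fails" behaviour. It is not Scylla Manager or Scylla Monitoring code: the `TaskAlertGate` class, the `max_retries` parameter, and the task IDs are all hypothetical names used for illustration, and it assumes whatever consumes the task status can see each run's outcome.

```python
from dataclasses import dataclass, field


@dataclass
class TaskAlertGate:
    """Emit at most one alert per task, and only once its retries are exhausted."""

    max_retries: int  # hypothetical: retries allowed after the first attempt
    failures: dict = field(default_factory=dict)  # task_id -> consecutive failed runs
    alerted: set = field(default_factory=set)     # task_ids already alerted on

    def record_run(self, task_id: str, succeeded: bool) -> bool:
        """Record one run and return True if an alert should be sent now."""
        if succeeded:
            # A successful run clears the failure streak and re-arms the alert.
            self.failures.pop(task_id, None)
            self.alerted.discard(task_id)
            return False
        self.failures[task_id] = self.failures.get(task_id, 0) + 1
        exhausted = self.failures[task_id] > self.max_retries
        if exhausted and task_id not in self.alerted:
            self.alerted.add(task_id)
            return True
        return False


gate = TaskAlertGate(max_retries=2)
# First attempt plus two retries all fail: only the last failure alerts.
decisions = [gate.record_run("repair/example", succeeded=False) for _ in range(3)]
```

With this shape, a task that fails once and then succeeds on retry never alerts, which is the case described above. It does not address the other point raised in the thread: the gate cannot tell whether repeated failures share one root cause.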