louislam / uptime-kuma

A fancy self-hosted monitoring tool
https://uptime.kuma.pet
MIT License
60.35k stars 5.4k forks

Tolerance/debounce for network errors #5235

Closed luizkowalski closed 1 month ago

luizkowalski commented 1 month ago

📑 I have found these related issues/pull requests

-

🏷️ Feature Request Type

API / automation options, Settings

🔖 Feature description

Sometimes, there are network errors, small hiccups that immediately trigger a notification. They cause false positives and bring down the SLO, and the problem is resolved in the next check:

image

✔️ Solution

I see two possible solutions: either an option to ignore certain error types, or a change in the algorithm. When a check fails, instead of immediately triggering an error, it is registered as a warning. The warning does not affect the SLO. If the next check also fails, an error is triggered and the SLO is affected; otherwise, the warning is cleared.
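The second option can be sketched as a small state machine. This is only an illustration of the proposal, not how Uptime Kuma is implemented; the names `Debouncer`, `record`, `classify`, and the `tolerance` parameter are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical state names used only in this sketch.
UP, WARNING, DOWN = "up", "warning", "down"


@dataclass
class Debouncer:
    """Tolerate `tolerance` consecutive failed checks before reporting DOWN."""

    tolerance: int = 1  # number of failures that are only warnings
    failures: int = 0   # consecutive failures seen so far

    def record(self, check_ok: bool) -> str:
        if check_ok:
            # A successful check clears any pending warning.
            self.failures = 0
            return UP
        self.failures += 1
        # The first `tolerance` failures are warnings; later ones are errors.
        return WARNING if self.failures <= self.tolerance else DOWN


def classify(results, tolerance=1):
    """Map a sequence of raw check results (True = ok) to reported states."""
    debouncer = Debouncer(tolerance=tolerance)
    return [debouncer.record(ok) for ok in results]
```

With `tolerance=1`, a single failed check followed by a success yields a warning and then up, while two failures in a row yield a warning and then down.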

❓ Alternatives

No response

📝 Additional Context

No response

CommanderStorm commented 1 month ago

> ENOTFOUND

Consider enabling NSCD in the settings to make the DNS resolver less likely to reject you as spammy. This has the downside of caching DNS queries, which you might not want.

> when a check fails, instead of immediately triggering an error, it is registered as a warning

That is what happens if you set Retry >= 1. This is not the default, because users kept asking why a monitor they expected to be down was suddenly not shown as down.

=> does this resolve the issue?

> the warning does not affect the SLO

That really depends on how your SLIs are defined. I don't know whether retries are counted as down or up, but at the end of the day both are sane choices imo 🤷🏻‍♂️.
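To illustrate why the SLI definition matters, here is a minimal sketch that computes availability from the same check history in two ways, counting a retried check either as up or as down. The state names and the `availability` function are assumptions made for this example, not Uptime Kuma's actual calculation.

```python
def availability(states, pending_counts_as_up):
    """Return the fraction of checks counted as good.

    `states` is a list of "up", "pending", or "down" (hypothetical labels);
    `pending_counts_as_up` decides how a retried check is scored.
    """
    if not states:
        raise ValueError("no checks recorded")
    good = 0
    for state in states:
        if state == "up" or (state == "pending" and pending_counts_as_up):
            good += 1
    return good / len(states)


# 100 checks with a single hiccup that recovered on the next check.
history = ["up"] * 98 + ["pending", "up"]

lenient = availability(history, pending_counts_as_up=True)
strict = availability(history, pending_counts_as_up=False)
```

For this history the lenient definition gives 100% and the strict one gives 99%, so the same hiccup may or may not count against the SLO depending on the definition chosen.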

luizkowalski commented 1 month ago

these are good tips, I will play around with these configs.

edit: I updated the Retry option and will monitor. NSCD was already enabled

CommanderStorm commented 1 month ago

In that case, please check the TTL of the DNS record. If the TTL is really low, this might also happen.