louislam / uptime-kuma

A fancy self-hosted monitoring tool
https://uptime.kuma.pet
MIT License
52.46k stars, 4.72k forks

Push monitors' `PENDING`-status not respecting retries #4785

Open thielj opened 1 month ago

thielj commented 1 month ago

📑 I have found these related issues/pull requests

🏷️ Feature Request Type

API / automation options, Change to existing monitor

🔖 Feature description

Sending `status=pending&msg=backup%20started` would immediately put the monitor into 'retry mode' without waiting for the usual period to expire.
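For illustration, the query string above could be assembled as follows. This is only a sketch: the host and token are placeholders, the `/api/push/<token>` path is assumed from the URL that Uptime Kuma shows for push monitors, and `status=pending` is the behaviour being requested here, not something the current API is documented to handle this way.

```python
from urllib.parse import quote, urlencode


def build_push_url(base_url: str, token: str, status: str, msg: str) -> str:
    """Build a push URL with a percent-encoded query string."""
    # quote_via=quote encodes spaces as %20; the default (quote_plus) would emit "+".
    query = urlencode({"status": status, "msg": msg}, quote_via=quote)
    return f"{base_url.rstrip('/')}/api/push/{token}?{query}"


# Hypothetical host and token, for illustration only.
url = build_push_url("https://kuma.example.com", "TOKEN", "pending", "backup started")
print(url)
```

The helper only builds the URL; actually sending it (for example with `urllib.request.urlopen`) is left out so the sketch stays free of network access.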

✔️ Solution

There have been several mentions in the past.

Another example is monitoring processes that occasionally exit or need to be deliberately stopped but are expected to be restarted and become available within the retry period (think systemd unit).

Or a daily job where I generally expect UP once a day. When the job is being started I send PENDING, after which the monitor would go into retry mode and expect the UP to arrive within say 1h instead of waiting a full day before raising a notification (think remote job dying or stalling or blocking somehow).
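To make the timing difference concrete, here is a rough sketch of when a notification would be due in each case. It is a simplified model and not Uptime Kuma's actual accounting: `notify_deadline` and its parameters are hypothetical names, and the retry window is assumed to be simply `max_retries` times the retry interval.

```python
from datetime import datetime, timedelta


def notify_deadline(last_push: datetime, status: str,
                    heartbeat_interval: timedelta,
                    retry_interval: timedelta, max_retries: int) -> datetime:
    """Earliest time a DOWN notification would be raised if no UP push arrives."""
    retry_window = retry_interval * max_retries  # assumed size of the retry period
    if status == "pending":
        # Requested behaviour: enter retry mode immediately on a PENDING push.
        return last_push + retry_window
    # Otherwise the full heartbeat interval has to pass before retries begin.
    return last_push + heartbeat_interval + retry_window


job_start = datetime(2024, 1, 1, 2, 0)
day, twenty_min = timedelta(days=1), timedelta(minutes=20)
print(notify_deadline(job_start, "pending", day, twenty_min, 3))
print(notify_deadline(job_start, "up", day, twenty_min, 3))
```

With a daily interval and three 20-minute retries, the PENDING case is due one hour after the job starts, while the other case only becomes due a day and an hour after the last push.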

❓ Alternatives

For the above-mentioned examples, I couldn't find alternatives short of implementing my own "pending logic" somehow. None of these would add a suitable event record to Uptime Kuma either.

📝 Additional Context

No response

CommanderStorm commented 1 month ago

Where you are entirely correct is that the way we currently communicate this, and how the retry logic for this monitor works, is super weird.

[!NOTE] As context, PENDING means that a monitor either

  • has not had a push,
  • has failed in the past and is currently retrying, or
  • is in some other transitional step between UP and DOWN (such as docker containers starting up).

=> I don't know how setting a monitor to PENDING SHOULD behave. Our accounting around this is a bit messy and the behaviour is entirely undocumented. It should probably skip one retry but behave as if DOWN for other purposes, but I'm unsure.

=> That setting a monitor to DOWN does not trigger the correct retries is definitely a bug.

[!TIP] If you want to use the retry logic in the current system, you should instead not send a push: let the push monitor time out and go into the PENDING state on its own.

thielj commented 1 month ago

@CommanderStorm As I mentioned already, and others have mentioned before, letting something go into the pending state isn't really a solution if you have e.g. a job running once a day or are transitioning through a unit or container restart.

PENDING for me means - in the context of push notifications at least - that something isn't fully up or completed yet, but is due shortly, long before the regular period expires. Most importantly, if it doesn't come fully UP within the retry period, I want it to be considered DOWN and to be notified immediately.

Everything else either delays notifications unnecessarily or creates too many false positives.

It's not that different from something actively monitored by U-K, except that retries and retry periods just pass without actively retrying.

CommanderStorm commented 1 month ago

> letting something go into the pending state isn't really a solution

You are explicitly setting it to PENDING, so how can you not want the pending state?? I think something was lost in translation here. Frank is confused ^^

> It's not that different from something actively monitored by U-K, except that retries and retry periods just pass without actively retrying

I am going to repeat myself, as I am 5% unsure whether my last communication was clear (no offense intended, just trying not to miscommunicate ^^)

[!TIP] If you want to use the retry logic in the current system, you should NOT send a push in the interval. This lets the push monitor time out and go into the PENDING state. The retry logic is triggered via this path.
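As a sketch of what this tip means on the client side: report UP only after the job has succeeded and stay silent otherwise, so that the monitor times out on its own. `run_and_report` and the `send_push` callback are hypothetical names used for illustration; how the push is actually delivered is left to the caller.

```python
from typing import Callable


def run_and_report(job: Callable[[], None],
                   send_push: Callable[[str, str], None]) -> bool:
    """Run the job and push UP only on success; push nothing on failure."""
    try:
        job()
    except Exception:
        # Deliberately silent: the missing push lets the monitor time out,
        # go PENDING and run through its retries by itself.
        return False
    send_push("up", "OK")
    return True


# Example with a stand-in sender that only records what would be pushed.
sent: list[tuple[str, str]] = []
run_and_report(lambda: None, lambda status, msg: sent.append((status, msg)))
print(sent)
```

Note that this path does not shorten the wait for a daily job, which is the limitation raised earlier in the thread.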