louislam / uptime-kuma

A fancy self-hosted monitoring tool
https://uptime.kuma.pet
MIT License
60.21k stars 5.38k forks

Services falsely reported as offline during a system overload #724

Open MAXOUXAX opened 3 years ago

MAXOUXAX commented 3 years ago

Description of the bug:
When my server is overloaded, Uptime Kuma can't communicate with my services, so it considers them offline. My services are not hosted on the same server, so they work fine, but my status page shows a reduced uptime.

(I want to specify that I voluntarily overloaded my server in order to fine-tune my Anti-DDoS protection)

To Reproduce:
Steps to reproduce the behavior:

  1. Overload the system and/or network that hosts your status page.
  2. Wait a few minutes
  3. Notice that your services are considered offline and have lost uptime.

Expected behavior:
The uptime shouldn't be affected at all.

Info:
  • Uptime Kuma Version: 1.8.0
  • Using Docker?: Yes
  • Docker Version: 20.10.8
  • OS: Debian 10
  • Browser: Brave V1.30.89

Possible fix:
When the service has been queried, and an error has been retrieved, execute an action that is supposed to run quickly and check its execution time. If this execution time is greater than a certain limit, ignore the error.
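A minimal sketch of the check proposed above, written in Python purely for illustration (it is not Uptime Kuma code). The names `quick_action`, `host_is_overloaded`, `classify_failure` and the 50 ms limit are all assumptions made for this example.

```python
import time
from typing import Callable

# Hypothetical threshold: how long the quick local action may take
# before the monitoring host itself is considered overloaded.
OVERLOAD_LIMIT_SECONDS = 0.05


def quick_action() -> None:
    """A cheap local task that should finish almost instantly on a healthy host."""
    sum(range(10_000))


def host_is_overloaded(
    action: Callable[[], None] = quick_action,
    limit: float = OVERLOAD_LIMIT_SECONDS,
) -> bool:
    """Time the quick action; a slow run suggests the monitor is overloaded."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start > limit


def classify_failure(overloaded: bool) -> str:
    """Decide what to record after a check has already failed."""
    # "ignored" means the failed check is not counted against uptime.
    return "ignored" if overloaded else "down"
```

After a failed check, the monitor would call `classify_failure(host_is_overloaded())`. Note that a CPU-bound action like this one only reveals an overloaded host; a saturated network link would need a different kind of probe.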

gaby commented 3 years ago

So you are DDoSing the uptime-kuma server, and want the server to keep up?

How is this related to uptime-kuma?

louislam commented 3 years ago

I think a good network connection is a hidden requirement here.

PopcornPanda commented 3 years ago

I think a cross-check could be handy in such a case. Sometimes the host running uptime-kuma has problems, not the monitored service. There is already a feature request for such a solution: #84. Cross-checking would be a nice addition to kuma: tag a service as unavailable only if, say, 2 of 3 checkers detect a problem with it (the numbers are just an example, but it has to be a quorum).
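A rough sketch of the quorum rule described above, in Python. The function name `quorum_status` and the idea of passing each probe's result as a boolean are assumptions for illustration; the feature request in #84 may define things differently.

```python
from typing import Sequence


def quorum_status(probe_results: Sequence[bool], required_failures: int) -> str:
    """Combine results from several independent probes.

    probe_results: True means that probe reached the service.
    required_failures: failures needed before the service is marked down.
    """
    failures = sum(1 for reachable in probe_results if not reachable)
    return "down" if failures >= required_failures else "up"
```

With three probes and `required_failures=2`, a single probe that cannot reach the service (for example because its own host is overloaded) does not mark the service as down.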

MAXOUXAX commented 3 years ago

> So you are DDoSing the uptime-kuma server, and want the server to keep up?
>
> How is this related to uptime-kuma?

Well, essentially, there's always a way to take a website down, and I don't want attackers DDoS'ing my status page AND causing my services to report offline. Even though my status page would be down during the attack, I don't want my services to be shown as degraded and my uptime as really low after the attack, because, well, my services were just fine. That's an edge case, but still.

> I think a good network connection is a hidden requirement here.

Good network connection doesn't mean invulnerable ^^

gaby commented 3 years ago

Yes, but it has nothing to do with uptime-kuma. These are networking/firewall concerns. You can use ufw, fail2ban, Cloudflare, and a properly configured NGINX to mitigate DDoS attacks.

MAXOUXAX commented 3 years ago

Yes it does? Having a good firewall is one thing; being invulnerable is another. I have protections such as Cloudflare and fail2ban (as I said, I was fine-tuning them when I noticed the issue), but they will never make me invulnerable to botnets, to types of attack I haven't thought of, or to other potential issues.

deefdragon commented 3 years ago

I think that this is at least partially a Kuma problem. Fundamentally, the service is up, but Kuma is failing to detect it as such.

That doesn't mean it is an easy problem to solve, or one that should be tackled right now, however. I believe @NixNotCastey is onto a potential solution: separating the reporters/collectors from the display would mean the collectors are unaffected by a DoS. Something to explore in the future with #84.

markdesilva commented 3 years ago

@MAXOUXAX ah, so what you want is like what GSA has: an "override" feature so you can tell UK, "hey, this isn't a server outage, it's actually UK that was having connection issues, so please put my % back to 100%".

So when they click the "DOWN" pill in the dashboard, a pop-up appears with an on/off button for "override" and a text box for the reason. On submit, the reason for the override replaces messages such as "No heartbeat in the time window", "connect ECONNREFUSED ", or "timeout of 48000ms exceeded".

Yeah I think it's a good thing to have, especially when optics are important to upper management. They won't look at the production servers directly, they will look at your stats which UK provides. So it would be good for them to be able to see that the service has been 100% up rather than down just because UK couldn't connect to the services and not because the services were actually down. Doesn't have to be a DDoS on the UK, it could be something innocent like "tripped over UK server network cable and it came out" or "UK NIC faulty, had to replace".
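To make the override idea concrete, here is a small Python sketch. The `Heartbeat` class, its `overridden` flag, and both functions are hypothetical and only illustrate how an override could feed into the uptime percentage; none of this is Uptime Kuma's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Heartbeat:
    up: bool
    message: str = ""
    overridden: bool = False  # hypothetical flag set by the operator


def apply_override(beat: Heartbeat, reason: str) -> None:
    """Mark a heartbeat as a false alarm and replace its error message."""
    beat.overridden = True
    beat.message = reason


def uptime_percent(beats: List[Heartbeat]) -> float:
    """Uptime where overridden heartbeats count as up."""
    if not beats:
        return 100.0
    good = sum(1 for b in beats if b.up or b.overridden)
    return 100.0 * good / len(beats)
```

In this sketch the original `up` value is kept, so an override can be audited or reverted later instead of rewriting history.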

In fact, in this situation it would also be good to have a "select report range" function (display reports within a certain date and/or time range) and a "download reports" (to PDF) function.

My 2 cents worth.

louislam commented 3 years ago

Ultimately, I think one possible solution is completely separating the core and the status page into two different projects.

So if someone attacks your status page, it won't take down the core too.

gaby commented 3 years ago

> Ultimately, I think one possible solution is completely separating the core and the status page into two different projects.
>
>   • Host the core inside a private network and don't expose it.
>   • Host the status page on another server and expose the page to the Internet. Sync the data through something like a private tunnel.
>
> So if someone attacks your status page, it won't take down the core too.

Status page should be internal to your network. Not exposed to the internet.

louislam commented 3 years ago

However, given the amount of effort involved, it won't happen anytime soon.

If you are using Cloudflare, setting a Page Rule with Cache Everything for 5 minutes and disabling WebSocket is a way to go too.

Use your internal address to access the dashboard.

deefdragon commented 3 years ago

> Status page should be internal to your network. Not exposed to the internet.

That depends on what you are using the status page for. I use my status page to show off the current state of the different APIs that my site uses, similar to www.cloudflarestatus.com for Cloudflare.

louislam commented 3 years ago

>> Status page should be internal to your network. Not exposed to the internet.
>
> That depends on what you are using the status page for. I use my status page to show off the current state of the different APIs that my site uses, similar to www.cloudflarestatus.com for Cloudflare.

Agreed. If you don't expose it to the Internet, OP's problem is not a problem.