jhuckaby / performa

A multi-server monitoring system with a web based UI.

Question - Alert Offline/Reboot #22

Open smaramwbc opened 6 months ago

smaramwbc commented 6 months ago

Hi @jhuckaby,

I've set up Performa with alerts, and they're working fine. However, I'm having trouble setting up an alert for when a server goes offline or reboots, and I can't find a solution for this common scenario.

jhuckaby commented 6 months ago

Hey there! Thank you for using Performa!

So, I can meet you halfway here. First, the good news. Alerting on reboots is fairly straightforward. All you need is a monitor that tracks the "uptime" of your servers, and then you can alert on that number being too low. Here's how to set it up:

First, create this monitor:

Save the monitor, and wait a few minutes for it to gather some samples. You should now see a graph on your servers that looks somewhat like this:

[Screenshot: uptime graph, 2024-03-09]

The uptime is the number of seconds since the server was last rebooted. So now that you have this monitor, all you need is an alert that triggers on the value being too low. Like, say, less than 3600 (1 hour). Here is an example alert configuration:

And that's it!
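To make the threshold concrete, here is a minimal shell sketch of the same comparison, run outside of Performa. It is only an illustration, not how Performa evaluates alerts internally. The function name `recently_rebooted` is hypothetical, and reading `/proc/uptime` assumes a Linux server.

```shell
#!/bin/sh
# Hypothetical helper, not part of Performa.
# Prints 1 if the given uptime (in seconds) is below the threshold, else 0.
recently_rebooted() {
    uptime_secs=$1
    threshold=${2:-3600}   # default of 1 hour, matching the example above
    if [ "$uptime_secs" -lt "$threshold" ]; then
        echo 1
    else
        echo 0
    fi
}

# Assumption: Linux, where the first field of /proc/uptime is the uptime
# in seconds with a fractional part (which we cut off here).
if [ -r /proc/uptime ]; then
    secs=$(cut -d. -f1 /proc/uptime)
    recently_rebooted "$secs" 3600
fi
```

Note that an alert like this only fires once the server is back up and reporting again, so it tells you a reboot happened, not that a server is currently down.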

Now for the bad news. Detecting if a server goes entirely offline is, unfortunately, out of scope for Performa, because the master (primary) server doesn't ever contact your other servers. The process all happens in reverse. Your servers all send a request to the primary server, once per minute. It's a "push" architecture.

What this means is, the primary server doesn't really "know" if a server goes offline. It just doesn't receive any metrics for that server.

There is hope on the horizon, however. I am currently working on something called "Orchestra", which is a modern remake of both my Cronicle and Performa tools, rolled into one, plus a bunch of awesome new features. Orchestra maintains a persistent connection to all servers, and can easily alert you if a server goes down. Orchestra is coming out later this year (2024).

Hope this helps!

smaramwbc commented 6 months ago

I just wanted to circle back and express my gratitude for your guidance on setting up alerts for server reboots using Performa. Your solution to monitor the uptime was spot-on and worked like a charm. It’s now seamlessly integrated into our system, enhancing our monitoring capabilities significantly.

I came across a warning scenario that I’m hoping to integrate as an alert. The warning indicates that a server has not submitted any data in over 10 minutes, suggesting it may have gone offline. I am curious if there's a way, perhaps with some custom coding, to translate this warning into an active alert. Do you have any advice or guidelines on how this could be implemented within Performa?

[Image: the warning banner described above]

Furthermore, I’m genuinely excited about your upcoming project "Orchestra." The features you mentioned, especially the ability to maintain persistent connections to all servers, promise a substantial leap forward in server management and monitoring. It's an ambitious undertaking, and I’m eager to see how it revolutionizes our workflows and improves operational efficiency.

jhuckaby commented 6 months ago

Alas, I'm sorry to report, the warning banner you show there is just a "trick". It's just client-side JavaScript code that runs in the browser page, comparing the timestamp of the latest server data with your current PC clock time. If the delta is over 10 minutes, it shows the banner.

The important thing to note here is, this is just a cosmetic "UI" feature running in the browser. The Performa server itself doesn't "know" that the server's data has gone stale, so it cannot generate an alert (without a significant redesign).
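For reference, the comparison the banner makes boils down to something like the sketch below. The real check is JavaScript in the browser; this shell version is only an illustration of the logic, and the function name `is_stale` and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration of the banner's comparison, not Performa code.
# Prints 1 if the newest sample is older than max_age seconds, else 0.
is_stale() {
    last_seen=$1            # epoch seconds of the newest sample
    now=$2                  # epoch seconds of the "current" clock
    max_age=${3:-600}       # 10 minutes, as in the banner
    delta=$((now - last_seen))
    if [ "$delta" -gt "$max_age" ]; then
        echo 1
    else
        echo 0
    fi
}

# Example: a sample from 15 minutes ago counts as stale (prints 1).
now=$(date +%s)
is_stale "$((now - 900))" "$now"
```

The check itself is trivial. The hard part is that nothing on the Performa server runs it on a schedule, which is where the redesign would come in.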

I'm afraid it would be too much work to retrofit Performa v1 with a feature that does this, which is why I'm focusing all my efforts on v2.

smaramwbc commented 6 months ago

Got it! I am eagerly anticipating the debut of Orchestra.

GuglielmoFelici commented 4 months ago

@jhuckaby I also want to thank you a lot for this awesome tool! Can I donate somehow?

Anyway, I see you wrote:

"What this means is, the primary server doesn't really "know" if a server goes offline. It just doesn't receive any metrics for that server."

But I wonder, is there any way to set up an alert if the master didn't receive any metrics from a server in the last N minutes?

Btw, @smaramwbc, if you have ssh access you could achieve it like this:

# Cron job run on the slave server
ssh master touch /run/server_ping # or any writable path

# Custom Performa command (named server_alive here) run on master server
# Prints 1 if the file was modified within the last 10 minutes, else 0
[ -f "/run/server_ping" ] && find "/run/server_ping" -mmin -10 -print | grep -q . && echo 1 || echo 0

Then, set up an alert on [commands/server_alive] < 1
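If you have several servers, the same idea could be extended so that each one touches its own file in a shared directory, and the command on the master counts the files that have gone quiet. This is just a sketch under my own assumptions: the directory `/run/server_pings` and the function name `count_stale_pings` are made up for illustration, and `find -mmin` needs a GNU, BSD, or BusyBox `find`.

```shell
#!/bin/sh
# Hypothetical extension of the ping-file trick to many servers.
# Each server's cron job would run something like:
#   ssh master touch /run/server_pings/$(hostname)
#
# Prints the number of ping files last modified more than N minutes ago.
count_stale_pings() {
    ping_dir=$1
    minutes=${2:-10}
    # -mmin +N matches files modified more than N minutes ago;
    # tr strips the padding some wc implementations add.
    find "$ping_dir" -type f -mmin +"$minutes" | wc -l | tr -d ' '
}

# Usage (only runs if the assumed directory exists):
if [ -d /run/server_pings ]; then
    count_stale_pings /run/server_pings 10
fi
```

You could then alert when that number is above zero. One caveat: a server that has never pinged has no file, so it would not be counted.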