Alert for high server `network`/`ram`/`disk`/`cpu` usage "heartbeat" monitoring

OryonMax commented 3 years ago

🏷️ Feature Request Type

New Monitor

🔖 Feature description

Please add Heartbeat Monitoring just like in HetrixTools.

✔️ Solution

Add a new monitor type which shows the server's network, RAM, disk and CPU usage and gives an alert when usage is close to 90%, so people know it's time to upgrade or add a new node.

❓ Alternatives

HetrixTools

📝 Additional Context

No response

👀 Have you spent some time to check if this feature request has been raised before?

[X] I checked and didn't find similar feature request

deefdragon commented 3 years ago

I feel this is out of scope for UK (for now, at least). For remote servers, if the usage stats are that important, something more tuned to metrics tracking (Prometheus, Grafana, etc.) should likely be employed.

(You could look into making a push monitor and writing a script yourself like this if you really need this feature for something).

OryonMax commented 3 years ago

Everybody needs that nowadays, and most status pages are paid. I hope to see this feature in Uptime Kuma.

markdesilva commented 3 years ago

As @deefdragon said, this might be out of scope for what UK does. However, you could use a push monitor with command-line utilities like "mpstat" and "free". If the UK developers also let users change the units and description of what's monitored, as I asked for in #749, then UK could give you individual graphs of CPU and memory utilization. The only downside is that you won't have one page with all the metrics together.

OryonMax commented 3 years ago

Not allowed in UK?

markdesilva commented 3 years ago

Didn't say it's not allowed, but as mentioned, it's not in the scope of what UK does. So even if the developer decides to put it in, it probably won't be a priority, unless you want to code that portion yourself and then make a pull request for your work to be included in a future release.

ImmaZoni commented 3 years ago

@OrefaSol to provide some insight on why this is not really in scope.

A typical "status" service does nothing other than ping a server: basically "Hey, you there?", and it responds or doesn't. Yes or no. This ping does not provide data on CPU, RAM, storage, or anything else. All you get is "yes, I'm alive, and it took this long".

Because of this, Uptime Kuma is a permissionless service, meaning I don't need to approve Uptime Kuma talking to my website, or any website really.

If Louis were to try and implement something like this, it would require a separate program/script that runs on the server you want to test and sends the extra info over to Uptime Kuma. So it requires direct access to every service you want to test and get this data on.

HetrixTools offers various services, one being a status service, and another being a server monitoring service.

Uptime Kuma (in its current form at least) is strictly a status service.

OryonMax commented 3 years ago

Nope, HetrixTools has Heartbeat Monitoring under its Uptime Monitoring product.

markdesilva commented 3 years ago

Sounds like maybe you should be using HetrixTools then as you’re obviously a fan.

rihards-simanovics commented 3 years ago

Nope, HetrixTools has Heartbeat Monitoring under its Uptime Monitoring product.

What you are asking here is out of scope (perhaps for now), full stop.

Unless you are willing to code it yourself, wait until the UK developer does it.

Surely you understand that everyone who develops and contributes to a free and open-source product dedicates their free time to do so. The feature that you are asking for is from a paid product, and there is a reason why it's paid: the money goes to a developer for their hard work.

louislam commented 3 years ago

Everyone relax🐻.

Just follow one rule. If you love the suggestion, give a 👍.

Ignore it if you don't like it.

zimbres commented 2 years ago

I use https://www.netdata.cloud/

rihards-simanovics commented 2 years ago

@zimbres, this is amazing and open source. Hmm, I think I might have this running for my client reports... I will keep using UK for internal services, as those don't require generated reports. Thanks for sharing the tool name!

PS: Perhaps UK might use some of the source code or take inspiration from that tool, as it looks quite nice.

ririko5834 commented 2 years ago

This should work the way HetrixTools works, so that you can also get stats about RAM, CPU, disk, network, etc. displayed in a graph on the status page.

markdesilva commented 2 years ago

Here we go again with HetrixTools.

I wonder if all these folks suggesting UK work like HetrixTools just want the HT functions because they want the unlimited HT functionality without paying for it.

Sounds like it doesn’t it? 🤷🏻‍♂️

rihards-simanovics commented 2 years ago

@ririko5834 the basic answer to your request is "maybe in the future".

UptimeKuma is a relatively simple uptime monitoring application running on NodeJS. I'm not saying that NodeJS is a bad choice; I am merely pointing out that a different language would be more favourable given the performance requirements of what you are asking.

As @markdesilva pointed out, and I agree with them, if you favour HetrixTools, you need to support the developer by getting a paid plan. UptimeKuma may be an open-source project for now, but I'm sure that when the time comes, the author will also want to have their own paid plans alongside open source for those people who don't want the hassle of setting one up themselves.

That being said, keep in mind that, as pointed out by @ImmaZoni, to know the server's hardware status, the author would need to develop a separate "companion" app and ask users to install it on the server, which would push the CPU, RAM, etc. information to UptimeKuma. If the last stable release (v1.14.0) is anything to go by, the author wants this application to just run without any additional hoops to jump through (ref. the Cloudflare proxy functionality).

EDIT: Almost forgot, @zimbres also noted that there is another open-source tool called netdata that you can use to monitor server hardware status.

InSelfControll commented 1 year ago

Hey, you don't have to install anything on the server. You just need to give uptime-kuma the option to connect via SSH to every server, get this info, and then parse it for the status page.

I have made a bash script that sends details like this to my email once disk usage passes 75%; the same goes for RAM and CPU.

Now I just need to find the right way to send the data to uptime-kuma so that it can send it to me via Telegram.

rihards-simanovics commented 1 year ago

@InSelfControll

Hey, you don't have to install anything on the server. You just need to give uptime-kuma the option to connect via SSH to every server, get this info, and then parse it for the status page.

Do you even realise how dangerous this is? Openly allowing an application (of all things) to have access to a server via SSH? It's almost as if security holes don't exist. So now the hacker, instead of hacking six of my servers, only needs to hack one of them to get SSH access to all the others.

The best and most secure way is to have a dedicated client application that receives a request (be it via a web URL or otherwise), processes it, and sends a JSON response to an API on the UK side, or alternatively just sends the JSON data at an interval, so there is only one-way communication from the server to UK.

What you've proposed breaks security best practices on so many levels.

InSelfControll commented 1 year ago

This user doesn't need any permissions except the df -h and free -m commands. You can always restrict a user to only one or two commands, or give the user limited SSH access that only allows sending these commands via SSH, @rihards-simanovics. You don't have to give full SSH login access, so there are no security issues. Today I have it automatically sent to my email / Telegram from each server.
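
One way such a restriction is commonly set up is OpenSSH's forced command: a `command="..."` option on the key's line in authorized_keys, with sshd passing the command the client asked for in the `SSH_ORIGINAL_COMMAND` environment variable. The following is only a sketch of that idea, not something from this thread; the wrapper path `/usr/local/bin/metrics-only.sh` and the function name `allowed_command` are hypothetical.

```bash
#!/bin/bash
# Hypothetical wrapper, e.g. /usr/local/bin/metrics-only.sh, referenced from
# ~/.ssh/authorized_keys on the monitored server as:
#   command="/usr/local/bin/metrics-only.sh",no-pty,no-port-forwarding ssh-ed25519 AAAA... monitor
# sshd stores the command the client requested in SSH_ORIGINAL_COMMAND.

# Print the command to run for an allowed request, or return 1 otherwise.
allowed_command() {
  case "$1" in
    "df -h")   echo "df -h" ;;
    "free -m") echo "free -m" ;;
    *)         return 1 ;;
  esac
}

# Only act when invoked through sshd with a requested command.
if [ -n "${SSH_ORIGINAL_COMMAND:-}" ]; then
  if cmd=$(allowed_command "$SSH_ORIGINAL_COMMAND"); then
    $cmd   # intentionally unquoted so "df -h" splits into command and flag
  else
    echo "command not permitted" >&2
    exit 1
  fi
fi
```

With this in place, `ssh monitor@host 'df -h'` would run, while any other requested command would be rejected. Whether that is an acceptable risk is a separate question.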

rihards-simanovics commented 1 year ago

This user doesn't need any permissions except the df -h and free -m commands. You can always restrict a user to only one or two commands, or give the user limited SSH access that only allows sending these commands

I did think of that. That being said, it is still a very junky solution (hence why I didn't mention it). Besides, it's already been mentioned in this discussion that much better paid applications are available. If you have enough servers to warrant an advanced system like that, perhaps it's time to get the wallet out?

This user doesn't need any permissions except the df -h and free -m commands. You can always restrict a user to only one or two commands, or give the user limited SSH access that only allows sending these commands via SSH, @rihards-simanovics. You don't have to give full SSH login access, so there are no security issues. Today I have it automatically sent to my email / Telegram from each server.

Again, it's almost as if security holes don't exist. You are playing an extremely dangerous game by even allowing a potential hacker to log in. Look, I'm no security expert, but I can guarantee you that gaining access as a "limited user" is the first step to a full-blown hack, so let's not.

InSelfControll commented 1 year ago

The other option is to fix the passive push monitor and let users send custom messages with it.

Right now the only message it sends is "ok", which is a kind of useless message.

I want to send the script output via curl into the passive push monitor instead of just receiving an "ok" message.

Currently each of my VMs runs the script all the time, and if the disk usage is more than 85% I receive an email with the status and details about the disk usage.

markdesilva commented 1 year ago

@InSelfControll

Maybe I don't quite understand your description, but the status msg can take any message. It does not accept messages in quotes like "Service is up", but it will take URL-encoded spaces (%20), as in Service%20is%20up. E.g.:

# Average round-trip time in ms, taken from the last line of ping's summary
attl=$(/usr/bin/ping -c 1 <UK server IP> | tail -1 | /usr/bin/cut -d"/" -f 5)

# Push the message and the ping time to the push monitor
/usr/bin/curl -k "https://<UK server IP>:3001/api/push/XXXXXXXXX?msg=Service%20is%20up,%20ping%20time%20is%20$attl&ping=$attl"

As you can see, you can even pass variables (in this case the ping time to the UK server).

Then your UK will show this:

[Screenshot: uk-push-service-msg]
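
Hand-writing %20 gets tedious once the message contains script output. A small helper can do the percent-encoding; the sketch below is an illustration only (the function name `urlencode` is made up here, and it assumes plain ASCII input). curl can also do the encoding itself if the message is passed with `-G --data-urlencode "msg=..."`.

```bash
#!/bin/bash
# Percent-encode a string so it can be used as a query-string value.
# Assumes single-byte (ASCII) input; multi-byte characters need extra handling.
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      # Unreserved characters are passed through unchanged.
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      # Everything else becomes %XX, where XX is the character code in hex.
      *) out+=$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s' "$out"
}

msg=$(urlencode "Service is up")
echo "$msg"
```

The encoded value can then be placed after `msg=` in the push URL in the same way as `$attl` above.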

If you're using a Linux machine, normal users (non-root, no sudo) have access to cat /proc/cpuinfo and cat /proc/meminfo, as well as df, and can use cut, sed, awk, or grep to extract whatever info they need and pass it into the status message. There's no need to give SSH access to UK, or sudo, or whatever, so no security concerns. For Windows, I think there are equivalent PowerShell commands normal users can use to get the values for CPU usage, memory usage and disk usage (Get-Volume).
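
A minimal sketch of that approach, assuming a Linux host with awk and a MemAvailable line in /proc/meminfo (present on kernels 3.14 and newer). The function names `mem_used_pct` and `root_disk_pct` are made up for illustration, and the final echo only prints the query string that would be appended to a push URL.

```bash
#!/bin/bash
# Percentage of memory in use, computed from /proc/meminfo-style input on stdin.
mem_used_pct() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
       END { if (t > 0) printf "%d", (t - a) * 100 / t }'
}

# Usage percentage of the root filesystem, from `df -P` output on stdin.
root_disk_pct() {
  awk '$NF == "/" { sub(/%/, "", $5); printf "%s", $5 }'
}

if [ -r /proc/meminfo ]; then
  mem=$(mem_used_pct < /proc/meminfo)
  disk=$(df -P | root_disk_pct)
  # %20 and %25 are the URL-encoded space and percent sign.
  echo "msg=Mem%20${mem}%25,%20disk%20${disk}%25"
fi
```

Both functions read from stdin, so they can be checked against canned input before being wired into cron.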

Hope it works for you.

Cheers!

rihards-simanovics commented 1 year ago

@markdesilva this seems like a better solution, but would this generate a notification? @InSelfControll needs this to see what the status of the hardware is on their telegram.

InSelfControll commented 1 year ago

@markdesilva this seems like a better solution, but would this generate a notification? @InSelfControll needs this to see what the status of the hardware is on their telegram.

With push notifications I get it directly to my teams / telegram as it should be. I'll keep testing it and update.

The issue now is that the monitor gets the heartbeat but never sends the message more than once.

Example: (for the test, the check sends a critical alert if disk_usage is higher than 1%)

#!/bin/bash

# Get disk usage
disk_usage=`/usr/bin/df -h | /usr/bin/grep "fedora" | /usr/bin/awk 'END {print $5}' | /usr/bin/tr -d "%"`
disk_usage1=`/usr/bin/df -h | /usr/bin/grep "fedora" | /usr/bin/awk 'END {print $5}'`

# Check if disk usage is above the threshold (85% intended; 1% here for testing)
if [ "$disk_usage" -gt 1 ]; then
  # Send push notification
  /usr/bin/curl -k "http://1.1.1.1:3001/api/push/********?msg=Disk%20Usage%20is%20high:${disk_usage}%25"
fi

Look at the script and the picture.

markdesilva commented 1 year ago

@InSelfControll

Hi so sorry for the late response, I was out. Let me take a look at the script and what you want to do and get back to you.

markdesilva commented 1 year ago

@InSelfControll,

With push notifications I get it directly to my teams / telegram as it should be. I'll keep testing it and update.

The issue now is that the monitor gets the heartbeat but never sends the message more than once.

Example: (for the test, the check sends a critical alert if disk_usage is higher than 1%)

Hi, so from what I understand, UK only reports on either status up or down. Once a status is reported, it will not report again unless the status changes (from up to down or down to up). UK works that way for all reports. The idea behind this is that UK won't spam you multiple times via email, Telegram, etc. while you are away from your system and can't rectify the error.

The only way to keep reporting the critical error is to keep flipping the status between up and down.

When you first report it down, store that somewhere. When the script next reports, check the previous status, and if it's "down", change your URL status to "up" and replace the stored status. The next time it checks the stored status, it will be "up", so it will change the status to "down", and so on. For your code:

#!/bin/bash

# Get disk usage
disk_usage=`/usr/bin/df -h | /usr/bin/grep "extmedia2" | /usr/bin/awk 'END {print $5}' | /usr/bin/tr -d "%"`

# Check status file and flip status for continuous notices
if [ -f /tmp/du.status ]; then
   if [ `cat /tmp/du.status` == "up" ]; then
        echo "down" > /tmp/du.status
   else
        echo "up" > /tmp/du.status
   fi
else
   echo "up" > /tmp/du.status
fi

udstatus=`cat /tmp/du.status`

# Check if disk usage is above the threshold (85% intended; 1% here for testing)
if [ "$disk_usage" -gt 1 ]; then
  # Send push notification
  /usr/bin/curl -k "https://1.1.1.1:3001/api/push/**********?status=$udstatus&msg=Disk%20Usage%20is%20high:${disk_usage}%25"
fi

Your UK will look like this:

[Screenshot: uk-push_flipstatus]

Take note, this will keep spamming you until you pause the monitor or disable the cron for the script.

Honestly I think the default of only sending the message once is the right way to go. Hope this helps.

Cheers!

InSelfControll commented 1 year ago

Hi, thanks for your reply. I think UK should have an option to repeat the notification every X times if the issue hasn't been fixed.

Let's say it repeats 4 times, one minute apart: after 3 times the report changes to critical, and the fourth time it is sent via email by the script. UK would keep reporting, only for issues, until it is fixed.

markdesilva commented 1 year ago

@InSelfControll

Silly me, there is already an option in the monitor for sending messages on consecutive heartbeats missed.

[Screenshot: uk_retries]

This will send the msg to your telegram every minute, but it will NOT reflect in the status on UK multiple times, only once. The only way I can find to have it send to telegram and to show on the UK status multiple times is what I said in my previous post.

markdesilva commented 1 year ago

Hi, thanks for your reply. I think UK should have an option to repeat the notification every X times if the issue hasn't been fixed.

Let's say it repeats 4 times, one minute apart: after 3 times the report changes to critical, and the fourth time it is sent via email by the script. UK would keep reporting, only for issues, until it is fixed.

Right, so you can sort of do this by setting the "resend notification if down X times consecutively" option to "4". But like I said, it will not update the UK status; it will only keep sending to your alert channel (Telegram, etc.).

InSelfControll commented 1 year ago

Hi, thanks for your reply. I think UK should have an option to repeat the notification every X times if the issue hasn't been fixed.

Let's say it repeats 4 times, one minute apart: after 3 times the report changes to critical, and the fourth time it is sent via email by the script. UK would keep reporting, only for issues, until it is fixed.

Right, so you can sort of do this by setting the "resend notification if down X times consecutively" option to "4". But like I said, it will not update the UK status; it will only keep sending to your alert channel (Telegram, etc.).

The issue is that if you mark it as down in the URL, it will not send the correct message; it just sends "no heartbeat" instead of my message.

markdesilva commented 1 year ago

Yes, you are right. In the alert (eg: telegram) message, it will only say "No heartbeat in the time window". For your own message to appear in the alert message, you will need to use my modifications to your script, just that the status will keep flipping between up and down.

InSelfControll commented 1 year ago

Yes, you are right. In the alert (eg: telegram) message, it will only say "No heartbeat in the time window". For your own message to appear in the alert message, you will need to use my modifications to your script, just that the status will keep flipping between up and down.

Hey look, I have another script that works really nicely with a custom status=down message, but the only issue now is that, for some reason, it doesn't get the %25 (%) symbol at the end.

#!/bin/bash

# This script monitors RAM, CPU, and disk usage and sends an alert if disk usage is higher than 85%.

# Get current disk usage
DISKUSAGE=`/usr/bin/df -h | /usr/bin/awk '$NF=="/"{printf "%s\t\t", $5}' | /usr/bin/tr -d "%"`
# Get current RAM and CPU usage
RAM=`free -m | awk 'NR==2{printf "Memory Usage: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }'`
CPU=`top -bn1 | grep load | awk '{printf "CPU Load: %.2f\n", $(NF-2)}'`
# Check if disk usage is higher than 85%
if [[ ${DISKUSAGE%?} -gt 21 ]]; then
  echo "High Disk Usage: $DISKUSAGE"
  echo "$RAM"
  echo "$CPU"
  # Send alert
  curl -s "https://uk.***com/api/push/******?status=down&msg=Disk%20usage%20is%20high:${DISKUSAGE}%25"
  else
          curl -s "https://uk.****.com/api/push/******?status=up&msg=Disk%20usage%20is%20Fixed:${DISKUSAGE}%25"
fi

I would be very happy if we can fix it together :)

markdesilva commented 1 year ago

Hey look, I have another script that works really nicely with a custom status=down message, but the only issue now is that, for some reason, it doesn't get the %25 (%) symbol at the end. I would be very happy if we can fix it together :) @InSelfControll

The problem is that there are trailing tabs (\t) on your DISKUSAGE variable.

If you put [ ] in front of and behind the DISKUSAGE variable when you echo it (line 12), you will see:

echo "High Disk Usage: [$DISKUSAGE]"

If you echo the curl url in your code you will also see:

echo "https://uk.***com/api/push/******?status=down&msg=Disk%20usage%20is%20high:${DISKUSAGE}%25"
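
The same check can be done without brackets: bash's `printf '%q'` prints a value with its escapes, which makes trailing tabs visible, and a parameter expansion can strip them. This is a sketch for illustration; the sample value imitates what awk's `printf "%s\t\t"` followed by `tr -d "%"` produces, and `strip_ws` is a made-up helper name.

```bash
#!/bin/bash
# Simulate the value produced by awk's printf "%s\t\t" followed by tr -d "%".
# Command substitution removes trailing newlines but keeps trailing tabs.
DISKUSAGE=$(printf '42%%\t\t' | tr -d '%')

# %q prints the value in escaped form, so the trailing tabs show up.
printf 'raw:     %q\n' "$DISKUSAGE"

# Remove all whitespace before using the value in a test or a URL.
strip_ws() {
  printf '%s' "${1//[[:space:]]/}"
}

CLEAN=$(strip_ws "$DISKUSAGE")
printf 'cleaned: %q\n' "$CLEAN"
```

Dropping the `\t\t` from the awk format string avoids the problem at the source; stripping afterwards is just a safety net.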

Also the first line should be #!/bin/bash not !/bin/bash

Cheers!

InSelfControll commented 1 year ago

Also the first line should be #!/bin/bash not !/bin/bash

It didn't copy the first line 😆 my script already has #!/bin/bash.

markdesilva commented 1 year ago

It didn't copy the first line 😆 my script already has #!/bin/bash.

Oh ok. Also, it's trailing tabs, not a newline, my mistake.

InSelfControll commented 1 year ago

Oh ok. Also, it's trailing tabs, not a newline, my mistake.

Yeah, printf was adding tabs. I removed them, but curl still doesn't get the %25 ASCII.

markdesilva commented 1 year ago

@InSelfControll

Yeah, printf was adding tabs. I removed them, but curl still doesn't get the %25 ASCII.

Funny, it works for me.

This is my modified version of your script:

#!/bin/bash

# This script monitors RAM, CPU, and disk usage and sends an alert if disk usage is higher than 85%.

# Get current disk usage
DISKUSAGE=`/usr/bin/df -h | /usr/bin/awk '$NF=="/" {printf "%s", $5}' | /usr/bin/tr -d "%"`
# Get current RAM and CPU usage
RAM=`free -m | awk 'NR==2{printf "Memory Usage: %s/%sMB (%.2f%%)\n", $3,$2,$3*100/$2 }'`
CPU=`top -bn1 | grep load | awk '{printf "CPU Load: %.2f\n", $(NF-2)}'`
# Check if disk usage is higher than 85%

if [ ${DISKUSAGE} -gt 1 ]; then
  echo "High Disk Usage: [$DISKUSAGE]"
  echo "$RAM"
  echo "$CPU"
  # Send alert
  curl -s "https://uk.***com/api/push/******?status=up&msg=Disk%20usage%20is%20high:${DISKUSAGE}%25"
else
  curl -s "https://uk.***com/api/push/******?status=up&msg=Disk%20usage%20is%20Fixed:${DISKUSAGE}%25"
fi

Did you forget your 3001 or are you running your UK server on 443?

InSelfControll commented 1 year ago

Did you forget your 3001 or are you running your UK server on 443?

This is my private server; the other one is staging at work with port 3001. This specific URL runs behind a Traefik proxy on Docker :) so no port. I fixed the disk usage and am now working on CPU and memory, which I have a little issue with.

Trying to figure out how to get only the percentage of CPU and memory for doing the test for it to

markdesilva commented 1 year ago

Ah ok. In any case, the %25 works for me.

For CPU % used, have you tried using iostat (apt-get install sysstat) and piping that into bc?

# Idle CPU percentage: sixth field of the line after the avg-cpu header
CPUP_IDLE=$(iostat | grep -A1 "avg-cpu" | awk '{print $6}' | tail -1)
CPUP_USED=$(echo "100 - $CPUP_IDLE" | bc)

For memory, take available/total from free -g and also pipe it into bc.
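
Note that iostat's first report averages over the time since boot, so the figure above moves slowly. If a short-window value is wanted and sysstat or bc are not installed, one alternative is to sample the first line of /proc/stat twice. This is a different technique from the iostat one above, shown only as a sketch; the function name `cpu_busy_pct` and the one-second interval are arbitrary choices.

```bash
#!/bin/bash
# Busy CPU percentage between two "cpu ..." lines taken from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal
# (the guest columns that follow are already counted in user and nice).
cpu_busy_pct() {
  local -a a b
  read -r -a a <<< "$1"
  read -r -a b <<< "$2"
  local i total=0 idle
  # Sum the change in the first eight counters to get total elapsed ticks.
  for (( i = 1; i <= 8 && i < ${#a[@]}; i++ )); do
    total=$(( total + b[i] - a[i] ))
  done
  # Treat the idle and iowait columns (fields 4 and 5) as idle time.
  idle=$(( (b[4] - a[4]) + (b[5] - a[5]) ))
  if (( total > 0 )); then
    echo $(( (total - idle) * 100 / total ))
  else
    echo 0
  fi
}

if [ -r /proc/stat ]; then
  first=$(head -n 1 /proc/stat)
  sleep 1
  second=$(head -n 1 /proc/stat)
  echo "CPU used: $(cpu_busy_pct "$first" "$second")%"
fi
```

The result is an integer, so it can go straight into a `[ ... -gt ... ]` test like the disk usage check.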

InSelfControll commented 1 year ago

Done! Thanks for your help. I think UK needs a feature that sends critical messages about hosts when there are issues other than the server being down, similar to how the HTTP test can monitor for a string and, even with a 200 response code, mark it as missing the string.

I would love to see UK get this feature in the next release 😎

InSelfControll commented 1 year ago

I just found an issue while running it via Docker and on a VM with Node.js. It keeps losing the connection and then sends service down ("no heartbeat") because the connection is "down", even when the service is up and running correctly.
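
If the drops are short network hiccups between the script and Uptime Kuma (the thread does not establish the cause), retrying the push a few times before giving up might avoid some of the false "no heartbeat" alerts. A sketch with a hypothetical `retry` helper; `PUSH_URL` in the commented example is a placeholder for your own push monitor URL.

```bash
#!/bin/bash
# Run a command up to $1 times, pausing $2 seconds between attempts.
# Returns 0 as soon as one attempt succeeds, 1 if all attempts fail.
retry() {
  local attempts="$1" delay="$2" n
  shift 2
  for (( n = 1; n <= attempts; n++ )); do
    "$@" && return 0
    # Do not sleep after the final failed attempt.
    (( n < attempts )) && sleep "$delay"
  done
  return 1
}

# Example (PUSH_URL is a placeholder):
# retry 3 5 curl -fsS --max-time 10 "$PUSH_URL?status=up&msg=OK"
```

curl also has built-in `--retry` and `--retry-delay` options that cover the common cases; the wrapper is only useful if other commands need the same treatment.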

BasToTheMax commented 1 year ago

Any updates? Would love to have this.

CommanderStorm commented 1 year ago

Quick reminder for everybody: issues are for discussing what needs to be done, how, and by whom. We use 👍🏻 on issues to prioritise work. As always, pull requests are welcome.

You can currently implement such a monitor without any modification via the push monitor, as stated above. If you want to simplify this process, adding a monitor via a PR is a possibility (see our contribution guide for additional details).

milzamsz commented 6 months ago

Found an existing solution: https://github.com/msgbyte/tianji

MichaelBelgium commented 6 months ago

@milzamsz If I'm correct, on tianji you just have a list of servers. I don't see the possibility to, for example, set alerts when CPU usage > x%.

Nor to add a server to a status page? And it doesn't support MySQL/MariaDB like v2 of uptime-kuma?

milzamsz commented 6 months ago

@MichaelBelgium Yes, but it's better for me, as I don't need advanced monitoring like netdata/Prometheus. I'm installing it with easypanel and it comes with Postgres.

AquaMCU commented 5 months ago

Hi. I'd like to join the conversation ;)

Another idea: how about KUMA just adds a new monitor that requests a page from the server to be monitored? Here you guys can knock yourselves out and implement your GO_JS_PHP_BASH_FORTRAN or whatever module that just responds with a percentage and does not respond when it's critical.

As for KUMA, for this Monitor, just display the data and make it NOK when it is not responding.

... easy to do for the KUMA team and nice and hackable for the rest of us ;)

Oliver

andrijs29 commented 2 months ago

Will this make Uptime Kuma like HetrixTools? For me it would be good if UK could have something like the CPU/network/RAM/disk usage in HetrixTools.

I'm currently using Hetrix, and it helps show the current VPS usage, and what happened before I logged in to my terminal, without logging in via SSH.

apio-sys commented 2 months ago

This is about uptime, as the name of the project says, not resource monitoring. Those are different issues and have their own tools. You might use Uptime Kuma to show a client that the SLA is respected, and you might have internal performance monitoring tools so you can adjust capacity as needed so that the customer sees their SLA respected. That's just MHO, and I really think you shouldn't mix these up, or you'll end up with a tool that doesn't really fit any need anymore.

louislam / uptime-kuma