hetrixtools / agent

HetrixTools Server Monitoring Agent (Linux)
https://hetrixtools.com/uptime-monitor/

Standardized protocol #31

Open foxycode opened 4 years ago

foxycode commented 4 years ago

I made a mistake: base64 was missing after a system update, and that was why the script wasn't working, sorry. Still, it would be nice to have a standardized protocol.

hetrixtools commented 4 years ago

Hello,

I can confirm that the back-end code for each version does not change when newer agent versions are released, meaning that all of our older agents are still fully compatible and working.

We'll work on a standardized protocol for future agent versions.

Thanks for the feedback.

sholwe commented 4 years ago

Can we get a documented API? I'd like to extend this outside of Linux as well. It was trivial to make it work with Alpine, but some of this is too Linuxish to support the BSDs as-is. I'd prefer to wait for something documented before writing a compatible posting tool.

Thanks!

foxycode commented 4 years ago

I'd like a documented API too. Right now, some features like RAID health are parsed and processed on the backend, which isn't ideal.

hetrixtools commented 4 years ago

Hello,

We'll be working on a standardized protocol for our agent in a future release, along with documentation info regarding this.

Thank you for the feedback.

foxycode commented 4 years ago

Do you have any release date?

hetrixtools commented 4 years ago

"Do you have any release date?"

Unfortunately not at this time.

sholwe commented 4 years ago

Most of this was done by hand, since I can't always use Linuxisms. This works for version 1.59; I've implemented (most of) it for OpenBSD.

POSTDATA="v=$VERSION&s=$SID&d=$OS|$Uptime|$CPUModel|$CPUSpeed|$CPUCores|$CPU|$IOW|$RAMSize|$RAM|$SwapSize|$Swap|$DISKs|$NICS|$ServiceStatusString|$RAID|$DH|$RPS1|$RPS2|$IOPS|$CONN|$DISKi"

v= current version string - 1.59 (may be decimal 2 precision)
s= Local system string hash (Site ID)
d= String [see below, all terminated with pipes]
OS (b)= String - Shortname or $(uname -s)$(uname -r)"|"$(uname -r)"|"RequiresReboot INT (1 true or 0)
Uptime = seconds since boot
CPUModel (b) = string
CPUSpeed (b) = speed of CPU (int)
CPUCores= int number of cores
CPU = Average of CPUSpeed for post period
IOW = IOWait decimal 2 precision
RAMSize = Complete RAM size (MB)
RAM = used RAM (MB) in percentage
SwapSize = Total (MB)
Swap = Used (MB) in percentage 
NICS (gb) = (array) "|"interface";"inbytes";"outbytes";""|"interface";"inbytes";"outbytes";'... 
DISKs (gb) =  (array) mount point, totalsize (bytes), available(bytes)
RAID (gb) = {{have not implemented}}
DH (gb) = {{have not implemented}} (array) {lsblk name"|{smartctl -H}|"...}
RPS1 = unimplemented
RPS2 = unimplemented
IOPS (gb) =  {{have not implemented}}
CONN (b) = (array) "PortNumber"|"NumberOfConnectionsToPort";"
DISKi (gb) = (array) mountpoint, total inodes, used inodes, available inodes";"

Fields marked (g) are compressed for posting with gzip -cf; fields marked (b) are base64 encoded and passed through base64prep() (in the script).
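As a rough sketch of the layout above, the d= value is the 21 variables from the POSTDATA line joined with pipes, in that order. The helper name build_postdata and all sample values below are hypothetical, the fields are assumed to be already encoded where (b) or (g) applies, and leaving unimplemented fields empty is an assumption rather than something the back end is known to accept. Following the POSTDATA line literally, no trailing pipe is added.

```shell
# Sketch: join already-encoded fields into the POST body described above.
# build_postdata is a hypothetical helper, not part of the agent script.
# The body runs in a subshell so the IFS change does not leak to the caller.
build_postdata() (
  version=$1
  sid=$2
  shift 2
  # The POSTDATA line lists 21 pipe-separated variables (OS ... DISKi).
  if [ "$#" -ne 21 ]; then
    echo "expected 21 fields, got $#" >&2
    exit 1
  fi
  IFS='|'   # "$*" joins the remaining arguments with the first IFS character
  printf 'v=%s&s=%s&d=%s\n' "$version" "$sid" "$*"
)

# Example call with placeholder values; the ten trailing fields are left empty.
build_postdata "1.59" "SID_PLACEHOLDER" \
  "T1M=" 86400 "Q1BV" "MjQwMA==" 4 12.50 0.10 7936 41.20 1024 0.00 \
  "" "" "" "" "" "" "" "" "" ""
```

Checking the field count up front is a cheap way to catch a dropped or extra field before the back end silently misreads the rest of the string.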

Yes, this is really brief, and an enormous mess. The biggest issue I ran into is with their "base64prep" function, which is nonstandard as well - it just rewrites characters so the value can be posted without being escaped by the web service. "+" is converted to "%2B" and "/" is rewritten to "%2F" - kind of a mini htmlspecialchars().
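A minimal sketch of that escaping, written from the description in this comment rather than copied from the agent script. The helper names encode_b and encode_gb are hypothetical, stripping base64 line wraps with tr is my own choice, and the gzip-then-base64 order for (gb) fields is an assumption based on the legend above.

```shell
# Escape the two base64 characters that would otherwise be mangled in the POST body.
base64prep() {
  printf '%s' "$1" | sed -e 's/+/%2B/g' -e 's|/|%2F|g'
}

# (b) fields: base64-encode, drop line wraps, then escape.
encode_b() {
  base64prep "$(printf '%s' "$1" | base64 | tr -d '\n')"
}

# (gb) fields: gzip first, then the same base64 + escape step.
encode_gb() {
  base64prep "$(printf '%s' "$1" | gzip -cf | base64 | tr -d '\n')"
}

base64prep 'a+b/c='   # prints a%2Bb%2Fc=
```

Only "+" and "/" are touched, so this is not a general URL encoder; "=" padding passes through unchanged.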

The way the script gets average network data is one of the most bizarre things I've ever seen to date. It makes an array and loops several times to increment over the period of time that it expects to run (roughly a minute). Since I can rely on getting pretty normalized data over a period of time, I take a snapshot when it first runs, then count the bytes sent/received before I have the script echo roughly 52 seconds later. Still a cheat, but accurate enough for a 0.01 release.
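The snapshot approach described above can be sketched roughly as follows. All function names are hypothetical, read_counters assumes Linux-style /sys/class/net counters (on OpenBSD the numbers would have to come from netstat instead), and whether the back end expects the raw byte difference or a per-second average is not documented in this thread, so the sketch prints both.

```shell
# Read the current rx/tx byte counters for one interface (Linux-only assumption).
read_counters() {
  iface=$1
  rx=$(cat "/sys/class/net/$iface/statistics/rx_bytes")
  tx=$(cat "/sys/class/net/$iface/statistics/tx_bytes")
  printf '%s %s\n' "$rx" "$tx"
}

# Arguments: rx_before tx_before rx_after tx_after seconds
# Prints "rx_delta;tx_delta;rx_per_sec;tx_per_sec" using integer division.
nic_delta() {
  rx_delta=$(( $3 - $1 ))
  tx_delta=$(( $4 - $2 ))
  printf '%s;%s;%s;%s\n' "$rx_delta" "$tx_delta" \
    "$(( rx_delta / $5 ))" "$(( tx_delta / $5 ))"
}

# Take one snapshot, wait, take another, and report the difference.
sample_nic() {
  iface=$1
  seconds=$2
  before=$(read_counters "$iface")
  sleep "$seconds"
  after=$(read_counters "$iface")
  # Unquoted on purpose so each "rx tx" pair splits into two arguments.
  nic_delta $before $after "$seconds"
}

# Usage (blocks for the whole interval): sample_nic eth0 52
```

Two snapshots miss any burst pattern inside the window and do not handle counter resets, which is the trade-off against the looping approach in the original script.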

hetrixtools commented 4 years ago

@sholwe thank you for putting in the time to write all of this down.

We know that the agent data aggregation is quite messy at this time, the person who coded it did not do it justice; however, the collected stats are on par with many other tested tools.

We'll work on a standardized protocol, along with more code cleanup/optimization, in the next major agent release version.

Thanks again for your time and effort.

sholwe commented 4 years ago

Hi @hetrixtools -

As @foxycode has stated, your service seems to take much of this raw data and decide what to do with it when it's parsed on your end. That means we'll need to adapt any specific information for the RAID, etc, and hope that it's handled correctly. Can we get a basic post system for you to store and aggregate without whatever logic is being used there?

Thanks - when I clean it up, I'll submit my OBSD code to you; I haven't got a FreeBSD box at the moment, but since it's primarily sysctl/netstat based, shouldn't take much effort.

foxycode commented 4 years ago

Since it's relevant, I'll add link to my SmartOS/Solaris fork: https://github.com/sunfoxcz/hetrixtools-agent-smartos/tree/smartos

sholwe commented 4 years ago

Yaay! I miss Solaris. 2.6 5/98 will forever be in my heartworms. Here's an OpenBSD "functional" version.

https://github.com/sholwe/hetrixtools-agent-openbsd

foxycode commented 4 years ago

@hetrixtools Maybe adding links to the forks in the repository README would be nice?

hetrixtools commented 4 years ago

@foxycode added.

Thank you everyone for your contributions.

foxycode commented 2 years ago

@hetrixtools Any progress with the standardized protocol? My agent implementation won't show SMART status after upgrading to the latest SmartOS version, and once again I have no idea why and can't debug it.

sholwe commented 2 years ago

@foxycode It's going to be here:

if [ "$CheckDriveHealth" -gt 0 ]
then
    if [ -x "$(command -v smartctl)" ] #Using S.M.A.R.T. (for regular HDD/SSD)
    then
        for i in $(diskinfo -cH | grep -v "\?\?R" | awk '{ print $2 }')
        do
            DHealth=$(smartctl -A /dev/rdsk/$i)
            if grep -q 'Attribute' <<< $DHealth
            then
                DHealth=$(smartctl -H /dev/rdsk/$i)"\n$DHealth"
                DH="$DH|1\n$i\n$DHealth\n"
            fi
        done
    fi
    if [ -x "$(command -v nvme)" ] #Using nvme-cli (for NVMe)
    then
        for i in $(lsblk -l | grep 'disk' | awk '{ print $1 }')
        do
            DHealth=$(nvme smart-log /dev/$i)
            if grep -q 'NVME' <<< $DHealth
            then
                if [ -x "$(command -v smartctl)" ]
                then
                    DHealth=$(smartctl -H /dev/${i%??})"\n$DHealth"
                fi
                DH="$DH|2\n$i\n$DHealth\n"
            fi
        done
    fi
fi

I'm afraid I haven't touched SmartOS in ages. Check whether smartctl has been deprecated or its output format has changed. You can still use my reverse-engineered POST data above to roll your own.

foxycode commented 2 years ago

@sholwe I already fixed it, but the problem is that smartctl output is analyzed on the hetrixtools side, which is a bad concept. No one can implement their own disk check. If you don't have a proper smartctl on your machine, you're out of luck.

sholwe commented 2 years ago

Yikes. I noticed they were doing this with other data back for the 1.59 release. I saw you were based on 1.58, but wasn't sure what might have been changed.