sensu-plugins / sensu-plugins-memory-checks

This plugin provides native memory instrumentation for monitoring and metrics collection, including: memory usage via `free` and `vmstat`, including metrics. Note that this plugin may have cross-platform issues.
http://sensu-plugins.io
MIT License
15 stars 50 forks source link

Check-memory-percent.sh not returning correct values for Linux systems #82

Open tfalkowski-nsidc opened 5 years ago

tfalkowski-nsidc commented 5 years ago

FWIW, I am using both Ubuntu 16/18 and CentOS 7 systems.

So near as I can tell this issues is caused by an incorrect parsing of the values returned by the free -m command as well as not calculating the free memory correctly.

Using the following output from the free -m command as an example,

                total        used        free      shared  buff/cache   available
     Mem:        3951        2915         224          32         812         731
     Swap:       1951         404        1547

The following code does not make much sense:

elif [ $os = "Linux" ]; then
  # Get total memory available on machine
  TotalMem=$(free -m | grep Mem | awk '{ print $2 }')
  # Determine amount of free memory on the machine
  set -o pipefail
  FreeMem=$(free -m | grep buffers/cache | awk '{ print $4 }')
  if [ $? -ne 0 ]; then
    FreeMem=$(free -m | grep Mem | awk '{ print $7 }')
  fi
fi

The first issue is where the code attempts to set the value for FreeMem.

This command:

FreeMem=$(free -m | grep buffers/cache | awk '{ print $4 }')
# does not actually return anything which causes the program bail out to the line after the "if [ $? -ne 0 ]; then" statement

# It probably ought to  be:
FreeMem$(free -m | grep Mem | awk '{ print $6 }')

However, this is also an issue as is it not the full picture of the free memory on the system.

FreeMem should be defined as:

Free Memory = Free + Buffers + Cached

So, using the the example data above we would have Free Memory = Free (224 MB) + Buffers & Cache (812) = 1,036 MB.

A possible solution might look something like this:

elif [ $os = "Linux" ]; then`
  # Get total memory available on machine
  TotalMem=$(free -m | grep Mem | awk '{ print $2 }')
  # Determine amount of free memory on the machine
  set -o pipefail
  FreeMem=$(free -m | grep Mem | awk '{ print $4 }')
  BuffCache=$(free -m | grep Mem | awk '{ print $6 }')
 if [ $? -ne 0 ]; then
   UsedMem=$(free -m | grep Mem | awk '{ print $3 }')
 fi
fi
# Get percentage of free memory
TotalFree=$(($FreeMem+$BuffCache))
FreePer=$(awk -v total="$TotalMem" -v free="$TotalFree" 'BEGIN { printf("%-10f\n", (free / total) * 100) }' | cut -d. -f1)

Thank you for your effort on this project, -Thomas

EDIT BY @majormoses: used triple backticks to form code blocks so its easier to read

xiongchiamiov commented 5 years ago

Even that isn't really correct. A while back the kernel developers got tired of people calculating free memory with calculations like these, and so they added info into /proc/meminfo specifically for these scenarios. The current version of this plugin is using that for the metrics scripts, but check-memory-percent.rb is still calling out to free and doing a bunch of calculation work; it should probably be using the info from meminfo instead.

I'm not sure why that wasn't done when the metrics were changed in #68. @jpoizat @majormoses can you comment?

xiongchiamiov commented 5 years ago

(If the reason is just "because we didn't", btw, then I might find some time to port this over since it's something that's bugging me.)

majormoses commented 5 years ago

I have not had a time to fully review the code but I suspect some of this is because any time we try to manually parse output from a tool which does not historically cared about maintaining the consistency of the output from version to version. free is available on many linux distributions but there is no guarantee which version and therefore how to parse the output. To illustrate even across two hosts with Ubuntu the output differs. on 14.04:

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7984       5978       2005          0        321       4796
-/+ buffers/cache:        860       7123
Swap:            0          0          0

on 18.04:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15920       12119         984        1439        2816        2093

Here we see the string that we are searching for is different, data in a row moved to a column, when no swap is present the line is omitted, etc making it difficult to handle all of the various versions. On top of this the OS (kernel) changes how it reports memory, this is something that we see also with /proc/meminfo output as well. While we like to pretend that monitoring memory consumption is easy, its very dependent on the system its being run on. I would suggest if you have a new enough kernel (anything 4.x should be good) and you are allowed to have a compiler present to use this check: https://github.com/sensu-plugins/sensu-plugins-memory-checks/blob/master/bin/check-ram.rb as it's the most accurate I have seen cross platform and uses more native OS methods than manually parsing the output of thefree command.

I am all for improving the existing check and if there is a bug we should fix it. We just need to fix it so that it does not break systems where it is properly reporting correctly. It's easy to say its broken for me so it must be broken for everyone but let's remember to take a step back and check if its working for other systems/platforms before deciding that the code is fundamentally flawed. There is certainly more work that could be done to handle all those differences in platform, kernel, version of free, etc and can try to find some time to look more closely at this but I make no promises on my end other than being able to review changes being proposed.

majormoses commented 5 years ago

@xiongchiamiov I can't speak to why it was not done for the check script and only the metric script as I was not the submitter but I can assume he did not use this check. I did ask if he wanted to roll out similar functionality but got no response. I am more than happy to accept another contribution to get that updated but I would not turn away a perfectly good contribution just because I want to keep each of the x number of check/metric scripts for memory in sync. In fact I would argue the fact that we have multiple checks means that different people have different preferences and needs. I personally use https://github.com/sensu-plugins/sensu-plugins-memory-checks/blob/master/bin/check-ram.rb as it uses some lower level hooks than simply parsing a file or stdout/stderr but requires a compiler as it leverages a c extension for the heavy lifting and provides the best xplatform compatibility I have seen. That being said requiring a compiler on a production system is a non starter for some orgs and they prefer to do the parsing at a file/command layer despite that it is more prone to errors and incorrect information. If I change the output of the free one someone is gonna complain that it does not match what they expect. I am all for doing it but we need to balance everyone's needs. I also do not work for sensu and do not receive any compensation for doing this so I can't exactly take on every feature that someone wants as I have to be tactical with how I spend my time maintaining such a large project.

If someone wants to tackle switching from free output to using /proc/meminfo we can release it as a major version bump so that people who properly pin their plugin versions can opt into the new behavior or can choose to using their own fork.

majormoses commented 5 years ago

@xiongchiamiov it looks like someone attempted to try it but then did not respond to my feedback about them breaking systems where it was not needed: https://github.com/sensu-plugins/sensu-plugins-memory-checks/pull/60 If you are willing to work with me I am more than happy to get it over the line.

xiongchiamiov commented 5 years ago

Thanks for all the info. I'm going to poke at using check-ram.rb and see whether it fixes our use case, if not then I'll try to carve out some time for this and will let you know if I do.