socialpoint / check_elasticache

Nagios plugin to check AWS ElastiCache status
MIT License

"UNK Unable to get ElastiCache details and statistics" when using "memory" metric #4

Open mvolhontseff opened 9 years ago

mvolhontseff commented 9 years ago

When executing the following, the script returns "UNK Unable...":

check_elasticache.py --region us-east-1 -i cluster1 -m memory -w 10 -c 5

I traced this to an issue with the metrics dict (used by the function get_cluster_stats):

metrics = {'status': 'ElastiCache availability',
           'cpu': 'CPUUtilization',
           'memory': 'BytesUsedForCache',  # <==== Problem
           'swap': 'SwapUsage'}

I wasn't able to find a metric named "BytesUsedForCache" in the list of available CloudWatch metrics.

I did find BytesUsedForCacheItems and FreeableMemory, however.
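For anyone wanting to reproduce this check, here is a rough sketch of listing the metric names CloudWatch actually publishes for a cluster. The helper name `list_elasticache_metric_names` is made up for illustration and is not part of the plugin; it assumes a boto3-style CloudWatch client (e.g. `boto3.client('cloudwatch', region_name='us-east-1')`) is passed in.

```python
def list_elasticache_metric_names(cloudwatch, cluster_id):
    """Return the sorted metric names CloudWatch reports for one cluster.

    `cloudwatch` is assumed to be a boto3-style CloudWatch client; this
    helper is hypothetical and not part of check_elasticache.py.
    """
    names = set()
    kwargs = {
        'Namespace': 'AWS/ElastiCache',
        'Dimensions': [{'Name': 'CacheClusterId', 'Value': cluster_id}],
    }
    while True:
        page = cloudwatch.list_metrics(**kwargs)
        names.update(m['MetricName'] for m in page.get('Metrics', []))
        token = page.get('NextToken')
        if not token:
            return sorted(names)
        # More results are available: request the next page.
        kwargs['NextToken'] = token
```

If the name in the plugin's metrics dict is missing from the returned list, the statistics call has nothing to return, which would match the "UNK" result above.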

fr3nd commented 9 years ago

This metric only makes sense for Redis ElastiCache instances. If you're using memcached, it's not available.

mvolhontseff commented 9 years ago

OK, thanks for the clarification; this was for memcached.

Since this plugin is designed to be used for both Redis and memcached, does it make sense to use the Host-Level metric "FreeableMemory" instead?

http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/CacheMetrics.HostLevel.html
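One possible shape for this, as a sketch only: pick the memory metric by engine and fall back to the host-level metric otherwise. The names `ENGINE_MEMORY_METRICS` and `memory_metric_for` are hypothetical and do not exist in the plugin. Note that FreeableMemory reports free bytes rather than used bytes, so the warning and critical thresholds would have to be compared in the opposite direction.

```python
# Engine-specific "memory used" metrics mentioned in this thread.
ENGINE_MEMORY_METRICS = {
    'redis': 'BytesUsedForCache',
    'memcached': 'BytesUsedForCacheItems',
}

# Host-level metric that is published for both engines.
HOST_LEVEL_MEMORY_METRIC = 'FreeableMemory'


def memory_metric_for(engine, prefer_host_level=False):
    """Choose the CloudWatch metric name to use for the 'memory' check.

    Hypothetical helper: `engine` is the cluster engine name, e.g.
    'redis' or 'memcached'. Unknown engines get the host-level metric.
    """
    if prefer_host_level:
        return HOST_LEVEL_MEMORY_METRIC
    return ENGINE_MEMORY_METRICS.get(engine.lower(), HOST_LEVEL_MEMORY_METRIC)
```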

DdPerna commented 7 years ago

I am having a similar issue using Redis ElastiCache instances.

The difference is that the check does work, but periodically it shows "UNK Unable to get ElastiCache details and statistics", which results in Nagios alerting. It seems to happen randomly, about twice a week, on a replica node. When this happens I run the command

check_elasticache.py -r us-east-1 -i example-redis-002 -m memory -w 80 -c 90

and it bounces between returning a result and showing unknown for a couple of minutes, then returns to normal. I'm not sure whether the issue is with the AWS API or whether the start and end times used to fetch the metric are the cause. I noticed that the cpu metric takes into account a delay for the metric updating on CloudWatch, and was wondering if adding this for the memory metric would fix the issue?
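To make the idea concrete, here is a minimal sketch of a delayed query window, not the plugin's actual code. The function names and the default values (60-second period, 300-second delay, three periods) are assumptions chosen for illustration; I have not checked what delay the cpu path uses. The window ends a few minutes in the past and spans several periods, and the newest returned datapoint is used, so one late datapoint does not produce an empty result.

```python
from datetime import datetime, timedelta, timezone


def metric_window(now, period_s=60, delay_s=300, span_periods=3):
    """Return (start, end) for a CloudWatch query that ends `delay_s` ago.

    `now` must be a timezone-aware datetime. The end time is rounded
    down to a period boundary. All defaults are illustrative guesses.
    """
    end_epoch = int(now.timestamp()) - delay_s
    end_epoch -= end_epoch % period_s
    end = datetime.fromtimestamp(end_epoch, tz=timezone.utc)
    start = end - timedelta(seconds=period_s * span_periods)
    return start, end


def latest_datapoint(datapoints):
    """Return the newest datapoint dict, or None if the list is empty.

    CloudWatch does not guarantee datapoint order, so pick by 'Timestamp'.
    """
    if not datapoints:
        return None
    return max(datapoints, key=lambda d: d['Timestamp'])
```

With something like this, the check would only need to report unknown when `latest_datapoint` returns None for the whole window.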