davidschober closed this 6 years ago
Based on the finding above, I have implemented a new ElastiCache alert in CloudWatch named stack-p-avr-elasticache-freeablememory which will alert us via OpsGenie if/when FreeableMemory drops below 1GB for 10 minutes. @davidschober @MANorth @mbklein
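For anyone recreating that alarm later, here is a rough sketch of what it could look like with boto3. Only the alarm name, the metric (FreeableMemory), and the "below 1GB for 10 minutes" rule come from this thread. Everything else is an assumption: reading 1GB as 1 GiB, splitting 10 minutes into two 5-minute periods, the `Average` statistic, and the `cluster_id` / `sns_topic_arn` arguments (hypothetical placeholders for the real cluster and the topic that feeds OpsGenie).

```python
GIB = 1024 ** 3  # assumption: "1GB" read as 1 GiB; FreeableMemory is reported in bytes


def freeable_memory_alarm(cluster_id, sns_topic_arn):
    """Build keyword arguments for CloudWatch's put_metric_alarm call.

    cluster_id and sns_topic_arn are hypothetical placeholders; the real
    values are not given in this issue.
    """
    return {
        "AlarmName": "stack-p-avr-elasticache-freeablememory",
        "AlarmDescription": "ElastiCache FreeableMemory below 1 GB for 10 minutes",
        "Namespace": "AWS/ElastiCache",
        "MetricName": "FreeableMemory",
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "Statistic": "Average",             # assumption: the thread does not say
        "Period": 300,                      # assumption: 2 x 5 minutes = 10 minutes
        "EvaluationPeriods": 2,
        "Threshold": float(GIB),
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],    # assumption: SNS topic wired to OpsGenie
    }


def create_alarm(cluster_id, sns_topic_arn):
    """Create or update the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**freeable_memory_alarm(cluster_id, sns_topic_arn))
```

Keeping the parameters in a plain dict makes it easy to review the threshold and timing before anything is sent to AWS.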
With that as an alert and a quick howto from @mbklein to explain which ElastiCache entries can be flushed, and how to do so, we should be alerted before anything goes down and be able to remedy the problem if it should arise again.
@Toputnal thanks! That's awesome.
I have looked at AVR's commit history and found that the commit to avoid this situation was never deployed to production. We need a deployment window to roll out the change.
If we need to wait more than a day, can you please do a quick write-up in the wiki @mbklein so that if this happens again before the deployment window no one needs to call you? :-)
@mbklein Molly OK'd deployment Friday, Monday or Tuesday 8-10 am. Let me do some final testing on the canvas integration.
Kicking off new issue to track deployment.
What happened
09/04/18. Avalon frontend and workers went down hard. After a reboot there was a momentary recovery, then the system went down hard again. Core infrastructure (SOLR, FcRepo) remained operational.
Why do we think it happened?
Redis was showing OOM errors. We could not find CloudWatch stats for that, so we had to search the logs.
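If we end up searching logs for this again, something like the sketch below might save time. It assumes the application logs contain Redis's standard error text ("OOM command not allowed when used memory > 'maxmemory'"), which is what Redis returns when a write is rejected at the memory limit. The helper name `find_redis_oom` and the idea of passing in an open log file are illustrative, not what was actually run during the incident.

```python
import re

# Redis rejects writes at the memory limit with an error that starts with
# "OOM command not allowed"; match that prefix case-insensitively.
OOM_PATTERN = re.compile(r"OOM command not allowed", re.IGNORECASE)


def find_redis_oom(lines):
    """Return (line_number, text) for every line that reports a Redis OOM error.

    `lines` can be any iterable of strings, such as an open log file.
    """
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOM_PATTERN.search(line)
    ]
```

Calling it with an open file handle (`with open(path) as log: find_redis_oom(log)`) returns the matching line numbers, which gives a rough timeline of when the errors started.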
Did something make it worse?
What are we doing now?
What can we do later?
Create better monitoring. (@Toputnal, can you brainstorm here?)
How we fixed it
@mbklein can you note what we did?
Issue References
CloudWatch metrics