Avalon outage 09/04/2018

davidschober commented 6 years ago

What happened

What was the date, duration and business impact of the incident?

09/04/18. The Avalon frontend and workers went down hard. After a reboot there was a momentary recovery, then the system went down hard again. Core infrastructure (Solr, FcRepo) remained operational.

Why do we think it happened?

What was the root cause of the incident?

Redis was showing OOM errors. We could not find CloudWatch stats for that, so we had to search the logs.

Did something make it worse?

What other events or actions contributed to the incident’s severity or duration?

We could use better monitoring on our Redis.

What are we doing now?

What immediate actions are being taken to minimize the causes recurring?

What can we do later?

What long term action is required to reduce the frequency or effect of the root cause and contributing causes? Indicate any barriers to pursuing these items.

Create better monitoring. (@Toputnal can you brainstorm here)

How we fixed it

Describe the technical steps that were taken: from initial detection of the failure, through identification of the problem, to restoration of service.

@mbklein can you note what we did?

Issue References

CloudWatch metrics

[Screenshot: CloudWatch management console, 2018-09-04 16-21-56]

Toputnal commented 6 years ago

Based on the finding above, I have implemented a new ElastiCache alert in CloudWatch named stack-p-avr-elasticache-freeablememory which will alert us via OpsGenie if/when FreeableMemory drops below 1GB for 10 minutes. @davidschober @MANorth @mbklein

With that as an alert and a quick howto from @mbklein to explain which ElastiCache entries can be flushed, and how to do so, we should be alerted before anything goes down and be able to remedy the problem if it should arise again.
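For anyone who needs to recreate the alert described above, here is a minimal sketch of how it might be defined with boto3. Only the alarm name, the FreeableMemory metric, and the "below 1GB for 10 minutes" rule come from this thread. The cluster ID, the SNS topic ARN (assumed here to be how the alarm reaches OpsGenie), the 1-minute period, the Average statistic, and reading 1GB as 1024^3 bytes are all assumptions, so adjust them to match the real stack.

```python
def freeable_memory_alarm(cluster_id, sns_topic_arn):
    """Build keyword arguments for CloudWatch's put_metric_alarm.

    cluster_id and sns_topic_arn are hypothetical placeholders; the real
    values for the AVR production stack are not recorded in this issue.
    """
    return {
        "AlarmName": "stack-p-avr-elasticache-freeablememory",
        "Namespace": "AWS/ElastiCache",
        "MetricName": "FreeableMemory",  # reported by ElastiCache in bytes
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "Statistic": "Average",          # assumption
        "Period": 60,                    # assumption: 1-minute datapoints
        "EvaluationPeriods": 10,         # 10 x 60s = 10 minutes
        "Threshold": float(1024 ** 3),   # assumption: 1GB as 1024^3 bytes
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn], # assumption: SNS topic feeding OpsGenie
    }


def create_alarm(cluster_id, sns_topic_arn):
    """Create or update the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported here so building the parameters needs no AWS setup

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**freeable_memory_alarm(cluster_id, sns_topic_arn))
```

Keeping the parameters in a plain function makes it possible to review or diff the alarm definition without touching AWS; `create_alarm` is the only part that makes a network call.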

davidschober commented 6 years ago

@Toputnal thanks! That's awesome.

mbklein commented 6 years ago

I have looked at AVR's commit history and found that the commit to avoid this situation was never deployed to production. We need a deployment window to roll out the change.

Toputnal commented 6 years ago

If we need to wait more than a day, can you please do a quick write-up in the wiki @mbklein so that if this happens again before the deployment window no one needs to call you? :-)

davidschober commented 6 years ago

@mbklein Molly OK'd deployment Friday, Monday or Tuesday 8-10 am. Let me do some final testing on the canvas integration.

davidschober commented 6 years ago

Kicking off new issue to track deployment.
