Open bqbn opened 4 years ago
There are more logs shared in this folder: https://drive.google.com/open?id=1G1z11bq3kWUC4jd9hTUO5ZGyovID2ZDe
These logs are from the remaining instances of the 5/7/2020 push. The other 5 instances were automatically terminated by AWS because they failed the ASG health check.
I have the feeling that we have an uncaught error in the saga layer (related to the fetchSiteStatus saga/action). It's a good candidate for causing all sorts of problems, including OOMs.
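To illustrate the kind of failure I mean, here is a minimal sketch in plain JavaScript. It does not use redux-saga and is not the actual addons-frontend code: `fetchSiteStatusApi`, `runSaga`, and the action type strings are hypothetical stand-ins. It only shows how an error thrown into a generator-based saga escapes when the saga has no `try`/`catch`, and how wrapping the call contains it.

```javascript
// Hypothetical stand-in for the API call behind fetchSiteStatus.
// It always rejects, to simulate the backend being unavailable.
function fetchSiteStatusApi() {
  return Promise.reject(new Error("API unavailable"));
}

// Saga-like generator without error handling: an error thrown into it
// at the yield point propagates out to whoever is driving the generator.
function* unsafeFetchSiteStatus() {
  const status = yield fetchSiteStatusApi();
  return { type: "LOAD_SITE_STATUS", payload: status };
}

// Same saga with try/catch: the failure becomes an ordinary action
// (hypothetical type name) instead of an uncaught error.
function* safeFetchSiteStatus() {
  try {
    const status = yield fetchSiteStatusApi();
    return { type: "LOAD_SITE_STATUS", payload: status };
  } catch (error) {
    return { type: "SITE_STATUS_FAILED", payload: error.message };
  }
}

// Minimal hypothetical runner: awaits each yielded promise and feeds the
// result, or the error, back into the generator. The real redux-saga
// middleware does far more than this.
async function runSaga(generatorFn) {
  const gen = generatorFn();
  let step = gen.next();
  while (!step.done) {
    let result;
    try {
      result = await step.value;
    } catch (error) {
      // Rethrows out of runSaga if the generator does not catch it.
      step = gen.throw(error);
      continue;
    }
    step = gen.next(result);
  }
  return step.value;
}
```

With this sketch, `runSaga(safeFetchSiteStatus)` resolves to the failure action, while `runSaga(unsafeFetchSiteStatus)` rejects with the original error. Whether something like this is what actually happens in our saga layer still needs to be confirmed against the logs and the core dump.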
We were able to reproduce the issue in the -stage environment. We captured a core dump file and uploaded it to https://drive.google.com/drive/folders/1G1z11bq3kWUC4jd9hTUO5ZGyovID2ZDe.
Old Jira Ticket: https://mozilla-hub.atlassian.net/browse/ADDFRNT-176
Describe the problem and steps to reproduce it:
What happened?
At around 11 pm PDT on 5/13/2020, the AMO website started returning 5XX errors and users could not access the website.
When we logged onto the AWS console and started to investigate, we noticed that 9 out of 10 instances were in an unhealthy state and had been taken out of the ELB.
The auto scaling group (ASG) had already spawned a few new instances because of health check failures, but the application on them crashed too.
We changed the ASG health check type from ELB to EC2 to stop it from repeatedly spawning new instances and terminating old ones. Then we logged onto the remaining old instances and found the following errors.
What did you expect to happen?
We should try to find out why the frontend app crashed and fix the issue.
Anything else we should know?
We also found some errors on addons-server that occurred during the incident. They may be related.