ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License

Claim for crown court defence prod EC2 instance failure #931

Closed jsugarman closed 5 years ago

jsugarman commented 5 years ago

Service name

Claim for crown court defence - CCCD

Service environment

Impact on the service

Unable to deploy the app or update the stack (via template deploy). Only 3 of the 4 instances have the app on them, and I am not sure whether the load balancer will send traffic to the fourth.
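One way to answer the load balancer question is to ask it which instances it considers in service. The sketch below assumes the stack sits behind a Classic Load Balancer (the issue does not say which type is used) and that `boto3` and AWS credentials are available; the load balancer name passed to `fetch_health` is hypothetical.

```python
def summarise_health(response):
    """Group instance IDs by state from a describe_instance_health response."""
    by_state = {}
    for item in response.get("InstanceStates", []):
        by_state.setdefault(item["State"], []).append(item["InstanceId"])
    return by_state


def fetch_health(load_balancer_name):
    """Query a Classic Load Balancer (assumption: the stack uses one)."""
    # Imported lazily so summarise_health can be used without boto3 installed.
    import boto3

    client = boto3.client("elb")
    return client.describe_instance_health(LoadBalancerName=load_balancer_name)


if __name__ == "__main__":
    # "cccd-prod-elb" is a placeholder name, not taken from the issue.
    for state, instance_ids in summarise_health(fetch_health("cccd-prod-elb")).items():
        print(state, len(instance_ids), ", ".join(instance_ids))
```

A Classic Load Balancer only routes requests to instances it reports as `InService`, so an instance listed as `OutOfService` should not be receiving traffic.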

Problem description

I was unable to (fab) update the prod stack yesterday because one instance was having problems (the other environments updated fine). I terminated the offending production (gamma) instance, but the new instance, once up, had even more problems.

I have now reproduced the problem on our disaster environment too: after terminating an instance on disaster, the replacement instance suffers from the same problems.

The primary cause appears to be:

[34.253.228.61] out: ----------
[34.253.228.61] out:           ID: docker-dependencies
[34.253.228.61] out:     Function: pkg.installed
[34.253.228.61] out:       Result: False
[34.253.228.61] out:      Comment: The following packages failed to install/update: linux-image-extra-3.13.0-170-generic. The following packages were already installed: ca-certificates, procps, pciutils.
[34.253.228.61] out:      Started: 07:05:54.263877
[34.253.228.61] out:     Duration: 275.746 ms
[34.253.228.61] out:      Changes:
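When the update output is long, the failing states can be picked out automatically. The following is a minimal sketch, assuming the output has the shape shown above (Fabric's `[host] out:` prefix followed by Salt's `Key: value` result blocks separated by `----------`); the helper names are illustrative and are not part of Fabric or Salt.

```python
import re

# Fabric prefixes each line of remote output with "[host] out: ".
FABRIC_PREFIX = re.compile(r"^\[(?P<host>[^\]]+)\] out: ?")
# Salt result blocks are made of "Key: value" lines.
FIELD = re.compile(r"^\s*(?P<key>[A-Za-z]+):\s?(?P<value>.*)$")

# Sample copied from the report above.
SAMPLE = """\
[34.253.228.61] out: ----------
[34.253.228.61] out:           ID: docker-dependencies
[34.253.228.61] out:     Function: pkg.installed
[34.253.228.61] out:       Result: False
[34.253.228.61] out:      Comment: The following packages failed to install/update: linux-image-extra-3.13.0-170-generic. The following packages were already installed: ca-certificates, procps, pciutils.
[34.253.228.61] out:      Started: 07:05:54.263877
[34.253.228.61] out:     Duration: 275.746 ms
[34.253.228.61] out:      Changes:
"""


def parse_states(output):
    """Split Fabric-prefixed Salt output into one dict per state block."""
    states, current = [], None
    for raw in output.splitlines():
        raw = raw.strip()
        prefix = FABRIC_PREFIX.match(raw)
        if not prefix:
            continue  # not a remote output line
        line = raw[prefix.end():]
        if line.strip().startswith("----------"):
            # A separator line starts a new state block.
            current = {"host": prefix.group("host")}
            states.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current[field.group("key")] = field.group("value").strip()
    return states


def failed_states(output):
    """Return only the state blocks whose Result is False."""
    return [s for s in parse_states(output) if s.get("Result") == "False"]


if __name__ == "__main__":
    for state in failed_states(SAMPLE):
        print(state["host"], state["ID"], state["Comment"])
```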

I note that newly spun-up instances are using a later version of Ubuntu Trusty Tahr, Ubuntu 14.04.6 LTS (GNU/Linux 3.13.0-170-generic x86_64), as opposed to the functioning instances, which are on Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-141-generic x86_64).
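The link between the two observations can be made explicit: the package that fails to install carries the kernel release of the new instances in its name. The sketch below parses the two login banners quoted above and derives the package name from the kernel release. The `linux-image-extra-<kernel release>` rule is an assumption inferred from the failing package name in the log, and the function names are illustrative.

```python
import re

# Matches banners such as
# "Ubuntu 14.04.6 LTS (GNU/Linux 3.13.0-170-generic x86_64)".
BANNER = re.compile(
    r"Ubuntu (?P<release>\S+) LTS \(GNU/Linux (?P<kernel>\S+) (?P<arch>\S+)\)"
)

NEW_BANNER = "Ubuntu 14.04.6 LTS (GNU/Linux 3.13.0-170-generic x86_64)"
OLD_BANNER = "Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-141-generic x86_64)"


def parse_banner(banner):
    """Extract the Ubuntu release, kernel release and architecture."""
    match = BANNER.search(banner)
    if match is None:
        raise ValueError("unrecognised banner: %r" % banner)
    return match.groupdict()


def kernel_abi(kernel):
    """Return the ABI number, e.g. 170 for '3.13.0-170-generic'."""
    return int(kernel.split("-")[1])


def extra_package(kernel):
    """Assumed naming rule, inferred from the failing package in the log."""
    return "linux-image-extra-" + kernel


if __name__ == "__main__":
    for banner in (OLD_BANNER, NEW_BANNER):
        info = parse_banner(banner)
        print(info["release"], info["kernel"], extra_package(info["kernel"]))
```

That the failing package matches the newer kernel is consistent with the newer image being involved, but it does not by itself show why the install fails.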

Contact person

Joel Sugarman

pwyborn commented 5 years ago

Investigated the new instance and found that Docker was not running: 1 instance "out of service", 3 instances "In service".

Tried "salt-call update" on the instance and a Fabric update, each time getting the same message that Joel received regarding the package "linux-image-extra-3.13.0-170-generic".

Tried updating the "docker-deploy" and "moj-docker-deploy" formulas. Still the same message on 1 of the instances; the other 3 were OK. Still 1 instance "out of service", 3 instances "In service".

At 11:29, an alert reported that all of the instances were now "out of service". Investigated this and tried Fab updates, etc.

At 11:59, Colin Bruce managed to bring up 3 of the instances by running "sudo service advocatedefencepayments_container restart".

At 12:17, Lukasz fixed the last instance manually (the dpkg issue), then ran the update once again to make sure the instances have everything installed.