k8s master nodes going down

davidread commented 6 years ago

Yesterday and again this morning, a k8s master node developed problems, which stopped pods from being scheduled. We fixed it on both occasions by recreating the node, but the recurrence suggests an underlying issue.

MikeHmoj commented 6 years ago

Since my first login attempt at about 10am today, I have been unable to get beyond the Analytical Platform login screen. I get the following errors, which Robin believes are related to this issue.

The error I get most persistently is this:

Analytical Platform Control Panel • Signed in as mike.hallard@justice.gov.uk • Sign out

Internal Error

GET /k8s/apis/apps/v1beta2/namespaces/user-mikehmoj/deployments was not permitted

From https://cpanel-master.services.alpha.mojanalytics.xyz/

Two other errors I have had are:

Analytical Platform Control Panel • Signed in as mike.hallard@justice.gov.uk • Sign out

Internal Error

Error: socket hang up

From https://cpanel-master.services.alpha.mojanalytics.xyz/verify-email

Analytical Platform Control Panel • Signed in as mike.hallard@justice.gov.uk • Sign out ? Updated email address

Internal Error

GET /k8s/api/v1/namespaces/user-mikehmoj/pods was not permitted

From https://cpanel-master.services.alpha.mojanalytics.xyz/

davidread commented 6 years ago

@MikeHmoj Thanks for reporting. In light of today's developments, I'm not sure it is connected to this issue after all, so I've opened a fresh ticket: https://github.com/ministryofjustice/analytics-platform/issues/65

davidread commented 6 years ago

Today at 12:39pm we saw an out-of-memory condition on a master:

Events:
  Type     Reason                   Age                From                                                      Message
  ----     ------                   ----               ----                                                      -------
  Normal   NodeNotReady             24m                kubelet, ip-192-168-14-183.eu-west-1.compute.internal     Node ip-192-168-14-183.eu-west-1.compute.internal status is now: NodeNotReady
  Warning  SystemOOM                23m (x6 over 6h)   kubelet, ip-192-168-14-183.eu-west-1.compute.internal     System OOM encountered
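To check how often a node reports this, the events table from `kubectl describe node` can be filtered on the `SystemOOM` reason. This is only a minimal sketch, assuming the Type/Reason column layout shown in the events above; `count_system_oom` is a hypothetical helper name, not part of kubectl.

```shell
#!/bin/sh
# count_system_oom: hypothetical helper that reads `kubectl describe node`
# output on stdin and prints how many event rows are Warning/SystemOOM.
# Assumes the Type and Reason columns come first, as in the events above.
count_system_oom() {
  awk '$1 == "Warning" && $2 == "SystemOOM" { n++ } END { print n + 0 }'
}

# Usage (node name taken from the events above):
#   kubectl describe node ip-192-168-14-183.eu-west-1.compute.internal | count_system_oom
```

Note that kubectl aggregates repeats into a single row (for example `x6 over 6h`), so this counts event rows rather than individual OOM occurrences.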

We think the master machines need more RAM to cope with the growing load.

We've therefore changed them from t2.medium (4 GiB RAM) to m4.xlarge (16 GiB RAM) using kops.
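For anyone repeating this, a resize with kops usually follows an edit, update, rolling-update sequence. The sketch below only prints the commands for review rather than running them. The cluster name and instance group name are placeholders, `print_master_resize_plan` is a hypothetical helper, and it assumes the kops state store is already configured (for example via `KOPS_STATE_STORE`).

```shell
#!/bin/sh
# print_master_resize_plan: hypothetical helper that only prints the kops
# commands for review; it does not run them.
# $1 = cluster name, $2 = master instance group name (both placeholders).
print_master_resize_plan() {
  cluster=$1
  ig=$2
  # Opens the instance group spec in an editor; change machineType there.
  echo "kops edit ig $ig --name $cluster"
  # Applies the changed spec to the cloud resources.
  echo "kops update cluster --name $cluster --yes"
  # Replaces running instances so they pick up the new machine type.
  echo "kops rolling-update cluster --name $cluster --yes"
}

# Example with placeholder names (not the real cluster or instance group):
print_master_resize_plan example.cluster.k8s.local master-eu-west-1a
```

With more than one master instance group, the edit step would need repeating for each group before the update and rolling update.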

davidread commented 6 years ago

We've not had any more problems with the master nodes since then.

etcd also looks healthy, judging by its logs from the past 24h:

stern ".*" -n kube-system -c etcd-container --since=24h
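To make that check a little more repeatable, the captured log output can be scanned for suspicious lines. This is a rough sketch: `count_suspect_lines` is a hypothetical helper, and the keyword list is an assumption rather than an official set of etcd log messages.

```shell
#!/bin/sh
# count_suspect_lines: hypothetical helper that reads log lines on stdin and
# prints how many contain "error", "panic" or "fatal" (case-insensitive).
# The keyword list is an assumption, not an official set of etcd messages.
count_suspect_lines() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps that quiet.
  grep -ciE 'error|panic|fatal' || true
}

# Usage: stern keeps following logs by default, so capture to a file first
# (etcd.log is a placeholder name), stop it, then count:
#   stern ".*" -n kube-system -c etcd-container --since=24h > etcd.log
#   count_suspect_lines < etcd.log
```

A count of zero only means none of these keywords appeared; it is not proof that etcd is healthy.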

Closing.
