medic / upgrade-service-kubernetes

Upgrade service for k8s

gamma.dev upgrades not working #13

Open Hareet opened 1 year ago

Hareet commented 1 year ago

When we click upgrade on gamma.dev, it says `Error triggering upgrade` (which was a known issue), but the container image never changes when we reload the page. Nothing is shown in the upgrade-service-kubernetes logs.

@henokgetachew Please confirm your access to EKS and gamma.dev containers

henokgetachew commented 1 year ago

Re: access, should I have received a message elsewhere? Is the kubeconfig shared somewhere?

Hareet commented 1 year ago

Bottom of this comment: https://github.com/medic/medic-infrastructure/issues/538#issuecomment-1384729760

For admin cluster access, you can just run the `aws eks update-kubeconfig` command.
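For reference, the command looks roughly like this (a sketch only; the region, cluster name, and profile below are placeholders, not the real gamma.dev values):

```sh
# Merge the EKS cluster credentials into your local kubeconfig.
# <region>, <cluster-name> and <profile> are placeholders.
aws eks update-kubeconfig \
  --region <region> \
  --name <cluster-name> \
  --profile <profile>

# Sanity check that the context was added and the cluster is reachable.
kubectl config current-context
kubectl get nodes
```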

henokgetachew commented 1 year ago

@Hareet I'm getting a `Cannot assume STS role` error on this.
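One way to narrow this down is to check which identity the CLI actually resolves to and whether the role can be assumed at all (the account ID and role name below are placeholders, not the real ones):

```sh
# Show which IAM identity the AWS CLI is using with the current credentials.
aws sts get-caller-identity

# Try assuming the cluster-access role directly; the ARN is a placeholder.
aws sts assume-role \
  --role-arn arn:aws:iam::<account-id>:role/<eks-access-role> \
  --role-session-name kubeconfig-test
```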

henokgetachew commented 1 year ago

The error here was caused by how you deployed the upgrade-service deployment. You have to expose the upgrade service to the cluster by deploying a Kubernetes Service. Look at the error log here:

[screenshot: upgrade-service error log]

See here that there is no service for the upgrade service:

[screenshot: service listing with no upgrade-service Service]

I have created the service for it and it is now working.

[screenshot: the newly created upgrade-service Service]
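A minimal Service manifest along these lines is what was missing (the namespace, selector labels, and port below are illustrative, not necessarily the exact values used on gamma.dev):

```yaml
# Sketch of a ClusterIP Service exposing the upgrade-service deployment.
# Namespace, labels, and port are illustrative placeholders.
apiVersion: v1
kind: Service
metadata:
  name: upgrade-service
  namespace: gamma-dev
spec:
  type: ClusterIP
  selector:
    app: upgrade-service
  ports:
    - port: 5008
      targetPort: 5008
      protocol: TCP
```

With a Service in place, other pods in the namespace can reach the upgrade service through its stable Service address rather than a pod IP.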

Confirming that the upgrade is working:

[screenshot: upgrade completing successfully]

Closing this as taken care of.

Hareet commented 1 year ago

@henokgetachew The k8s Service resource was deployed initially; it had been deleted by the time you happened to look at it.

When the node fails over, the upgrade-service fails to bind to the correct port, putting the upgrade-service pod into CrashLoopBackOff. The error from the pod states that the IP address/port is already in use; the IP address is the address of the k8s Service resource. The pod keeps restarting until it eventually binds to the port correctly, but in my experience this has taken upwards of an hour.
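When this happens, the state is easy to spot with standard kubectl checks, for example (the namespace and the `app=upgrade-service` label are illustrative and may not match the actual manifests):

```sh
# Watch the upgrade-service pod cycle through CrashLoopBackOff after a failover.
kubectl -n <namespace> get pods -l app=upgrade-service -w

# The previous container's log should show the "address already in use" bind error.
kubectl -n <namespace> logs deploy/upgrade-service --previous

# Compare the address in that error with the Service's cluster IP.
kubectl -n <namespace> get svc upgrade-service -o wide
```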

This means failovers could lead to longer downtime for the upgrade service. You can test this by:

I'm not sure what the solution would be right now, but the other pods (API, CouchDB) don't end up in this pod-to-Service resource conflict.