medic / upgrade-service-kubernetes

Upgrade service for k8s

gamma.dev upgrades not working #13

Open Hareet opened 1 year ago

Hareet commented 1 year ago

When we click upgrade on gamma.dev, it says `Error triggering upgrade` (which was a known issue), but the container image never changes when we reload the page. Nothing is shown in the upgrade-service-kubernetes logs.

@henokgetachew Please confirm your access to EKS and gamma.dev containers

henokgetachew commented 1 year ago

Re: access, should I have received a message elsewhere? Is the kubeconfig shared somewhere?

Hareet commented 1 year ago

Bottom of this comment: https://github.com/medic/medic-infrastructure/issues/538#issuecomment-1384729760

For admin cluster access, you can just run the `aws eks update-kubeconfig` command.
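For reference, the command looks roughly like this (a sketch only; the region, cluster name, and profile below are placeholders, not the real gamma.dev values):

```sh
# Merge the EKS cluster credentials into your local kubeconfig.
# <region>, <cluster-name> and <profile> are placeholders.
aws eks update-kubeconfig \
  --region <region> \
  --name <cluster-name> \
  --profile <profile>

# Sanity check that the context was added and the cluster is reachable.
kubectl config current-context
kubectl get nodes
```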

henokgetachew commented 1 year ago

@Hareet I'm getting a `Cannot assume STS role` error on this.
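One way to narrow this down is to check which identity the CLI actually resolves to and whether the role can be assumed at all (the account ID and role name below are placeholders, not the real ones):

```sh
# Show which IAM identity the AWS CLI is using with the current credentials.
aws sts get-caller-identity

# Try assuming the cluster-access role directly; the ARN is a placeholder.
aws sts assume-role \
  --role-arn arn:aws:iam::<account-id>:role/<eks-access-role> \
  --role-session-name kubeconfig-test
```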

henokgetachew commented 1 year ago

The error here was caused by how you deployed the upgrade-service deployment. You have to expose the upgrade service to the cluster by deploying a Kubernetes Service. Look at the error log here:

[screenshot: upgrade-service error log]

See here that there is no service for the upgrade service:

[screenshot: service listing with no upgrade-service Service]

I have created the service for it and it is now working.

[screenshot: the newly created upgrade-service Service]
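A minimal Service manifest along these lines is what was missing (the namespace, selector labels, and port below are illustrative, not necessarily the exact values used on gamma.dev):

```yaml
# Sketch of a ClusterIP Service exposing the upgrade-service deployment.
# Namespace, labels, and port are illustrative placeholders.
apiVersion: v1
kind: Service
metadata:
  name: upgrade-service
  namespace: gamma-dev
spec:
  type: ClusterIP
  selector:
    app: upgrade-service
  ports:
    - port: 5008
      targetPort: 5008
      protocol: TCP
```

With a Service in place, other pods in the namespace can reach the upgrade service through its stable Service address rather than a pod IP.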

Confirming that the upgrade is working:

[screenshot: upgrade completing successfully]

Closing this as taken care of.

Hareet commented 1 year ago

@henokgetachew The k8s Service resource was deployed initially; it had been deleted by the time you happened to look at it.

When the node fails over, the upgrade-service fails to bind to the correct port, putting the upgrade-service pod into CrashLoopBackOff. The error from the pod states that the IP address/port is already in use; the IP address is the address of the k8s Service resource. The pod keeps restarting until it eventually binds to the port correctly, but in my experience this has taken upwards of an hour.
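When this happens, the state is easy to spot with standard kubectl checks, for example (the namespace and the `app=upgrade-service` label are illustrative and may not match the actual manifests):

```sh
# Watch the upgrade-service pod cycle through CrashLoopBackOff after a failover.
kubectl -n <namespace> get pods -l app=upgrade-service -w

# The previous container's log should show the "address already in use" bind error.
kubectl -n <namespace> logs deploy/upgrade-service --previous

# Compare the address in that error with the Service's cluster IP.
kubectl -n <namespace> get svc upgrade-service -o wide
```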

This means failovers could lead to longer downtime for the upgrade service. You can test this by:

I'm not sure what the solution would be right now, but the other pods (API, CouchDB) don't end up in this pod-to-Service resource conflict.