vexxhost / atmosphere

Simple & easy private cloud platform featuring VMs, Kubernetes & bare-metal

Failed HelmRelease is not redeployed #479

Open okozachenko1203 opened 1 year ago

okozachenko1203 commented 1 year ago

Context

I hit the following error during an Atmosphere converge (no tags, just the full playbook), run with: poetry run molecule converge -s default

TASK [vexxhost.atmosphere.neutron : Deploy Helm chart] *************************
fatal: [ctl1]: FAILED! => {"changed": false, "command": "/usr/bin/helm upgrade -i --reset-values --create-namespace -f=/tmp/tmpxovxqef8.yml neutron /usr/local/src/neutron", "msg": "Failure when executing Helm command. Exited 1.\nstdout: Release \"neutron\" does not exist. Installing it now.\n\nstderr: Error: secrets \"neutron-etc\" already exists\n", "stderr": "Error: secrets \"neutron-etc\" already exists\n", "stderr_lines": ["Error: secrets \"neutron-etc\" already exists"], "stdout": "Release \"neutron\" does not exist. Installing it now.\n", "stdout_lines": ["Release \"neutron\" does not exist. Installing it now."]}

I logged in to the ctl1 node and checked the HelmRelease status, which was failed.

root@ctl1:/home/ubuntu# helm status neutron
NAME: neutron
LAST DEPLOYED: Wed Jul  5 14:47:41 2023
NAMESPACE: openstack
STATUS: failed
REVISION: 1

I then tried to redeploy only neutron with: poetry run molecule converge -s default -- -- --tags neutron. This time the chart deployment task passed without any error, but the HelmRelease was neither updated nor replaced, so the run eventually failed at the Wait until network service ready task.

TASK [vexxhost.atmosphere.neutron : Deploy Helm chart] *************************
ok: [ctl1]

TASK [Create Ingress] **********************************************************

TASK [vexxhost.atmosphere.openstack_helm_ingress : Create certificate] *********
skipping: [ctl1]

TASK [vexxhost.atmosphere.openstack_helm_ingress : Set fact with wildcard certificate] ***
skipping: [ctl1]

TASK [vexxhost.atmosphere.openstack_helm_ingress : Add ClusterIssuer annotations] ***
ok: [ctl1]

TASK [vexxhost.atmosphere.openstack_helm_ingress : Create Ingress network] *****
changed: [ctl1]

TASK [vexxhost.atmosphere.neutron : Wait until network service ready] **********
# stuck here and timed out

Expected result

The failed Neutron HelmRelease should be redeployed.

Workaround

The failed HelmRelease has to be removed manually before rerunning the playbook.
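For illustration, the manual removal step could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, and it assumes the helm binary is on PATH on the controller and that the release lives in the openstack namespace, as in the helm status output above. By default it only returns the command it would run.

```python
import subprocess
from typing import List


def uninstall_command(release: str, namespace: str) -> List[str]:
    """Build the `helm uninstall` command for a release (hypothetical helper)."""
    return ["helm", "uninstall", release, "--namespace", namespace]


def remove_failed_release(release: str, namespace: str, dry_run: bool = True) -> List[str]:
    """Remove a release; with dry_run=True only return the command without running it."""
    cmd = uninstall_command(release, namespace)
    if not dry_run:
        # Raises CalledProcessError if helm exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: show what would be executed for the failed neutron release.
print(" ".join(remove_failed_release("neutron", "openstack")))
```

After the release is removed, the playbook can be rerun as before.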

Solution?

The Helm chart deployment task is performed by the kubernetes.core.helm module. We need to get the status of the HelmRelease if it already exists and, if it is in a failed state, redeploy it or otherwise recover it.
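A minimal sketch of the status check described here, written outside Ansible for clarity. It assumes the JSON printed by helm status neutron -o json exposes the release state under info.status. The function name and the set of statuses treated as bad are assumptions for this example, not part of the kubernetes.core.helm module.

```python
import json

# Which statuses count as "bad" is an assumption for this sketch.
BAD_STATUSES = {"failed"}


def release_needs_redeploy(status_json: str) -> bool:
    """Return True if the given `helm status -o json` output reports a bad state."""
    release = json.loads(status_json)
    status = release.get("info", {}).get("status", "")
    return status in BAD_STATUSES


# Trimmed-down example mirroring the failed release shown in this issue.
sample = json.dumps({
    "name": "neutron",
    "namespace": "openstack",
    "version": 1,
    "info": {"status": "failed"},
})
print(release_needs_redeploy(sample))
```

A playbook could use a check like this to decide whether to remove or force-replace the release before the Deploy Helm chart task runs.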

mnaser commented 1 year ago

@okozachenko1203 I often run into this too after a bad or failed deployment. I think one solution would be to run the Helm module with force if we detect that the HelmRelease is in a bad state.

Alternatively, we can go and fix this upstream in the Helm module so that it can automatically handle this, since it'll be much faster to handle inside the module rather than adding multiple tasks.

We can also vendor the module into our code until the Kubernetes collection releases a new version.