open-horizon / anax

Horizon agent control system
https://open-horizon.github.io/docs/anax/docs/
Apache License 2.0

Bug: Auto-upgrade failed to rollback on k3s edge cluster #4005

Open dlarson04 opened 4 months ago

dlarson04 commented 4 months ago

Describe the bug.

While running automated tests that upgrade the agent from an anax version to latest, the auto-upgrade failed, and the subsequent rollback failed as well. The rollback failed because the agent pod was stuck in ImagePullBackOff: its initContainer was attempting to pull the image

public.ecr.aws/docker/library/alpine:2.31.0-1495

cronjob logs

2024-02-12 13:12:05 VERBOSE: Dowloading agent deployment to yaml file...
2024-02-12 13:12:05 VERBOSE: Downgrading version from latest to 2.31.0-1495...
2024-02-12 13:12:05 VERBOSE: Deleting current agent deployment...
2024-02-12 13:12:06 VERBOSE: Creating new agent deployment from backup yaml file...
2024-02-12 13:12:07 Waiting up to 75 seconds for the agent deployment to complete...
error: timed out waiting for the condition
2024-02-12 13:13:22 VERBOSE: Setting status to "rollback failed"
jq: error: Could not open file /var/horizon/nmp/ieam-roks-stage-3/nmpAutoUpgrade2-edgecluster-auto-ubuntu-2004-amd64-1-k3s/status.json: No such file or directory
/usr/local/bin/auto-upgrade-cronjob.sh: line 143: /var/horizon/nmp/ieam-roks-stage-3/nmpAutoUpgrade2-edgecluster-auto-ubuntu-2004-amd64-1-k3s/status.json: No such file or directory
cat: /var/horizon/nmp/ieam-roks-stage-3/nmpAutoUpgrade2-edgecluster-auto-ubuntu-2004-amd64-1-k3s/status.json: No such file or directory
jq: error: Could not open file /var/horizon/nmp/ieam-roks-stage-3/nmpAutoUpgrade2-edgecluster-auto-ubuntu-2004-amd64-1-k3s/status.json: No such file or directory
/usr/local/bin/auto-upgrade-cronjob.sh: line 131: /var/horizon/nmp/ieam-roks-stage-3/nmpAutoUpgrade2-edgecluster-auto-ubuntu-2004-amd64-1-k3s/status.json: No such file or directory
2024-02-12 13:13:37 CRONJOB LOGS FOR JOB: auto-upgrade-cronjob-28462393-tgf9x
2024-02-12 13:13:32 cronjob under namesapce: openhorizon-agent

What appears to happen is that this code https://github.com/open-horizon/anax/blob/04ccc1ad399f47a4f1c7ba38a8c990839101af8d/anax-in-k8s/cronjobs/auto-upgrade-cronjob.sh#L276-L280

changed the initContainer's alpine image from public.ecr.aws/docker/library/alpine:latest to public.ecr.aws/docker/library/alpine:2.31.0-1495. That tag doesn't exist, so the rollback failed to restore the agent.
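To illustrate the suspected failure mode, here is a minimal sketch. It does not reproduce the linked script lines; it assumes the downgrade step rewrites the image tag with a pattern broad enough to match every image line ending in :latest. The YAML fragment, the registry name example.registry, the agent image name amd64_anax_k8s, and the variable names are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical deployment fragment: the agent image and the initContainer
# image both carry the ":latest" tag.
yaml='      initContainers:
      - name: init
        image: public.ecr.aws/docker/library/alpine:latest
      containers:
      - name: anax
        image: example.registry/amd64_anax_k8s:latest'

old_tag="latest"
new_tag="2.31.0-1495"

# Broad rewrite (assumed behavior): every line ending in ":latest" is
# changed, so the alpine image also gets the agent version as its tag.
broad=$(printf '%s\n' "$yaml" | sed -e "s#:${old_tag}\$#:${new_tag}#")

# Narrower rewrite: only lines that mention the agent image name are changed.
agent_image_name="amd64_anax_k8s"   # hypothetical image name
narrow=$(printf '%s\n' "$yaml" | sed -e "/${agent_image_name}/ s#:${old_tag}\$#:${new_tag}#")

echo "broad rewrite:"
printf '%s\n' "$broad" | grep 'image:'
echo "narrow rewrite:"
printf '%s\n' "$narrow" | grep 'image:'
```

With the broad pattern the alpine line ends up tagged 2.31.0-1495, which matches the image the pod failed to pull. Restricting the substitution to the agent image line leaves the initContainer image untouched.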

Describe the steps to reproduce the behavior.

No response

Expected behavior.

If an auto-upgrade fails, the agent should roll back successfully.

Screenshots.

No response

Operating Environment

linux k3s

Additional Information

No response