k0sproject / k0s

k0s - The Zero Friction Kubernetes
https://docs.k0sproject.io

Re-applying a failed `Plan` does not re-trigger signalling #4006

Open jnummelin opened 8 months ago

jnummelin commented 8 months ago


Platform

Not relevant

Version

v1.28.5

Sysinfo

`k0s sysinfo`
➡️ Please replace this text with the output of `k0s sysinfo`. ⬅️

What happened?

Applying an autopilot `Plan` with, e.g., a wrong SHA for the k0s binary makes it stall. Even if the user re-applies it with a correct SHA, it remains "stuck" in the failed state.

status:
  commands:
    - id: 0
      k0supdate:
        controllers:
          - lastUpdatedTimestamp: '2024-02-02T09:56:19Z'
            name: master1
            state: SignalPending
          - lastUpdatedTimestamp: '2024-02-02T09:56:19Z'
            name: master2
            state: SignalApplyFailed
          - lastUpdatedTimestamp: '2024-02-02T09:56:19Z'
            name: master3
            state: SignalPending
      state: ApplyFailed
  state: ApplyFailed

Steps to work around the issue:

  1. Delete the old plan object
  2. Delete the autopilot annotation on related node object(s), including ControlNodes
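The two workaround steps above could be scripted roughly as follows. This is only a sketch: the Plan name `autopilot` and the two annotation keys are assumptions not stated in this issue, so check them against `kubectl get controlnode <name> -o yaml` before running anything. The hypothetical helper `recovery_commands` only builds the `kubectl` command lines and does not execute them.

```python
# Assumed annotation keys (not confirmed in this issue); verify them on your
# cluster with `kubectl get controlnode <name> -o yaml` before use.
ASSUMED_ANNOTATIONS = (
    "k0sproject.io/autopilot-signal-version",
    "k0sproject.io/autopilot-signal-data",
)


def recovery_commands(control_nodes, workers=(), plan_name="autopilot"):
    """Build (but do not run) kubectl commands for the manual workaround.

    plan_name defaults to "autopilot", which is an assumption about how the
    Plan object is named.
    """
    # Step 1: delete the old Plan object.
    cmds = [["kubectl", "delete", "plan", plan_name]]
    # Step 2: `kubectl annotate ... <key>-` removes an annotation.
    removals = [key + "-" for key in ASSUMED_ANNOTATIONS]
    for name in control_nodes:
        cmds.append(["kubectl", "annotate", "controlnode", name, *removals])
    for name in workers:
        cmds.append(["kubectl", "annotate", "node", name, *removals])
    return cmds


# Print the commands for the three controllers from the status above.
for cmd in recovery_commands(["master1", "master2", "master3"]):
    print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the names have been verified.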

The docs do not really say anything about how to recover from these kinds of cases. 😞

Steps to reproduce

  1. Apply a Plan with a wrong SHA for the k0s download
  2. Observe the Plan entering the failed state
  3. Re-apply the Plan with the correct SHA
  4. Observe that the Plan is still in the failed state

Expected behavior

Re-applying with a correct SHA should make autopilot proceed. Autopilot should be able to detect that the Plan has been modified and re-trigger the signalling for failed nodes. Maybe we could add the Plan's resourceVersion to the signalling data so that AP can figure out whether it should re-trigger the failed signal?
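To make the resourceVersion idea concrete, here is a minimal sketch of the decision it implies. It is not autopilot's actual code: `SignalRecord`, its `plan_resource_version` field, and `should_resignal` are hypothetical names used only for illustration. The state name comes from the status output above. Since Kubernetes resourceVersion values are opaque strings, the sketch compares them only for equality.

```python
from dataclasses import dataclass

# Failed node state as seen in the status output of this issue.
FAILED_STATES = {"SignalApplyFailed"}


@dataclass
class SignalRecord:
    """Hypothetical per-node record of the last signal that was sent."""
    state: str
    # Hypothetical field: the Plan's resourceVersion when the node was signalled.
    plan_resource_version: str


def should_resignal(record: SignalRecord, current_plan_resource_version: str) -> bool:
    """Return True if a failed node should be signalled again.

    A node is re-signalled only when it is in a failed state and the Plan
    has changed since the failed signal was sent.
    """
    if record.state not in FAILED_STATES:
        return False
    # resourceVersion is opaque: test for inequality, never for ordering.
    return record.plan_resource_version != current_plan_resource_version


failed = SignalRecord(state="SignalApplyFailed", plan_resource_version="1001")
print(should_resignal(failed, "1001"))  # False: the Plan is unchanged
print(should_resignal(failed, "1002"))  # True: the Plan was re-applied
```

With a check like this, re-applying an unchanged Plan would not loop on the same failure, while a corrected Plan would restart signalling for the failed nodes.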

Actual behavior

No response

Screenshots and logs

time="2024-02-02 10:22:41" level=error msg="Unable to download 'https://github.com/k0sproject/k0s/releases/download/v1.29.1+k0s.0/k0s-v1.29.1+k0s.0-amd64': checksum mismatch" component=autopilot controller=ControlNode leadermode=false object=ControlNode reconciler=downloading signalnode=master2

The

Additional context

No response

github-actions[bot] commented 7 months ago

The issue is marked as stale since no activity has been recorded in 30 days
