rancher/fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0

How to debug when OutOfSync or WaitApplied shows up #192

Closed. Shaked closed this issue 1 year ago.

Shaked commented 3 years ago

Hey there!

I'm using Rancher 2.5 and already deploying to different IoT devices (mainly Jetson Xavier, with the one exception of a Jetson TX2).

A few days ago my TX2 had an issue, and since then I have been seeing either WaitApplied or OutOfSync in Rancher's dashboard.

For some reason, up until I fixed my TX2, none of the deployments were happening and I only saw the statuses mentioned above.

When my TX2 was back, it took a few minutes and then all of the deployments on the other devices (Xavier) started running.

Before I file a bug, I'd like to know if there's a way to debug/find logs in rancher/fleet when WaitApplied and OutOfSync happen.

Thank you for this great tool!

Shaked

djkube commented 3 years ago

You need to add a fleet.yaml file to your repo and change the default maximum number of clusters that can be unavailable during an update. For example, the following will allow bundle installation even when most clusters are down:

rolloutStrategy:
  maxUnavailable: 100%
Documentation is here: https://fleet.rancher.io/gitrepo-structure/#fleetyaml

Shaked commented 3 years ago

@djkube

I have tried this but for some reason I still end up with WaitApplied. If I stop the one device that is offline, it changes to OutOfSync and updates do not work.

namespace: example
rolloutStrategy:
  maxUnavailable: 100%
targetCustomizations:
- name: dev-master
  helm:
    values:
      replication: false

Any idea?

Also, is there a way to debug/find logs in this case?

Thank you

EDIT:

I have found that one cluster is not offline but does experience an error because it's missing a secret. I am wondering why this should delay all other clusters from being updated. Any idea?

djkube commented 3 years ago

@Shaked Did you try setting maxUnavailablePartitions too? You can run kubectl describe app appname on the affected cluster to see some info.
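If that does not show enough, here is a rough sketch of other places to look. The resource and namespace names are assumptions based on a default install and may differ with your Fleet version; <bundle-name> is a placeholder:

# On the management cluster: GitRepo and Bundle status usually say why a bundle is stuck
kubectl -n fleet-default get gitrepos
kubectl -n fleet-default get bundles
kubectl -n fleet-default describe bundle <bundle-name>

# BundleDeployments live in per-cluster namespaces on the management cluster
kubectl get bundledeployments -A

# On the affected downstream cluster: the fleet-agent logs
# (the namespace is fleet-system or cattle-fleet-system depending on the Fleet version;
# adjust the label selector if your install uses a different one)
kubectl -n fleet-system logs -l app=fleet-agent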

Shaked commented 3 years ago

@djkube

Did you try setting maxUnavailablePartitions too?

No, I haven't. I assumed that maxUnavailable is good enough.

You can run kubectl describe app appname on the affected cluster to see some info.

I see now, thanks.

I still don't fully understand why a failure of one cluster affects the rest of the clusters' deployments...

djkube commented 3 years ago

@Shaked By default maxUnavailablePartitions is 0, and the clusters are automatically partitioned. The default probably made sense in some use cases, but not in yours (or mine).
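For reference, a minimal fleet.yaml sketch with both settings would look something like this; the 100% values are illustrative, not a recommendation:

namespace: example
rolloutStrategy:
  # Allow every cluster within a partition to be unavailable during a rollout
  maxUnavailable: 100%
  # Also allow every partition to be unavailable, so a single broken
  # cluster does not hold up deployments to the other clusters (default is 0)
  maxUnavailablePartitions: 100%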

Shaked commented 3 years ago

@djkube

Interesting. Gonna test this and come back with my findings. Thank you!

kkaempf commented 1 year ago

No further reports.