Closed Shaked closed 1 year ago
You need to add a fleet.yaml file to your repo and change the default maximum number of servers that can be unavailable during an update. For example, the following allows bundle installation even when most servers are down:

```yaml
rolloutStrategy:
  maxUnavailable: 100%
```
Documentation is here: https://fleet.rancher.io/gitrepo-structure/#fleetyaml
@djkube
I have tried this, but for some reason I still end up with `WaitApplied`. If I stop the one device that is offline, it changes to `OutOfSync` and updates do not work.
```yaml
namespace: example
rolloutStrategy:
  maxUnavailable: 100%
targetCustomizations:
  - name: dev-master
    helm:
      values:
        replication: false
```
Any idea?
Also, is there a way to debug/find logs in this case?
Thank you
EDIT:
I have found that one cluster is not offline but does experience an error because it's missing a secret. I am wondering why this should delay all other clusters from being updated. Any idea?
@Shaked
Did you try setting maxUnavailablePartitions too?
You can run `kubectl describe app appname` on the affected cluster to see some info.
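For deeper debugging, the Fleet controller and agent logs are usually the next place to look. A minimal sketch, assuming the default Fleet namespaces (`fleet-system` for the Fleet components, `fleet-default` for registered downstream clusters; the exact namespace names can differ between Fleet/Rancher versions):

```shell
# On the management cluster: see per-bundle rollout status across clusters
kubectl get bundles -n fleet-default

# Inspect a specific bundle to see why it is stuck in WaitApplied
kubectl describe bundle <bundle-name> -n fleet-default

# Fleet controller logs (management cluster)
kubectl logs -n fleet-system -l app=fleet-controller

# Fleet agent logs (on the affected downstream cluster)
kubectl logs -n fleet-system -l app=fleet-agent
```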
@djkube
Did you try setting maxUnavailablePartitions too?
No, I haven't. I assumed that `maxUnavailable` was good enough.
You can run kubectl describe app appname on the affected cluster to see some info.
I see now, thanks.
I still don't fully understand why a failure of one cluster affects the rest of the clusters' deployments...
@Shaked By default maxUnavailablePartitions is 0, and the clusters are automatically partitioned. The default probably made sense in some use cases, but not in yours (or mine).
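For reference, a sketch of how both fields could be combined in fleet.yaml so that neither unavailable clusters nor unavailable partitions block the rollout (field names taken from the Fleet rolloutStrategy documentation; verify against your Fleet version):

```yaml
rolloutStrategy:
  # Allow the bundle to roll out even if every cluster is unavailable
  maxUnavailable: 100%
  # Likewise, don't let an unavailable partition block the other partitions
  maxUnavailablePartitions: 100%
```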
@djkube
Interesting. Gonna test this and come back with my findings. Thank you!
Closing, as there are no further reports.
Hey there!
I'm using Rancher 2.5, already deploying to different IoT devices (mainly Jetson Xavier, with one exception: a Jetson TX2).
A few days ago my TX2 had an issue, and since then I was seeing either `WaitApplied` or `OutOfSync` in Rancher's dashboard. For some reason, until I fixed my TX2, none of the deployments were happening and I only saw the statuses mentioned above. When my TX2 was back, it took a few minutes and then all of the deployments on the other devices (Xavier) started running.

Before I file a bug, I'd like to know if there's a way to debug/find logs in rancher/fleet when `WaitApplied` and `OutOfSync` happen. Thank you for this great tool!
Shaked