Closed Dimss closed 3 years ago
The operator-sdk Helm operator has the same problem with an infinite reconciliation loop on Kubernetes 1.18; might it share the same root cause?
I was able to recreate this, thanks to the excellent example project. One thing I've noticed between the good and the bad logs is as follows.
good
The good run has two playbook_on_stats events, which print the following. Notice the changed value: the first run changed 1.
--------------------------- Ansible Task Status Event StdOut -----------------
PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
On the second reconcile it changed 0:
--------------------------- Ansible Task Status Event StdOut -----------------
PLAY RECAP *********************************************************************
localhost : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
bad
Contrast that with the bad run. On its first run it changed 2. Nothing to worry about there; it is the first reconcile.
--------------------------- Ansible Task Status Event StdOut -----------------
PLAY RECAP *********************************************************************
localhost : ok=2 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Let's check out the second reconcile. Hrm, looks like it too changed 2.
--------------------------- Ansible Task Status Event StdOut -----------------
PLAY RECAP *********************************************************************
localhost : ok=2 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Now it seems we have a huge problem :) On the SEVENTH reconcile it still changed 2.
--------------------------- Ansible Task Status Event StdOut -----------------
PLAY RECAP *********************************************************************
localhost : ok=2 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
I'm still debugging the issue but wanted to add an update to what I've found so far.
So far I haven't found a bug in the ansible-operator itself; it is responding to events that are causing the reconcile to continue. There might be a problem with how the k8s modules are handling changes.
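To make the change-detection suspicion concrete, here is a minimal Python sketch (not operator-sdk or kubernetes.core code; the objects are made up) of how API-server defaulting can make a naive desired-vs-live comparison report a change on every reconcile. One real example of such defaulting: a toleration's `operator` field defaults to `Equal` when omitted, so the live object never equals the submitted manifest field-for-field.

```python
# Illustration only: why a naive "desired vs. live" diff never converges.
# The API server fills in omitted fields (here: a toleration's `operator`,
# which defaults to "Equal"), so the stored object always differs from
# the manifest the operator keeps submitting.

desired = {"tolerations": [{"key": "node-role", "effect": "NoSchedule"}]}

def server_apply(obj):
    """Mimic API-server defaulting on tolerations."""
    live = {"tolerations": []}
    for t in obj["tolerations"]:
        live["tolerations"].append({**t, "operator": t.get("operator", "Equal")})
    return live

live = server_apply(desired)

# A full-object equality check sees a diff forever -> changed on every run:
print(desired == live)   # False

# Comparing only the fields the user actually set is stable:
subset_equal = all(
    all(t_live.get(k) == v for k, v in t_want.items())
    for t_want, t_live in zip(desired["tolerations"], live["tolerations"])
)
print(subset_equal)      # True
```

A module that diffs the full objects will report `changed` on every reconcile; one that compares only the user-set fields converges after the first run.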
I bumped this to v1.5.0.
Bug Report
What did you do?
A simple operator which deploys two manifests: a static pod and a CRD.
What did you expect to see?
The operator successfully finishes the reconciliation loop.
What did you see instead? Under which circumstances?
The operator is stuck in an infinite reconciliation loop.
Environment
Kubernetes v1.19.3, Operator SDK v1.2.0
/language ansible
Kubernetes cluster type: vanilla
$ operator-sdk version
operator-sdk version: "v1.2.0", commit: "215fc50b2d4acc7d92b36828f42d7d1ae212015c", kubernetes version: "v1.18.8", go version: "go1.15.3", GOOS: "darwin", GOARCH: "amd64"
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-16T00:04:31Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-16T20:43:08Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Possible Solution
It happens with a toleration, or when I deploy the CRD (by the operator) with status.
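The status hint is plausible: when a CRD enables the status subresource, the API server ignores the `status` stanza on regular create/update requests, so a CR manifest that carries `status` can never converge with the live object. A minimal sketch of that effect (hypothetical data, not operator code):

```python
# Illustration only: with the status subresource enabled, `status` in a
# regular create/update request is dropped by the API server; it is only
# writable via the /status endpoint. A manifest that includes `status`
# therefore never matches the live object, and every reconcile reports a
# change again.

manifest = {"spec": {"replicas": 1}, "status": {"phase": "Ready"}}

def server_create(obj, status_subresource=True):
    """Mimic the API server storing a CR when /status is a subresource."""
    live = dict(obj)
    if status_subresource:
        live.pop("status", None)  # status is ignored on the main resource
    return live

live = server_create(manifest)
print("status" in live)   # False: the submitted status was dropped
print(manifest == live)   # False on every reconcile -> changed again
```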
Deploy the bad CR and watch the operator logs; the operator will get stuck in an infinite reconciliation loop:
kubectl create -f https://raw.githubusercontent.com/Dimss/infinite-loop-operator/main/config/samples/bad.yaml
Remove the bad CR:
kubectl delete -f https://raw.githubusercontent.com/Dimss/infinite-loop-operator/main/config/samples/bad.yaml
Manually remove the test CRD:
kubectl delete crd testcrds.mlops.cnvrg.io
Deploy the good CR:
kubectl create -f https://raw.githubusercontent.com/Dimss/infinite-loop-operator/main/config/samples/good.yaml