operator-framework / operator-sdk

SDK for building Kubernetes applications. Provides high level APIs, useful abstractions, and project scaffolding.
https://sdk.operatorframework.io
Apache License 2.0

Ansible Operator stuck in infinite reconciliation loop on K8S >= 1.18.x #4258

Closed Dimss closed 3 years ago

Dimss commented 3 years ago

Bug Report

What did you do?

A simple operator that deploys two manifests: a static pod and a CRD.

What did you expect to see?

The operator successfully finishes the reconciliation loop.

What did you see instead? Under which circumstances?

The operator is stuck in an infinite reconciliation loop.

Environment

K8S v1.19.3, Operator SDK 1.2.0

/language ansible

Kubernetes cluster type: vanilla

$ operator-sdk version

operator-sdk version: "v1.2.0", commit: "215fc50b2d4acc7d92b36828f42d7d1ae212015c", kubernetes version: "v1.18.8", go version: "go1.15.3", GOOS: "darwin", GOARCH: "amd64"

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-16T00:04:31Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-16T20:43:08Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

  1. It appears that on K8S versions < 1.18, the same code works as expected.
  2. The problem happens (in my case) only when I deploy either a static pod with a toleration, or a CRD (deployed by the operator) with a status.
  3. I've created a sample operator that easily reproduces the issue:
    
    # deploy the operator
    kubectl create -f https://raw.githubusercontent.com/Dimss/infinite-loop-operator/main/bundle.yaml

    # deploy the bad CR and watch the operator logs; the operator will get stuck in an infinite reconciliation loop
    kubectl create -f https://raw.githubusercontent.com/Dimss/infinite-loop-operator/main/config/samples/bad.yaml

    # remove the bad CR
    kubectl delete -f https://raw.githubusercontent.com/Dimss/infinite-loop-operator/main/config/samples/bad.yaml

    # manually remove the test CRD
    kubectl delete crd testcrds.mlops.cnvrg.io

    # deploy the good CR
    kubectl create -f https://raw.githubusercontent.com/Dimss/infinite-loop-operator/main/config/samples/good.yaml


Additional context

The exact same flow works without any issue on K8S v1.17.13.
pre commented 3 years ago

The operator-sdk Helm operator has the same infinite reconciliation loop problem on Kubernetes 1.18; perhaps they share the same root cause?

jmrodri commented 3 years ago

I was able to recreate this; thanks for the excellent example project. One thing I've noticed between the good and the bad logs is as follows.

good

The good log has two playbook_on_stats events, which print the following. Notice the changed value: the first run changed 1.

--------------------------- Ansible Task Status Event StdOut  -----------------

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

On the second reconcile it changed 0:

--------------------------- Ansible Task Status Event StdOut  -----------------

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
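The changed count is how Ansible reports idempotence: once a reconcile finishes with changed=0, the operator has no reason to keep requeueing. As a sketch only (the resource names here are hypothetical, not from the reproducer project), a fully specified task like the following would normally report changed=1 on the first run and changed=0 afterwards:

```yaml
# Hypothetical role task, not taken from the reproducer project.
# A fully specified definition like this should already match the live
# object on the second run, so the k8s module reports changed=0 and the
# reconcile loop settles.
- name: Ensure the static pod exists
  community.kubernetes.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: Pod
      metadata:
        name: example-static-pod   # hypothetical name
        namespace: default
      spec:
        containers:
          - name: app
            image: busybox:1.32
            command: ["sleep", "3600"]
```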

bad

Contrast that with the bad log. On its first run it changed 2. Nothing to worry about there; it is the first reconcile.

--------------------------- Ansible Task Status Event StdOut  -----------------

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0 

Let's check out the second reconcile. Hrm, it looks like it too changed 2.

--------------------------- Ansible Task Status Event StdOut  -----------------

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

Now it seems we have a huge problem :) On the SEVENTH reconcile it still changed 2.

--------------------------- Ansible Task Status Event StdOut  -----------------

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

I'm still debugging the issue but wanted to add an update to what I've found so far.

jmrodri commented 3 years ago

So far I haven't found a bug in the ansible-operator itself; it is responding to events that keep triggering new reconciles. There might be a problem with how the k8s modules are handling changes.
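To illustrate the suspected mechanism (this is only a sketch of the hypothesis above, not a confirmed diagnosis, and the resource names are hypothetical): when a definition contains fields the API server mutates or defaults on admission, such as tolerations, a module-side comparison of the submitted definition against the live object can see a difference on every run and report changed each time, which in turn keeps the reconcile loop going:

```yaml
# Hypothetical illustration of the suspected non-idempotence.
# The API server appends default tolerations (e.g.
# node.kubernetes.io/not-ready) to the stored pod, so the live object
# never exactly matches this definition, and the k8s module may report
# changed=1 on every reconcile.
- name: Ensure pod with a toleration
  community.kubernetes.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: Pod
      metadata:
        name: example-toleration-pod   # hypothetical name
        namespace: default
      spec:
        tolerations:
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
        containers:
          - name: app
            image: busybox:1.32
```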

jmrodri commented 3 years ago

I bumped this to v1.5.0.