operator-framework / ansible-operator-plugins

Experimental extraction/refactoring of the Operator SDK's ansible operator plugin
Apache License 2.0

Selector on watches.yaml not honoured #31

Open anupchandak opened 1 year ago

anupchandak commented 1 year ago

To control the scope of an operator in an environment shared by several developers, I have defined a selector in watches.yaml, following the reference linked here.

The selector looks like the following (shown with equivalent dummy values):

- version: v1
  group: mytest.com
  kind: MyKind
  snakeCaseParameters: False
  playbook: playbooks/create.yml
  finalizer:
    name: myTest.com/finalizer
    playbook: playbooks/purge.yml
  selector:
    matchExpressions:
      - key: mytest.com/controller-namespace
        operator: In
        values: 
          - "my-test-na"

When I start my ansible runner, I see the following log line at startup, as expected:

{"level":"info","ts":1676367873.856818,"logger":"cmd","msg":"Watch namespaces not configured by environment variable WATCH_NAMESPACE or file. Watching all namespaces.","Namespace":""}

I expect that my operator will still ignore (not watch) a CR carrying the label mytest.com/controller-namespace=your-test-na, because that value is not listed in the selector. But it does watch that CR and reconciles it.
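To make the expectation concrete, here is a minimal Python sketch of how a Kubernetes-style label selector (matchLabels plus matchExpressions) is evaluated against an object's labels. The function name `matches` and the dict layout are illustrative assumptions; this is not the operator's code (which is written in Go), only a model of the standard set-based selector semantics.

```python
def matches(selector, labels):
    """Return True if `labels` satisfies every requirement in `selector`.

    Hypothetical helper modelling Kubernetes set-based label selectors.
    All requirements are ANDed together.
    """
    # matchLabels entries are plain equality requirements.
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False

    for expr in selector.get("matchExpressions", []):
        key = expr["key"]
        op = expr["operator"]
        values = expr.get("values", [])
        if op == "In":
            # The key must be present and its value must be listed.
            if key not in labels or labels[key] not in values:
                return False
        elif op == "NotIn":
            # A missing key satisfies NotIn; a listed value does not.
            if key in labels and labels[key] in values:
                return False
        elif op == "Exists":
            if key not in labels:
                return False
        elif op == "DoesNotExist":
            if key in labels:
                return False
        else:
            raise ValueError(f"unsupported operator: {op}")
    return True


# Same shape as the selector in watches.yaml above (dummy values).
selector = {
    "matchExpressions": [
        {
            "key": "mytest.com/controller-namespace",
            "operator": "In",
            "values": ["my-test-na"],
        }
    ]
}

print(matches(selector, {"mytest.com/controller-namespace": "my-test-na"}))    # True
print(matches(selector, {"mytest.com/controller-namespace": "your-test-na"}))  # False
print(matches(selector, {}))                                                   # False
```

Under these semantics, a CR labelled your-test-na does not satisfy the selector, so events for it would be expected to be filtered out.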

It is an Ansible-based operator, and the environment details are as follows:

% ansible --version
/usr/local/lib/python3.9/site-packages/paramiko/transport.py:236: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
ansible [core 2.13.5]
  config file = /Users/anupchandak/ansible-profiler.cfg
  configured module search path = ['/Users/anupchandak/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /Users/anupchandak/Library/Python/3.9/lib/python/site-packages/ansible
  ansible collection location = /Users/anupchandak/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.9.14 (main, Sep  6 2022, 23:29:09) [Clang 13.1.6 (clang-1316.0.21.2.5)]
  jinja version = 3.1.2
  libyaml = True

varshaprasad96 commented 1 year ago

@jberkhahn The thread regarding this issue: https://mail.google.com/mail/u/0/#search/ansible/FMfcgzGrcXtllNJlqSFwVfvsJVrwzQjw

@anupchandak Could you please share your controller pod logs or the project, so that we are able to run it locally and check the issue? The selectors should be working as expected by creating predicates; looking at the logs may help us dig into it more.

anupchandak commented 1 year ago

@varshaprasad96 - I tried creating a sample project using the Memcached example but was not able to reproduce the above issue.

I cannot share my work project because of copyright restrictions.

Any pointers on how I can check what ends up on the operator's watch list when it starts, and which selector it is applying?

anupchandak commented 1 year ago

Is there any way to know which dependent resource changed and triggered the operator's reconciliation loop?

varshaprasad96 commented 1 year ago

The other option is to add extra logging to the ansible-operator binary and try it out locally to see what is happening. Some pointers:

  1. I'd start by checking whether watches.yaml is being parsed as expected, that is, whether the selectors are being parsed and loaded from the watches file, which happens here: https://github.com/operator-framework/operator-sdk/blob/5cbdad9209332043b7c730856b6302edc8996faf/internal/ansible/watches/watches.go#L313
  2. This is where predicates are set up based on labels: https://github.com/operator-framework/operator-sdk/blob/d828db26e4c0377e8423bfbdafa36449a971f05a/internal/ansible/controller/controller.go#L115. Checking whether the predicates are being created successfully here would be helpful.
  3. The above two steps should help in digging into the issue. If not, I would go a step further and try to replicate this method (https://github.com/kubernetes-sigs/controller-runtime/blob/b9940edaaafe3f0292d6be43b362852aab079369/pkg/predicate/predicate.go#L375), which is where predicates are created according to labels. That would help in checking whether the labels are in the right format and whether the predicate func behaves as expected.
  4. This is where the ansible controller's logic is written (https://github.com/operator-framework/operator-sdk/blob/d828db26e4c0377e8423bfbdafa36449a971f05a/internal/cmd/ansible-operator/run/cmd.go#L89). Digging into the logs to check which events are being received and which requests are triggering the reconciler would be helpful.

You may have to build the binary locally to test it out. The steps are here: https://sdk.operatorframework.io/docs/contribution-guidelines/developer-guide/.

Before all this, I would suggest increasing the log verbosity and checking whether anything suspicious indicates that the labels haven't been set up as expected. Hope this helps!

anupchandak commented 1 year ago

@varshaprasad96 - Thank you so much for your detailed reply above.

Sorry for the late reply but I think I am able to reproduce the issue. I think it's because of the dependent resource CronJob created by the CR.

Please use the attached project and follow the steps below to reproduce the issue.

  1. Copy the project locally.
  2. Install the CRD with make install.
  3. Start the operator locally with ansible-operator run local --zap-devel=true.
  4. Create the first CR in the apple namespace. This will create a deployment object and a CronJob in the suspended state.
    kubectl create namespace apple
    kubectl config set-context --current --namespace=apple
    kubectl --namespace apple create -f config/samples/apple_sample.yaml
  5. Create the second CR in the banana namespace. This will create a deployment object and a CronJob in the non-suspended state.
    kubectl create namespace banana
    kubectl config set-context --current --namespace=banana
    kubectl --namespace banana create -f config/samples/banana_sample.yaml
  6. Now, stop the operator and modify the watches.yaml to only select resources from the apple namespace.
    selector:
      matchExpressions:
        - key: cache.example.com/controller-namespace
          operator: In
          values: [apple]
  7. Restart the operator with ansible-operator run local --zap-devel=true.
  8. You should see that the operator reconciles whenever a CronJob in the banana namespace is triggered, even though the selector in watches.yaml is configured to select only resources from the apple namespace.
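One possible reading of these steps, not confirmed anywhere in this thread, is that the selector filters events only on the watch for the primary resource (the CR), while events from dependent resources such as the CronJob are mapped to their owner and enqueued without passing through that filter. The Python sketch below is a toy model of that hypothesis only. The names `primary_event` and `dependent_event`, the CR names, the dict-based objects, and the assumption that each CR is labelled with its own namespace are all made up for illustration and do not correspond to operator-sdk code.

```python
# Toy model (hypothetical): selector applied to primary events only.
SELECTOR_KEY = "cache.example.com/controller-namespace"
ALLOWED_VALUES = {"apple"}  # mirrors `values: [apple]` in watches.yaml


def selected(cr):
    """True if the CR's label value is allowed by the selector."""
    return cr["labels"].get(SELECTOR_KEY) in ALLOWED_VALUES


def primary_event(cr, queue):
    """Event on the CR itself: filtered by the selector."""
    if selected(cr):
        queue.append((cr["namespace"], cr["name"]))


def dependent_event(dependent, queue):
    """Event on an owned resource: mapped to its owner.

    In this model (an assumption) the selector is NOT consulted here.
    """
    owner = dependent["owner"]
    queue.append((owner["namespace"], owner["name"]))


# Assumed labels: each sample CR carries its own namespace as the value.
apple_cr = {
    "namespace": "apple",
    "name": "apple-sample",
    "labels": {SELECTOR_KEY: "apple"},
}
banana_cr = {
    "namespace": "banana",
    "name": "banana-sample",
    "labels": {SELECTOR_KEY: "banana"},
}
banana_cronjob = {"kind": "CronJob", "owner": banana_cr}

queue = []
primary_event(apple_cr, queue)           # enqueued: label matches
primary_event(banana_cr, queue)          # dropped: label does not match
dependent_event(banana_cronjob, queue)   # enqueued: no selector check
print(queue)
# [('apple', 'apple-sample'), ('banana', 'banana-sample')]
```

If this model reflects what happens, the banana CR reaches the reconciler only through the dependent path. Logging which path produced each reconcile request in a locally built binary would confirm or rule this out.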

I have also attached logs from my local execution. Please note that, to restrict the logs to the test namespaces only, I had set export WATCH_NAMESPACE=apple,banana.

memcached-operator.zip reconcile_log.txt

Thank you!

anupchandak commented 1 year ago

Team - Any comment/update on this issue?

anupchandak commented 1 year ago

Hi Team - Have you had a chance to look at this issue?

openshift-bot commented 10 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 9 months ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale