operator-framework / operator-sdk

SDK for building Kubernetes applications. Provides high level APIs, useful abstractions, and project scaffolding.
https://sdk.operatorframework.io
Apache License 2.0

Operator behaving differently running in cluster compared to out of cluster #6678

Closed - coillteoir closed this issue 3 weeks ago

coillteoir commented 4 months ago

Type of question

General operator-related help

Question

I am creating an operator to work with a CI/CD system. When I run it locally, it creates pods as expected. But when I deploy it to the cluster, it fails to detect that a pod has already been created and creates multiple pods for the same "task".

Pipeline Spec: [screenshot]

Locally, using make run: [screenshot]

In cluster, after pushing the Docker image and using make deploy: [screenshot]

What did you do?

To run individual tasks in a pipeline, I wrote a function which uses DFS to walk a tree data structure, checking the status of a task's child pods before generating a new pod for that task. The operator then loops over the generated list of pods and creates them in the cluster. [screenshot]
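
For illustration, the shape of that function is roughly this (a minimal sketch, not the actual bramble code; the Task type, its fields, and the pod spec are all placeholders):

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Task is a hypothetical stand-in for the tree node type in bramble.
type Task struct {
	Name     string
	Children []*Task
	Pod      *corev1.Pod // the pod already created for this task, if any
}

// collectRunnablePods walks the task tree depth-first and returns a pod
// for every task whose children have all succeeded and which has no pod yet.
func collectRunnablePods(root *Task) []*corev1.Pod {
	var pods []*corev1.Pod
	var visit func(*Task)
	visit = func(t *Task) {
		ready := true
		for _, c := range t.Children {
			visit(c)
			if c.Pod == nil || c.Pod.Status.Phase != corev1.PodSucceeded {
				ready = false
			}
		}
		if ready && t.Pod == nil {
			pods = append(pods, &corev1.Pod{
				ObjectMeta: metav1.ObjectMeta{Name: t.Name},
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers:    []corev1.Container{{Name: "task", Image: "busybox"}},
				},
			})
		}
	}
	visit(root)
	return pods
}
```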

What did you expect to see?

The correct number of pods being created.

What did you see instead? Under which circumstances?

Multiple pods being created and the pipeline not being validated.

Environment

Operator type:

/language go

Kubernetes cluster type: kind and minikube (see Additional context)

$ operator-sdk version

1.33

$ go version

1.22

$ kubectl version

1.29

Additional context

Current branch for the bug: https://github.com/coillteoir/bramble/tree/develop (in the execution group of controllers). The issue occurs in both kind and minikube.

coillteoir commented 4 months ago

I'm unsure of where to start with this issue, in particular whether it's a bug in my code or in an upstream library such as controller-runtime.

jberkhahn commented 4 months ago

So, reconciliation loops aren't run in a deterministic manner - the same object can be reconciled multiple times in quick succession, and a reconcile may be working from a slightly stale cache - which is why it's always a good idea to check the state of the system before trying to modify it. It looks like you're just always firing off this function that tries to create a bunch of pods.
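
Something like the following inside Reconcile, i.e. look each pod up before creating it (a sketch assuming a controller-runtime client; the helper name is illustrative):

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensurePods creates each desired pod only if it does not already exist.
// The informer cache can lag behind writes, so an AlreadyExists error
// from Create is also treated as success rather than failure.
func ensurePods(ctx context.Context, c client.Client, desired []*corev1.Pod) error {
	for _, pod := range desired {
		var existing corev1.Pod
		key := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
		err := c.Get(ctx, key, &existing)
		if err == nil {
			continue // pod is already there; nothing to do
		}
		if !apierrors.IsNotFound(err) {
			return err // a real error; let the reconcile retry
		}
		if err := c.Create(ctx, pod); err != nil && !apierrors.IsAlreadyExists(err) {
			return err
		}
	}
	return nil
}
```

Note this only helps if each task's pod name is deterministic; if names are generated fresh on every reconcile, the Get has nothing to match and duplicates are exactly what you'd see.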

Not sure why you're experiencing different behavior on/off cluster, though. It might just be due to the increased latency meaning you're getting fewer controller loops firing, or something along those lines.

coillteoir commented 4 months ago

Just curious, is controller-runtime synchronous or does it use goroutines under the hood? And if it does, would there be a way to force my reconcile loop to wait for the controller to finish provisioning/getting resources before continuing?
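
For reference, controller-runtime dispatches reconciles to worker goroutines, and the per-controller worker count is set by MaxConcurrentReconciles, which defaults to 1; the workqueue also never hands the same object to two workers at once. A sketch of where that option is wired, assuming the standard builder API (ExecutionReconciler and Execution are placeholders for bramble's own types):

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// SetupWithManager wires the reconciler into the manager.
// MaxConcurrentReconciles defaults to 1, so a single controller
// already processes its reconciles serially.
func (r *ExecutionReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&Execution{}).
		Owns(&corev1.Pod{}). // events for owned pods requeue the parent Execution
		WithOptions(controller.Options{MaxConcurrentReconciles: 1}).
		Complete(r)
}
```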

openshift-bot commented 1 month ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 3 weeks ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

/remove-lifecycle stale

coillteoir commented 3 weeks ago

/close

openshift-ci[bot] commented 3 weeks ago

@coillteoir: Closing this issue.

In response to [this](https://github.com/operator-framework/operator-sdk/issues/6678#issuecomment-2167895348):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.