operator-framework / operator-sdk

SDK for building Kubernetes applications. Provides high level APIs, useful abstractions, and project scaffolding.
https://sdk.operatorframework.io
Apache License 2.0

Unable to deploy CNF when some metrics apiservices are in False or FailedDiscoveryCheck state #6782

Open ansvu opened 1 month ago

ansvu commented 1 month ago

Type of question

Question

What did you do?

A partner is using ansible-operator 1.34-2. When they tried to deploy their CNF, the following error occurred:

2024-07-05 06:25:14,241 p=17 u=ansible n=ansible | fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to create object: b'Unable to determine if virtual resource\\n'", "reason": "Internal Server Error"}

This is the task in the Ansible code that makes the API call:

- name: Store CNF status and data
  k8s:
    api_version: "{{ cnf_resource_api_map['ConfigMap'] }}"
    kind: ConfigMap
    state: present
    namespace: '{{ ansible_operator_meta.namespace }}'
    name: cnf-info-data

They noticed these two apiservices are in False or FailedDiscoveryCheck state:

v1beta1.custom.metrics.k8s.io                  kube-system/prometheus-adapter              False (FailedDiscoveryCheck)   118s
v1beta1.metrics.k8s.io                         gke-managed-metrics-server/metrics-server   False (FailedDiscoveryCheck)   38d

If they removed these two apiservices then the CNF deployment worked fine.
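
For triage, it may help to list every apiservice that is not currently available before deciding which ones to repair or remove. The following is a minimal sketch, not part of the original report: it assumes the input is the JSON produced by `kubectl get apiservices -o json` and reads the standard `Available` condition from each item's `status.conditions`. The function name, the `SAMPLE` data, and the `NoAvailableCondition` label are hypothetical and used only for illustration.

```python
import json


def unavailable_apiservices(apiservice_list_json):
    """Return (name, reason) pairs for APIServices whose Available condition is not "True"."""
    items = json.loads(apiservice_list_json).get("items", [])
    unavailable = []
    for item in items:
        name = item["metadata"]["name"]
        conditions = item.get("status", {}).get("conditions", [])
        # Pick the "Available" condition, if the API server has reported one.
        available = next(
            (c for c in conditions if c.get("type") == "Available"), None
        )
        if available is None:
            # "NoAvailableCondition" is a made-up label for this sketch.
            unavailable.append((name, "NoAvailableCondition"))
        elif available.get("status") != "True":
            unavailable.append((name, available.get("reason", "Unknown")))
    return unavailable


# Hypothetical sample shaped like `kubectl get apiservices -o json` output.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "v1.apps"},
            "status": {"conditions": [
                {"type": "Available", "status": "True", "reason": "Local"}
            ]},
        },
        {
            "metadata": {"name": "v1beta1.custom.metrics.k8s.io"},
            "status": {"conditions": [
                {"type": "Available", "status": "False",
                 "reason": "FailedDiscoveryCheck"}
            ]},
        },
        {
            "metadata": {"name": "v1beta1.metrics.k8s.io"},
            "status": {"conditions": [
                {"type": "Available", "status": "False",
                 "reason": "FailedDiscoveryCheck"}
            ]},
        },
    ]
})

if __name__ == "__main__":
    for name, reason in unavailable_apiservices(SAMPLE):
        print(f"{name}: {reason}")
```

Against a real cluster, the same function could be fed the text captured from `kubectl get apiservices -o json` instead of `SAMPLE`.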

They said that they did not observe any error with ansible-operator v1.31 when some apiservices were in False state. Are there any changes in ansible-operator v1.34.2 that trigger this issue? Do all apiservices now need to be in True state?

What did you expect to see?

CNF to be deployed without this error: "Failed to create object: b'Unable to determine if virtual resource\\n'"

What did you see instead? Under which circumstances?

Any Ansible task that the operator runs through the Ansible k8s module throws the error.

Environment

Operator type:

ansible-operator 1.34-2

Kubernetes cluster type:

Google GKE

$ operator-sdk version

ansible-operator 1.34-2

$ go version (if language is Go)

NA

$ kubectl version v1.29.3

Additional context

Some existing issues have been reported, but they offer no solution other than advice to fix the cluster health or remove the failing apiservices.
https://access.redhat.com/solutions/6813781

https://bugzilla.redhat.com/show_bug.cgi?id=2063774

https://github.com/operator-framework/operator-sdk/issues/5596

https://github.com/operator-framework/operator-sdk/pull/6222

acornett21 commented 1 month ago

Hi @ansvu, what version of operator-sdk is being used? There were many issues in the 1.34 series of both operator-sdk and the Ansible plugin. I'd consider anything other than the latest of both to have potential issues. Can they test with 1.35.0 of operator-sdk? It contains 1.34.3 of the Ansible plugin.

ansvu commented 1 month ago

Thanks @acornett21 for the info. They used the ansible-operator version from the community. Do you mean this version: quay.io/operator-framework/ansible-operator:v1.34.3?

acornett21 commented 1 month ago

@ansvu Are they just updating the image and not updating the version of the binary needed to scaffold/build a project?

ansvu commented 1 month ago

@acornett21 This CNF is a bit special: it combines a Helm chart with ansible-operator, and they use the ansible-operator image straight from quay.io/operator-framework/ansible-operator. There is no OLM integration. It is designed and architected not only for OCP but for other Kubernetes clusters as well.

acornett21 commented 1 month ago

@ansvu I understand, but they have to have some YAML manifests that go along with the Ansible operator. So wouldn't they be using operator-sdk to build/bundle/etc. those manifests? If so, those versions should be in sync.

ansvu commented 1 month ago

Hi @acornett21, as far as I know they don't use operator-sdk to build the ansible-operator image (bundle, etc.); they use the ansible-operator image from quay.io/operator-framework/ansible-operator. We have asked them how they maintain or modify the build/bundle/manifests. We just noticed that quay.io/operator-framework/ansible-operator:v1.35.0 was built 2 hours ago. Can they test with this version 1.35.0? Thanks.

ansvu commented 1 month ago

Hi @acornett21, they tested with v1.35.0 under the following condition (apiservice):

kubectl get apiservice | grep False
v1alpha1.example.com                           try/api                                     False (ServiceNotFound)   27m

The result is the same error as in version v1.34.0:

2024-07-24 15:29:13,913 p=3470 u=ansible n=ansible | TASK [cnf_status : Store CNF status and data] **********************************
2024-07-24 15:29:13,913 p=3470 u=ansible n=ansible | fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to create object: b'Unable to determine if virtual resource\\n'", "reason": "Internal Server Error"}
2024-07-24 15:29:13,914 p=3470 u=ansible n=ansible | PLAY RECAP

wying3 commented 3 weeks ago

Hi @acornett21, do you have any suggestions on the above test result?