operator-framework / operator-sdk

SDK for building Kubernetes applications. Provides high level APIs, useful abstractions, and project scaffolding.
https://sdk.operatorframework.io
Apache License 2.0

Unable to deploy CNF when some metrics apiservices are in False or FailedDiscoveryCheck state #6782

Open ansvu opened 4 months ago

ansvu commented 4 months ago

Type of question

Question

What did you do?

A partner is using ansible-operator 1.34-2. When they tried to deploy their CNF, the following error occurred:

2024-07-05 06:25:14,241 p=17 u=ansible n=ansible | fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to create object: b'Unable to determine if virtual resource\\n'", "reason": "Internal Server Error"}

This is the task in the Ansible code that makes the failing API call:

- name: Store CNF status and data
  k8s:
    api_version: "{{ cnf_resource_api_map['ConfigMap'] }}"
    kind: ConfigMap
    state: present
    namespace: '{{ ansible_operator_meta.namespace }}'
    name: cnf-info-data

They noticed these two apiservices are in False or FailedDiscoveryCheck state:

v1beta1.custom.metrics.k8s.io                  kube-system/prometheus-adapter              False (FailedDiscoveryCheck)   118s
v1beta1.metrics.k8s.io                         gke-managed-metrics-server/metrics-server   False (FailedDiscoveryCheck)   38d

After they removed these two apiservices, the CNF deployment worked fine.
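To spot apiservices in this state quickly, the `kubectl get apiservice` table can be filtered for rows whose AVAILABLE column is False. The following is a minimal Python sketch, not part of the original report. The function name `unavailable_apiservices`, the header row, and the `v1.apps` row in the sample are assumptions added for illustration; the two failing rows are the ones quoted above.

```python
import re

# One row of `kubectl get apiservice` output:
# NAME  SERVICE  AVAILABLE [(Reason)]  AGE
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(True|False)(?:\s+\((\w+)\))?\s+(\S+)\s*$")


def unavailable_apiservices(table_text):
    """Return (name, service, reason) for every row whose AVAILABLE column is False."""
    found = []
    for line in table_text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header line, blank line, or unexpected format
        name, service, available, reason, _age = match.groups()
        if available == "False":
            found.append((name, service, reason or "Unknown"))
    return found


# Sample input: the header and the v1.apps row are illustrative assumptions.
SAMPLE = """\
NAME                            SERVICE                                     AVAILABLE                      AGE
v1.apps                         Local                                       True                           38d
v1beta1.custom.metrics.k8s.io   kube-system/prometheus-adapter              False (FailedDiscoveryCheck)   118s
v1beta1.metrics.k8s.io          gke-managed-metrics-server/metrics-server   False (FailedDiscoveryCheck)   38d
"""

if __name__ == "__main__":
    for name, service, reason in unavailable_apiservices(SAMPLE):
        print(f"{name} ({service}): {reason}")
```

In practice the text would come from piping `kubectl get apiservice` into the script instead of the embedded sample.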

They said that they did not observe any error in ansible-operator v1.31 when some apiservices were in the False state. Are there any changes in ansible-operator v1.34.2 that trigger this issue? Do all apiservices now need to be in the True state?

What did you expect to see?

CNF to be deployed without this error: "Failed to create object: b'Unable to determine if virtual resource'"

What did you see instead? Under which circumstances?

Any Ansible task that the operator runs through the Ansible k8s module throws the error.

Environment

Operator type:

ansible-operator 1.34-2

Kubernetes cluster type:

Google GKE

$ operator-sdk version

ansible-operator 1.34-2

$ go version (if language is Go)

NA

$ kubectl version v1.29.3

Additional context

Some existing issues have been reported, but none offers a solution other than the advice to fix the cluster health or remove the failing apiservices.
https://access.redhat.com/solutions/6813781

https://bugzilla.redhat.com/show_bug.cgi?id=2063774

https://github.com/operator-framework/operator-sdk/issues/5596

https://github.com/operator-framework/operator-sdk/pull/6222

acornett21 commented 4 months ago

Hi @ansvu, what version of operator-sdk is being used? There were many issues in the 1.34 series of both operator-sdk and the ansible plugin. I'd consider anything other than the latest of both to have potential issues. Can they test with 1.35.0 of operator-sdk? It contains 1.34.3 of the ansible plugin.

ansvu commented 4 months ago

Thanks @acornett21 for the info. They used the ansible-operator version from the community. So do you mean this version: quay.io/operator-framework/ansible-operator:v1.34.3?

acornett21 commented 4 months ago

@ansvu Are they just updating the image and not updating the version of the binary needed to scaffold/build a project?

ansvu commented 4 months ago

@acornett21 This CNF is a little bit special: they combined a Helm chart with ansible-operator, and they used the ansible-operator image straight from quay.io/operator-framework/ansible-operator. There is no OLM integration. It is designed and architected not only for OCP but for other Kubernetes clusters as well.

acornett21 commented 4 months ago

@ansvu I understand, but they must have some YAML manifests that go along with the ansible operator. So wouldn't they be using operator-sdk to build/bundle/etc. those manifests? If so, those versions should be in sync.

ansvu commented 4 months ago

Hi @acornett21, as far as I know, they don't use operator-sdk to build the ansible-operator image (bundle/etc.) but use the ansible-operator image from quay.io/operator-framework/ansible-operator. We have asked them how they maintain or modify the build/bundle/manifests. We just noticed that version 1.35.0 was built 2 hours ago: quay.io/operator-framework/ansible-operator:v1.35.0. Can they test with this version 1.35.0? Thanks.

ansvu commented 4 months ago

Hi @acornett21, they tested with version v1.35.0 under the following condition (apiservice):

kubectl get apiservice | grep False                                                                                            
v1alpha1.example.com                           try/api                                     False (ServiceNotFound)   27m 

The result is the same error as in version v1.34.0:

2024-07-24 15:29:13,913 p=3470 u=ansible n=ansible | TASK [cnf_status : Store CNF status and data] **********************************                      
2024-07-24 15:29:13,913 p=3470 u=ansible n=ansible | fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to create object: b'Unable to determine if virtual resource\\n'", "reason": "Internal Server Error"}                                                                                            
2024-07-24 15:29:13,914 p=3470 u=ansible n=ansible | PLAY RECAP

wying3 commented 4 months ago

Hi @acornett21, do you have any suggestions on the test result above?

komish commented 2 months ago

Hey folks, just wanted to add more information here. To me, it would seem like https://github.com/operator-framework/operator-sdk/pull/6222 is a potential fix to this problem, given the error comes from that proxy code.

Granted, this code has since moved to the ansible-operator-plugins repo, so the equivalent would be here: https://github.com/operator-framework/ansible-operator-plugins/blob/main/internal/ansible/proxy/inject_owner.go#L86-L96

From what I can tell, #6222 stalled because a proper test case wasn't found. Based on my testing, you can just stand up an APIService with an invalid service reference and it should immediately trigger this issue.

E.g.

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.example.com
spec:
  caBundle: 'Zm9vCg=='
  group: example.com
  groupPriorityMinimum: 1000
  service:
    name: example-api
    namespace: non-existent
    port: 443
  version: v1alpha1
  versionPriority: 15

The APIServer accepts this, but it immediately becomes unavailable because the underlying service is not found.
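To illustrate why one broken APIService can block an unrelated ConfigMap request, here is a toy Python model of API discovery with and without tolerance for per-group failures. It is a sketch of the general technique only, and an assumption on my part: it does not reproduce what #6222 or inject_owner.go actually does. `GroupDiscoveryError`, `discover_kinds`, and the `FETCHERS` table are hypothetical names invented for this example.

```python
class GroupDiscoveryError(Exception):
    """Hypothetical error: one API group/version could not be discovered."""


def discover_kinds(fetchers, tolerate_failures):
    """Build a kind -> group/version map from per-group fetchers.

    fetchers maps a group/version string to a zero-argument callable that
    returns a list of kinds or raises GroupDiscoveryError.
    Returns (kinds, failed_group_versions).
    """
    kinds, failed = {}, []
    for group_version, fetch in fetchers.items():
        try:
            for kind in fetch():
                kinds.setdefault(kind, group_version)
        except GroupDiscoveryError:
            if not tolerate_failures:
                raise  # strict mode: one broken group fails everything
            failed.append(group_version)  # tolerant mode: remember and move on
    return kinds, failed


def broken_group():
    # Stands in for an APIService whose backing service does not exist.
    raise GroupDiscoveryError("service not found")


# Illustrative discovery table: the core group works, example.com does not.
FETCHERS = {
    "v1": lambda: ["ConfigMap", "Secret"],
    "example.com/v1alpha1": broken_group,
}

if __name__ == "__main__":
    kinds, failed = discover_kinds(FETCHERS, tolerate_failures=True)
    print("ConfigMap served by:", kinds["ConfigMap"])
    print("skipped groups:", failed)
```

In strict mode the same call raises before ConfigMap is ever resolved, which mirrors the symptom reported in this issue; in tolerant mode the lookup succeeds because ConfigMap lives in a healthy group.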

I'll leave it up to maintainers what they want to do with this information, or if they want to take #6222 and replicate it over in the ansible-operator-plugins repository.