openshift / verification-tests

Blackbox test suite for OpenShift.
GNU General Public License v3.0

[OCPQE-24496] Add pods/nodes debugging info when clusteroperator does not transition to expected status #3640

Closed: xingxingxia closed this 1 month ago

xingxingxia commented 1 month ago

Failures like OCPQE-24496 occur from time to time. They are often caused by an unrelated system issue rather than by auth itself: for example, some master nodes may not be Ready, which leaves the auth pods on those nodes Pending. This PR prints pod and node debugging info when a clusteroperator does not transition to the expected status, so that an auth-side cause is easier to rule out. Demo test log (/job/Runner-v3-smoke/7616/console) is below:

    Given operator "authentication" becomes available/non-progressing/non-degraded within 30 seconds # features/step_definitions/operators.rb:188
      [09:59:54] INFO> #### Operator authentication Expected conditions: {"Available"=>"True", "Progressing"=>"False", "Degraded"=>"False"}
      [09:59:54] INFO> #### After 1.0035127690061927 seconds and 1 iterations operator authentication becomes: {"Available"=>"True", "Progressing"=>"False", "Degraded"=>"False"}

    Given operator "authentication" becomes available/progressing/non-degraded within 30 seconds     # features/step_definitions/operators.rb:188
...
      [09:59:55] INFO> Exit Status: 0
      [10:00:19] INFO> last 3 messages repeated 4 times
      [10:00:24] INFO> #### Operator authentication Expected conditions: {"Available"=>"True", "Progressing"=>"True", "Degraded"=>"False"}
      [10:00:24] INFO> #### Checking pods in namespace openshift-authentication:
      [10:00:24] INFO> Shell Commands: oc get pods -o wide --kubeconfig=/home/jenkins/ws/workspace/Runner-v3-smoke/workdir/ocp4_admin.kubeconfig --namespace=openshift-authentication
      NAME                               READY   STATUS    RESTARTS   AGE    IP            NODE                                                          NOMINATED NODE   READINESS GATES
      oauth-openshift-554444bc68-f5db2   1/1     Running   0          153m   10.128.0.55   xxxxxxxxxxxx-master-2.us-central1-c.c.openshift-qe.internal   <none>           <none>
      oauth-openshift-554444bc68-l4sf8   1/1     Running   0          152m   10.130.0.60   xxxxxxxxxxxx-master-0.us-central1-a.c.openshift-qe.internal   <none>           <none>
      oauth-openshift-554444bc68-qmz5w   1/1     Running   0          152m   10.129.0.80   xxxxxxxxxxxx-master-1.us-central1-b.c.openshift-qe.internal   <none>           <none>
      [10:00:25] INFO> Exit Status: 0
      [10:00:25] INFO> #### Checking pods in namespace openshift-oauth-apiserver:
      [10:00:25] INFO> Shell Commands: oc get pods -o wide --kubeconfig=/home/jenkins/ws/workspace/Runner-v3-smoke/workdir/ocp4_admin.kubeconfig --namespace=openshift-oauth-apiserver
      NAME                         READY   STATUS    RESTARTS   AGE    IP            NODE                                                          NOMINATED NODE   READINESS GATES
      apiserver-687fc644d6-ghgdc   1/1     Running   0          166m   10.129.0.63   xxxxxxxxxxxx-master-1.us-central1-b.c.openshift-qe.internal   <none>           <none>
      apiserver-687fc644d6-kwtcn   1/1     Running   0          165m   10.128.0.37   xxxxxxxxxxxx-master-2.us-central1-c.c.openshift-qe.internal   <none>           <none>
      apiserver-687fc644d6-q22lb   1/1     Running   0          164m   10.130.0.48   xxxxxxxxxxxx-master-0.us-central1-a.c.openshift-qe.internal   <none>           <none>
      [10:00:26] INFO> Exit Status: 0
      [10:00:26] INFO> #### Checking nodes:
      [10:00:26] INFO> Shell Commands: oc get nodes --kubeconfig=/home/jenkins/ws/workspace/Runner-v3-smoke/workdir/ocp4_admin.kubeconfig
      NAME                                                          STATUS   ROLES                  AGE    VERSION
      xxxxxxxxxxxx-master-0.us-central1-a.c.openshift-qe.internal   Ready    control-plane,master   174m   v1.26.11+4ad3e1b
      xxxxxxxxxxxx-master-1.us-central1-b.c.openshift-qe.internal   Ready    control-plane,master   174m   v1.26.11+4ad3e1b
      xxxxxxxxxxxx-master-2.us-central1-c.c.openshift-qe.internal   Ready    control-plane,master   174m   v1.26.11+4ad3e1b
      xxxxxxxxxxxx-worker-a-nvt8q                                   Ready    worker                 165m   v1.26.11+4ad3e1b
      xxxxxxxxxxxx-worker-b-rkq7j                                   Ready    worker                 165m   v1.26.11+4ad3e1b
      xxxxxxxxxxxx-worker-c-lsn25                                   Ready    worker                 165m   v1.26.11+4ad3e1b
      [10:00:27] INFO> Exit Status: 0
      The authentication operator still didn't become {"Available"=>"True", "Progressing"=>"True", "Degraded"=>"False"} after 30 seconds (RuntimeError)
      /home/jenkins/ws/workspace/Runner-v3-smoke/features/step_definitions/operators.rb:249:in `/^operator "(.+?)" becomes ([\S]+|<%=.+?%>)(?: within ([0-9]+|<%=.+?%>) seconds)?$/'
      features/test/operators.feature:15:in `operator "authentication" becomes available/progressing/non-degraded within 30 seconds'
waiting for operation up to 3600 seconds.. 
waiting for operation up to 3600 seconds.. 
      [10:00:27] INFO> === After Scenario: test print pods and nodes info unless success ===
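
For readers who want the behavior in the log above in compact form, here is a minimal sketch of the idea, not the actual code in `features/step_definitions/operators.rb`. The helper names (`parse_expected`, `debug_commands`, `wait_for_conditions`) and the injected `fetch`/`run` callables are hypothetical. The mapping from step text to conditions is inferred from the log lines above, and the `--kubeconfig` option seen in the log is left out for brevity.

```ruby
# Hypothetical sketch; none of these names are taken from operators.rb.

# Turns "available/non-progressing/non-degraded" into
# {"Available"=>"True", "Progressing"=>"False", "Degraded"=>"False"}
# (mapping inferred from the demo log, so treat it as an assumption).
def parse_expected(spec)
  spec.split("/").to_h do |token|
    negated = token.start_with?("non-")
    [token.delete_prefix("non-").capitalize, negated ? "False" : "True"]
  end
end

# Commands of the kind the demo log shows after the operator misses its target.
def debug_commands(namespaces)
  namespaces.map { |ns| "oc get pods -o wide --namespace=#{ns}" } + ["oc get nodes"]
end

# Polls `fetch` (a callable returning a conditions hash) until every expected
# condition matches or `timeout` seconds have passed. On timeout, each debug
# command is handed to `run` before the error is raised.
def wait_for_conditions(name, expected, timeout:, fetch:, run:, interval: 5, namespaces: [])
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    current = fetch.call
    return current if expected.all? { |type, status| current[type] == status }
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
  debug_commands(namespaces).each { |cmd| run.call(cmd) }
  raise "The #{name} operator still didn't become #{expected} after #{timeout} seconds"
end
```

In a real step, `fetch` would read the clusteroperator conditions and `run` would execute the command with admin credentials. Passing them in as callables keeps the sketch runnable without a cluster.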

CC @liangxia, @pruan-rht: please help review and merge. Thanks!

openshift-ci[bot] commented 1 month ago

@xingxingxia: all tests passed!

liangxia commented 1 month ago

/lgtm

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liangxia

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/verification-tests/blob/master/OWNERS)~~ [liangxia]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.