redhat-ztp / ztp-cluster-deploy


[ibm-telco] Issue with certificate approval while adding worker nodes #116

Open maheshd2 opened 3 years ago

maheshd2 commented 3 years ago

@yrobla We are facing an issue with certificate approval while adding worker nodes. Please go through the use cases below, which list what we tried and the corresponding issues we are facing.

Use Case 1: Imported the control plane cluster (spoke cluster) into the ACM Hub and then added a worker node. The certificates were approved, and the node was subsequently listed by the 'oc' command.

Use Case 2: Removed the above worker node from the cluster and then re-added the same node. In this case, the certs are not getting approved and the worker node is not listed by the 'oc' command.

Use Case 3: Added a 2nd worker node to the existing cluster. Again, the certs are not getting approved for the 2nd worker node, and it is not listed by the 'oc' command.

As discussed earlier, we understood that importing the cluster into ACM would take care of approving the certs.

So please let us know how to solve the issues we are facing now.

yrobla commented 3 years ago

Have you checked if the auto-labeler pods are running in your cluster, and can you check the logs there? Also, if you do oc get csr -A, do you see any pending certificate requests?
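For reference, a minimal sketch of how the pending requests could be picked out of that output programmatically. It assumes the default table layout printed by `oc get csr` (name in the first column, condition in the last); `pending_csrs` is a hypothetical helper name used only for illustration, not part of any tool.

```python
from typing import List


def pending_csrs(oc_output: str) -> List[str]:
    """Return the names of CSRs whose CONDITION column is exactly 'Pending'.

    Assumes the first line is the table header and that columns are
    separated by whitespace, as in the default `oc get csr` output.
    """
    names = []
    for line in oc_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[-1] == "Pending":
            names.append(fields[0])
    return names
```

Any names it returns are requests that nothing has approved yet, which is the symptom to look for when a new worker does not show up in the node list.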

maheshd2 commented 3 years ago

@yrobla It seems to have been an environmental issue. We did a fresh deployment and found the node-autolabeler pods running, and all 3 use cases mentioned above are working fine. We can close this issue.

maheshd2 commented 3 years ago

@yrobla FYI,

We faced the CSR approval issue again, but it worked after restarting the node labeller pod.

This is the error we observed in the node labeller pod:

[root@vcuhost tmp]# oc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-8rdtx   78s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
[root@vcuhost tmp]# oc logs -f autoruler-68f74b547c-gdwgh
Error from server (NotFound): pods "autoruler-68f74b547c-gdwgh" not found
[root@vcuhost tmp]# oc logs -f autoruler-68f74b547c-gdwgh -n node-autolabeler
error: a container name must be specified for pod autoruler-68f74b547c-gdwgh, choose one of: [autosigner autolabeller]
[root@vcuhost tmp]# oc logs -f autoruler-68f74b547c-gdwgh -n node-autolabeler -c autolabeler
error: container autolabeler is not valid for pod autoruler-68f74b547c-gdwgh
[root@vcuhost tmp]# oc logs -f autoruler-68f74b547c-gdwgh -n node-autolabeler -c autosigner
Missing configmap autorules in namespace node-autolabeler
No rules defined, using dummy worker one
Handling name rule .*worker.*
No specific allowed_networks defined. No check on ips will be made
Starting main signing loop...
Signing client cert csr-wwvt2
Signing server cert csr-bjzzl
Signing client cert csr-brxcs
Signing server cert csr-h552d
Signing client cert csr-4zh6n
Signing server cert csr-gsdtn
Signing client cert csr-jvrh9
Signing server cert csr-x7r2v
Signing client cert csr-q6kwz
Signing server cert csr-8tmtf
Signing client cert csr-pvpk5
Signing server cert csr-sfq2s
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Incorrect username in csr csr-gt9rm. Ignoring
Signing server cert csr-6mz2c
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/autosigner.py", line 96, in watch_csrs
    certs_api.replace_certificate_signing_request_approval(csr_name, body)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api/certificates_v1beta1_api.py", line 1439, in replace_certificate_signing_request_approval
    return self.replace_certificate_signing_request_approval_with_http_info(name, body, **kwargs)  # noqa: E501
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api/certificates_v1beta1_api.py", line 1548, in replace_certificate_signing_request_approval_with_http_info
    collection_formats=collection_formats)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 405, in request
    body=body)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/rest.py", line 290, in PUT
    body=body)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': '94d481fc-a7d1-425a-a123-140239882d79', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Warning': '299 - "certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest"', 'X-Kubernetes-Pf-Flowschema-Uid': 'fcc80339-a9a5-4aba-af4c-b3017f0e1fac', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e4bf808b-72ff-4a15-b2d2-b7e9fdd001c8', 'Date': 'Thu, 27 May 2021 09:18:10 GMT', 'Content-Length': '396'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on certificatesigningrequests.certificates.k8s.io \"csr-6mz2c\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"csr-6mz2c","group":"certificates.k8s.io","kind":"certificatesigningrequests"},"code":409}
^C
[root@vcuhost tmp]# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-8rdtx   3m31s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
[root@vcuhost tmp]# oc get pods --all-namespaces | grep auto
node-autolabeler                                   autoruler-68f74b547c-gdwgh                                        2/2     Running     0          6d18h
openshift-machine-api                              cluster-autoscaler-operator-558bcc56d-kvhhc                       2/2     Running     0          6d18h
[root@vcuhost tmp]# oc delete pod autoruler-68f74b547c-gdwgh -n node-autolabeler
pod "autoruler-68f74b547c-gdwgh" deleted
[root@vcuhost tmp]#
[root@vcuhost tmp]# oc get pods --all-namespaces | grep auto
node-autolabeler                                   autoruler-68f74b547c-9g2bd                                        2/2     Running     0          38s
openshift-machine-api                              cluster-autoscaler-operator-558bcc56d-kvhhc                       2/2     Running     0          6d18h
[root@vcuhost tmp]# oc logs -f autoruler-68f74b547c-gdwgh -n node-autolabeler^Cc autolabeler
[root@vcuhost tmp]# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-8rdtx   4m54s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qjjb9   37s     kubernetes.io/kubelet-serving                 system:node:sos-worker-2                                                    Approved,Issued
[root@vcuhost tmp]# oc get nodes
NAME           STATUS     ROLES           AGE     VERSION
sos-master-1   Ready      master,worker   6d20h   v1.19.0+e49167a
sos-master-2   Ready      master,worker   6d20h   v1.19.0+e49167a
sos-master-3   Ready      master,worker   6d20h   v1.19.0+e49167a
sos-worker-2   NotReady   worker          55s     v1.19.0+e49167a
[root@vcuhost tmp]# watch oc get nodes
[root@vcuhost tmp]# cd /tmp/
[root@vcuhost tmp]# ls -lrth

traghave123 commented 3 years ago

@yrobla Could you please check the above error in the autoruler-68f74b547c-gdwgh pod? The issue seems to be happening more frequently.
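The traceback above ends with an unhandled 409 Conflict raised from `watch_csrs` in Thread-1, and csr-8rdtx stayed Pending until the pod was restarted, so the signing thread appears to stop after the exception. A minimal sketch of one possible mitigation is to re-read the CSR and retry the approval when the API server reports a conflict. This is not the actual autosigner code: `approve_with_retry`, `read_csr`, `send_approval` and `ConflictError` are hypothetical names, and the only assumption about the real client is that its `ApiException` carries a `status` attribute.

```python
import time


class ConflictError(Exception):
    """Stand-in for an API exception whose `status` is 409 (assumption:
    the real kubernetes ApiException exposes the HTTP code as `.status`)."""
    status = 409


def approve_with_retry(read_csr, send_approval, csr_name, attempts=3, delay=0.5):
    """Approve a CSR, retrying on 409 Conflict with a freshly read object.

    read_csr(name)            -> latest version of the CSR (hypothetical callable)
    send_approval(name, csr)  -> sets the approval on `csr` and submits it
    Any error other than a 409, or a 409 on the last attempt, is re-raised.
    """
    for attempt in range(1, attempts + 1):
        csr = read_csr(csr_name)  # always start from the latest version
        try:
            return send_approval(csr_name, csr)
        except Exception as exc:
            is_conflict = getattr(exc, "status", None) == 409
            if not is_conflict or attempt == attempts:
                raise
            time.sleep(delay)  # brief pause before re-reading and retrying
```

In the signing loop, the two callables would wrap the client's read and approval calls. Catching and logging any remaining exception inside the loop, rather than letting it end the thread, would also keep later CSRs from being left Pending.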