redhat-ztp / ztp-cluster-deploy

Autoruler deployment is not created in control plane cluster #149

traghave123 commented 3 years ago

Steps followed:

  1. Deploy a control plane cluster with 3 master nodes using the ZTP playbooks
  2. Import the control plane cluster into RHACM deployed on the management cluster leo
  3. Add a worker node to the control plane cluster

We observed that the CSRs are not getting auto-approved:

[taurus@taurus-kvm-server rhacm]$ oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-lb47b   54m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-p4r9z   8m49s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rj8fk   24m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-rx8pg   39m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-tq7wm   69m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending

While debugging, we found that the autoruler pods are not created:

[taurus@taurus-kvm-server rhacm]$ oc get pods -n node-autolabeler
No resources found in node-autolabeler namespace.

We believe these pods are responsible for auto-approving the CSRs.

Could you please help here?
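
In the meantime, the pending kubelet-client CSRs can be approved by hand. Below is a minimal illustrative sketch using the kubernetes Python client and the certificates.k8s.io/v1 API (it assumes a cluster-admin kubeconfig and is not part of this repo):

# approve_pending_csrs.py: stopgap until the autoruler deployment exists.
# Assumes a cluster-admin kubeconfig and the certificates.k8s.io/v1 API.
from kubernetes import client, config

config.load_kube_config()
certs = client.CertificatesV1Api()

for csr in certs.list_certificate_signing_request().items:
    conditions = (csr.status.conditions if csr.status else None) or []
    if any(c.type in ("Approved", "Denied") for c in conditions):
        continue  # already decided
    if csr.spec.signer_name != "kubernetes.io/kube-apiserver-client-kubelet":
        continue  # only touch kubelet client CSRs like the ones listed above
    csr.status.conditions = conditions + [
        client.V1CertificateSigningRequestCondition(
            type="Approved",
            status="True",
            reason="ManualApproval",
            message="Approved manually while node-autolabeler is unavailable",
        )
    ]
    certs.replace_certificate_signing_request_approval(csr.metadata.name, csr)
    print("approved", csr.metadata.name)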

traghave123 commented 3 years ago

I tried creating the artifacts below manually:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: autoruler
  name: autoruler
  namespace: node-autolabeler
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: autoruler
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: autoruler
    spec:
      containers:
      - image: quay.io/karmab/autosigner:latest
        imagePullPolicy: Always
        name: autosigner
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - image: quay.io/karmab/autolabeller:latest
        imagePullPolicy: Always
        name: autolabeller
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: autoruler-sa-role
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - '*'
  verbs:
  - '*'

---  
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: autoruler-sa-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: autoruler-sa-role
subjects:
  - kind: ServiceAccount
    name: default
    namespace: node-autolabeler

This time the CSRs got approved automatically.
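
To confirm that the ClusterRoleBinding actually grants the default ServiceAccount what the autosigner needs, the access can be checked with a SubjectAccessReview; an illustrative snippet using the kubernetes Python client:

# Ask the apiserver whether the default ServiceAccount in node-autolabeler
# may update the approval subresource of CSRs.
from kubernetes import client, config

config.load_kube_config()
authz = client.AuthorizationV1Api()

review = client.V1SubjectAccessReview(
    spec=client.V1SubjectAccessReviewSpec(
        user="system:serviceaccount:node-autolabeler:default",
        resource_attributes=client.V1ResourceAttributes(
            group="certificates.k8s.io",
            resource="certificatesigningrequests",
            subresource="approval",
            verb="update",
        ),
    )
)
print("allowed:", authz.create_subject_access_review(review).status.allowed)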

However, we think these artifacts should be created automatically. Could you please let us know what we are missing in order to have them created automatically in the control plane cluster?

yrobla commented 3 years ago

They should be created automatically, because they are part of the day 2 configuration. Can you share the repository and configuration that you are using to configure your clusters? Thanks.

traghave123 commented 3 years ago

Hi @yrobla

We found an issue in the generated RHACM manifests; after fixing it, the node-labeller pods are running. Below is the repo/path we are using to create the RHACM policy.

https://github.com/traghave123/test-ran-manifests/tree/master/rhacm-manifests

However, we are intermittently facing an issue with CSR auto-approval, with the errors below. Could you please help?

Incorrect group in csr csr-hgl9p. Ignoring
Signing server cert csr-zprlf
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/autosigner.py", line 96, in watch_csrs
    certs_api.replace_certificate_signing_request_approval(csr_name, body)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api/certificates_v1beta1_api.py", line 1439, in replace_certificate_signing_request_approval
    return self.replace_certificate_signing_request_approval_with_http_info(name, body, **kwargs)  # noqa: E501
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api/certificates_v1beta1_api.py", line 1548, in replace_certificate_signing_request_approval_with_http_info
    collection_formats=collection_formats)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 405, in request
    body=body)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/rest.py", line 290, in PUT
    body=body)
  File "/usr/lib/python3.7/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'a1223e40-645e-4310-82e3-2623a948f2bb', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Warning': '299 - "certificates.k8s.io/v1beta1 CertificateSigningRequest is deprecated in v1.19+, unavailable in v1.22+; use certificates.k8s.io/v1 CertificateSigningRequest"', 'X-Kubernetes-Pf-Flowschema-Uid': '8045ad41-3b0e-4264-86e3-1f03a8185467', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'b40c123a-2f0e-4194-856d-ae119ea2d75b', 'Date': 'Thu, 15 Jul 2021 03:18:40 GMT', 'Content-Length': '396'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Operation cannot be fulfilled on certificatesigningrequests.certificates.k8s.io \"csr-zprlf\": the object has been modified; please apply your changes to the latest version and try again","reason":"Conflict","details":{"name":"csr-zprlf","group":"certificates.k8s.io","kind":"certificatesigningrequests"},"code":409}

Please also find the full logs in the attached file: ErrorDuringCSRApproval.txt
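
For context on the 409: replace_certificate_signing_request_approval performs a PUT carrying the resourceVersion from an earlier read, and the apiserver rejects the write if the CSR was modified in between (optimistic concurrency). The usual pattern is to re-read the object and retry; an illustrative sketch (approve_with_retry is a hypothetical helper, not a function in autosigner.py, and it uses the certificates.k8s.io/v1 API that the deprecation warning above recommends):

from kubernetes import client
from kubernetes.client.exceptions import ApiException

def approve_with_retry(certs_api, csr_name, attempts=5):
    # certs_api is a client.CertificatesV1Api instance.
    for attempt in range(attempts):
        # Re-read before every attempt so the PUT carries a fresh resourceVersion.
        csr = certs_api.read_certificate_signing_request(csr_name)
        conditions = (csr.status.conditions if csr.status else None) or []
        if any(c.type == "Approved" for c in conditions):
            return  # another writer already approved it
        csr.status.conditions = conditions + [
            client.V1CertificateSigningRequestCondition(
                type="Approved",
                status="True",
                reason="AutoApproved",
                message="approved after refreshing resourceVersion",
            )
        ]
        try:
            certs_api.replace_certificate_signing_request_approval(csr_name, csr)
            return
        except ApiException as exc:
            if exc.status != 409 or attempt == attempts - 1:
                raise  # only stale-resourceVersion conflicts are retried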

yrobla commented 3 years ago

Is this problem still happening? After getting some feedback, it seems this is a transient conflict that should disappear once the auto-approver retries.

traghave123 commented 3 years ago

@yrobla Yeah, this happens randomly. When a worker node is added, the CSR sometimes does not get approved; when that happens, we have to restart the autolabeller pod with the command below, after which auto-approval works again:

oc delete pod autoruler-68f74b547c-gdwgh -n node-autolabeler

Kindly help us understand how we can avoid restarting the pods; as you can see, there are errors in the logs of the pod.
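
Rather than hardcoding the generated pod name, the pods can also be bounced by label and recreated by the Deployment; an illustrative snippet using the kubernetes Python client:

# Delete the autoruler pods by label; the Deployment recreates them.
from kubernetes import client, config

config.load_kube_config()
client.CoreV1Api().delete_collection_namespaced_pod(
    namespace="node-autolabeler", label_selector="app=autoruler"
)

Because the containers are configured with imagePullPolicy: Always, the same bounce also pulls a freshly pushed image on restart.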

yrobla commented 3 years ago

A fix has been pushed to the autolabeler image. Could you redeploy, ensuring that you have the latest images, and check whether the problem is fixed? Thanks.
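
One way to confirm the redeploy actually picked up the latest images is to compare the image digests reported by the running pods; an illustrative snippet:

# Print the pulled image digest of each autoruler container.
from kubernetes import client, config

config.load_kube_config()
for pod in client.CoreV1Api().list_namespaced_pod("node-autolabeler").items:
    for status in pod.status.container_statuses or []:
        print(pod.metadata.name, status.name, status.image_id)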