nokia / danm

TelCo grade network management in a Kubernetes cluster
BSD 3-Clause "New" or "Revised" License

Danm installation through installer not going through #238

Closed: sriramec closed this issue 3 years ago

sriramec commented 4 years ago

Is this a BUG REPORT or FEATURE REQUEST?: bug

What happened: DANM installation through the installer does not complete; the installer job keeps crashing.

What you expected to happen: DANM installation should succeed using the installer job.

How to reproduce it: Modify danm-installer-config.yaml for the bootstrap CNI (I specified Calico). Make sure /etc/cni/net.d/calico.conf exists on all nodes of the cluster. Install DANM using the installer. The installer job crashes.
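A quick way to double-check that precondition (illustrative only; the node names come from the outputs below, and passwordless SSH to them is assumed):

# Illustrative pre-check: confirm the bootstrap CNI config exists on each node
# before applying integration/install.
for node in master-node worker01; do
  ssh "$node" 'ls /etc/cni/net.d/'
done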

root@master-node:~/sriram/7-10-2020/danm# kubectl apply -f integration/install
serviceaccount/danm-installer created
clusterrole.rbac.authorization.k8s.io/caas:danm-installer created
clusterrolebinding.rbac.authorization.k8s.io/caas:danm-installer created
configmap/danm-installer-config created
job.batch/danm-installer created

root@master-node:~/sriram/7-10-2020/danm# kubectl get pods -n kube-system
NAME                                       READY   STATUS             RESTARTS   AGE
calico-kube-controllers-675b7c9569-mq98v   1/1     Running            0          5h26m
calico-node-48zx2                          1/1     Running            0          5h25m
calico-node-pb552                          1/1     Running            0          5h26m
coredns-f9fd979d6-8qmb9                    1/1     Running            0          5h28m
coredns-f9fd979d6-lmmvn                    1/1     Running            0          5h28m
danm-installer-bk5wn                       0/1     CrashLoopBackOff   7          13m
etcd-master-node                           1/1     Running            0          5h28m

root@master-node:/etc/cni/net.d# pwd
/etc/cni/net.d
root@master-node:/etc/cni/net.d# ls
calico.conf  calico-kubeconfig

root@worker01:/etc/cni/net.d# pwd
/etc/cni/net.d
root@worker01:/etc/cni/net.d# ls
calico.conf  calico-kubeconfig
root@worker01:/etc/cni/net.d#

These are the logs that I see. Since the pod restarts repeatedly, the logs say the resources already exist, but I had cleaned everything up before doing the installation.

root@master-node:~/sriram/7-10-2020/danm/integration/install# kubectl logs danm-installer-bk5wn -f -n kube-system
Not using any image registry prefix
Not using any image tag
Not using any image pull secret

Reading Kubernetes API server certificate

Applying CRDs to extend Kubernetes API...
customresourcedefinition.apiextensions.k8s.io/clusternetworks.danm.k8s.io unchanged
customresourcedefinition.apiextensions.k8s.io/danmeps.danm.k8s.io unchanged
customresourcedefinition.apiextensions.k8s.io/tenantconfigs.danm.k8s.io unchanged
customresourcedefinition.apiextensions.k8s.io/tenantnetworks.danm.k8s.io unchanged

Creating Service Account
Error from server (AlreadyExists): serviceaccounts "danm" already exists
clusterrole.rbac.authorization.k8s.io/caas:danm unchanged
clusterrolebinding.rbac.authorization.k8s.io/caas:danm unchanged

Creating WebHook certificate...
creating certs in tmpdir /tmp/tmp.ooBNpd
Generating RSA private key, 2048 bit long modulus (2 primes)
..............................................+++++
...+++++
e is 65537 (0x010001)
Error from server (AlreadyExists): error when creating "STDIN": certificatesigningrequests.certificates.k8s.io "danm-webhook-svc.kube-system" already exists

I see that this CSR is present:

root@master-node:~/sriram/7-10-2020/danm# kubectl get csr -n kube-system
NAME                           AGE   SIGNERNAME                     REQUESTOR                                          CONDITION
danm-webhook-svc.kube-system   10m   kubernetes.io/legacy-unknown   system:serviceaccount:kube-system:danm-installer   Pending

It's not getting approved. All the images required for the DANM installation are present in the cluster. Is there anything I'm missing? Please suggest.


eMGabriel commented 4 years ago

Hi @sriramec,

The first issue is that the danm-installer pod is in CrashLoopBackOff. Please also include the output of kubectl describe for the danm-installer-* pod; from it we may find out why the pod cannot run.

sriramec commented 4 years ago

Hi @eMGabriel,

Please find the output of "kubectl describe pod danm-installer-48847 -n kube-system":

root@master-node:/etc/cni/net.d# kubectl describe pod danm-installer-48847 -n kube-system
Name:         danm-installer-48847
Namespace:    kube-system
Priority:     0
Node:         worker01/192.168.56.9
Start Time:   Thu, 08 Oct 2020 00:15:56 +0530
Labels:       controller-uid=635b12fa-9ed0-4f0a-a8c8-02202bbae6a3
              job-name=danm-installer
Annotations:  cni.projectcalico.org/podIP: 172.17.5.19/32
              cni.projectcalico.org/podIPs: 172.17.5.19/32
Status:       Running
IP:           172.17.5.19
IPs:
  IP:           172.17.5.19
Controlled By:  Job/danm-installer
Containers:
  danm-installer:
    Container ID:   docker://188a3615b98c2a6d2aa35de3f3b4779c6041eebe9ba090c3c34523c90ad4b662
    Image:          danm-installer:latest
    Image ID:       docker://sha256:76de65eb0ab0a1f5b54da9c4eb6a71b66c36591f1936e9f7109218ec8ce465de
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 09 Oct 2020 10:27:18 +0530
      Finished:     Fri, 09 Oct 2020 10:27:19 +0530
    Ready:          False
    Restart Count:  404
    Environment:    <none>
    Mounts:
      /config from danm-installer-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from danm-installer-token-krvjv (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  danm-installer-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      danm-installer-config
    Optional:  false
  danm-installer-token-krvjv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  danm-installer-token-krvjv
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                     From               Message
  ----     ------   ----                    ----               -------
  Normal   Pulled   42m (x397 over 34h)     kubelet, worker01  Container image "danm-installer:latest" already present on machine
  Warning  BackOff  2m26s (x9433 over 34h)  kubelet, worker01  Back-off restarting failed container
root@master-node:/etc/cni/net.d#

sriramec commented 4 years ago

Is there anything that I'm missing here? Let me know if more logs are required.

Regards, Sriram

eMGabriel commented 4 years ago

Please also include a 'previous' log: kubectl logs --previous danm-installer-<genid> -f -n kube-system. It has a weak chance of containing useful information; we shall see.

An idea came to mind: modify or rebuild the danm-installer image with the following modification (choose whichever option is easier for you).

This modification may solve the error: Error from server (AlreadyExists): error when creating "STDIN": certificatesigningrequests.certificates.k8s.io "danm-webhook-svc.kube-system" already exists
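A minimal sketch of one such modification (my assumption, since the concrete change is not shown in this thread): clear the stale CSR left over from a previous run so the installer can recreate it, or approve the pending one.

# Hypothetical cleanup before re-running the installer (assumes RBAC rights
# to manage cluster-scoped CSRs); delete the leftover request:
kubectl delete csr danm-webhook-svc.kube-system
# ...or approve the pending one instead:
kubectl certificate approve danm-webhook-svc.kube-system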

toshiiw commented 4 years ago

I think I hit the same problem with k8s 1.19.2. The installer pod generates this weird error.

error: no kind "CertificateSigningRequest" is registered for version "certificates.k8s.io/v1" in scheme "k8s.io/kubectl/pkg/scheme/scheme.go:28"

The pod uses kubectl 1.17.4, which is more than one minor version behind the 1.19.2 API server and thus violates the Kubernetes version skew policy. Changing KUBECTL_VERSION in scm/build/Dockerfile.install to 1.18.9 did the trick.
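A one-line way to make that change (a sketch; it assumes the Dockerfile pins the version in a KUBECTL_VERSION=1.17.4 line, as this comment suggests):

# Bump the pinned kubectl to stay within one minor version of the cluster.
sed -i 's/KUBECTL_VERSION=1.17.4/KUBECTL_VERSION=1.18.9/' scm/build/Dockerfile.install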

Levovar commented 4 years ago

If that solves the problem, we might need to bump our dependencies.

sriramec commented 4 years ago

Thanks everyone for the suggestions. In scm/build/Dockerfile.install I set the kubectl version to 1.19.1, since the Kubernetes version in my setup was 1.19.2. It is working fine now.
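For completeness, the rebuild steps that would follow such an edit (a sketch; the image name danm-installer:latest is taken from the pod description above, and the build context is assumed to be the repository root):

# Rebuild the installer image after editing the Dockerfile, then re-run the job.
docker build -f scm/build/Dockerfile.install -t danm-installer:latest .
kubectl delete job danm-installer -n kube-system
kubectl apply -f integration/install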