smart-edge-open / converged-edge-experience-kits

Source code for experience kits with Ansible-based deployment.
Apache License 2.0

Errors in deploying Intel OpenNESS platform on VMs #30

Closed: pavanats closed this issue 4 years ago

pavanats commented 4 years ago

Hi, I have created a controller and edge node setup using 2 VMs. I am unable to get a properly running setup due to the following issues:

1. CrashLoopBackOff errors: the following pods are always failing.

[root@controller01 ~]# kubectl get pod -A -o wide | grep Crash

cdi                    cdi-operator-76b6694845-hvcvw           0/1   CrashLoopBackOff   23   9h    10.16.0.4    node01
kubernetes-dashboard   kubernetes-dashboard-7bfbb48676-6g7l4   0/1   CrashLoopBackOff   15   57m   10.16.0.8    node01
kubevirt               virt-operator-79c97797-qzctm            0/1   CrashLoopBackOff   11   23m   10.16.0.17   node01

2. Some warning/error messages I see are:

Warning  BackOff      14m (x31 over 23m)     kubelet, node01  Back-off restarting failed container
Warning  Unhealthy    9m22s (x20 over 24m)   kubelet, node01  Readiness probe failed: Get https://10.16.0.17:8443/metrics: dial tcp 10.16.0.17:8443: connect: connection refused
Warning  FailedMount  5m17s (x2 over 5m19s)  kubelet, node01  MountVolume.SetUp failed for volume "kubevirt-operator-token-9f87j" : failed to sync secret cache: timed out waiting for the condition

Warning  FailedMount  12m (x2 over 12m)  kubelet, node01  MountVolume.SetUp failed for volume "kubernetes-dashboard-certs" : failed to sync secret cache: timed out waiting for the condition
Warning  FailedMount  12m (x2 over 12m)  kubelet, node01  MountVolume.SetUp failed for volume "kubernetes-dashboard-token-2jj6c" : failed to sync secret cache: timed out waiting for the condition

My end goal is to deploy VMs with OpenNESS, but at present I can't even deploy a SampleApp. I would certainly appreciate any pointers.

Thank you. Pavan

amr-mokhtar commented 4 years ago

Hi @pavanats, Can you clarify what steps you used to bring up the cluster? Also, are you deploying the nodes on bare metal or inside VMs? Did you attempt the installation on a freshly installed OS?

pavanats commented 4 years ago

Following are the steps:

  1. Created 2 CentOS 7 VMs with the correct kernel version. From the controller VM, I ran the Ansible script to deploy the controller and then also deployed the network edge node. After a couple of attempts, I managed to get 0 failures; it didn't happen in one go.
  2. My 2 VMs run within a CentOS 8 based host machine using KVM. Each VM has 2 network interfaces.
  3. I used the CentOS-7-x86_64-DVD-1810 ISO file from http://repos-va.psychz.net/centos/7.6.1810/isos/x86_64/.

I haven't really been able to bring up a stable working setup so far, though once I did get the K8s dashboard up. Generally, some pod or other is in CrashLoopBackOff state.

pavanats commented 4 years ago

Hi Amr, I have shared the information on the community forum. I am presently stuck with these errors and would appreciate it if you could help resolve them. Regards, Pavan



amr-mokhtar commented 4 years ago

Sorry, I am a little confused. Are you deploying OpenNESS worker nodes as VMs and then deploying VM-based apps within them? Can you try to install without KubeVirt enabled? For this you can try the minimal flavor deployment.
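
If you go the minimal-flavor route, the flavor is passed to the deployment script; a minimal sketch, assuming the -f flag documented in the flavors guide combines with the controller/node arguments used elsewhere in this thread:

./deploy_ne.sh -f minimal controller   # controller without KubeVirt/CDI and other extras
./deploy_ne.sh -f minimal node         # then the edge node, with the same flavor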

pavanats commented 4 years ago

Hi Amr, VM1 is the controller VM. On it I also download the OpenNESS Experience Kits and run ./deploy_ne.sh controller. VM2 is the edge node VM. From VM1, I run ./deploy_ne.sh node to run the Ansible script for the edge node.
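
For reference, a sketch of that flow, assuming the standard inventory.ini layout from the experience-kit README, with the controller/node addresses that appear in the pod listings later in this thread:

# inventory.ini on VM1
[all]
controller ansible_ssh_user=root ansible_host=192.168.122.41
node01     ansible_ssh_user=root ansible_host=192.168.122.94

[controller_group]
controller

[edgenode_group]
node01

# then, from the kit directory on VM1:
./deploy_ne.sh controller
./deploy_ne.sh node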

My eventual goal is to deploy a VM-based VNF using KubeVirt. I am trying again on a fresh setup and will let you know if I still see the failures. Pavan



amr-mokhtar commented 4 years ago

I am not sure that is a supported case, running a VM inside a VM. You may want to consider changing the VNF into a CNF (Container Network Function); that should work.
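
Whether VM-in-VM can work typically comes down to nested virtualization on the host; two standard KVM checks (not OpenNESS-specific):

# On the CentOS 8 host: is nested virtualization enabled?
cat /sys/module/kvm_intel/parameters/nested   # expect Y or 1 (use kvm_amd on AMD hosts)

# Inside the worker-node VM: are VT-x/AMD-V extensions exposed to the guest?
egrep -c '(vmx|svm)' /proc/cpuinfo            # 0 means KubeVirt guests get no KVM acceleration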

surajit-ats commented 4 years ago

Hi Amr,

We will try a deployment without KubeVirt to see if that is causing the issue. Eventually, though, we do need KubeVirt support, as we expect some workloads to run as VNFs. For that we can move to bare metal.

Adding a few data points on what the event logs suggest.

The only pods stuck in CrashLoopBackOff are: cdi-operator, virt-operator, kubernetes-dashboard.

It is consistently these three pod types that get into the crash loop. What the events suggest: they all seem to be caused by the same issue. All of these pods need to volume-mount tokens/secrets that are stored as native K8s secrets, and this is what is failing.

The event logs look like this:

Warning  FailedMount  5m17s (x2 over 5m19s)  kubelet, node01  MountVolume.SetUp failed for volume "kubevirt-operator-token-9f87j" : failed to sync secret cache: timed out waiting for the condition

Needed by:

Volumes:
  kubevirt-operator-token-9f87j:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubevirt-operator-token-9f87j
    Optional:    false

So I checked whether these secrets were missing, but found they are available:

[root@controller01 ~]# kubectl get secrets -A | grep -i virt
kubevirt               default-token-mck9p                                      kubernetes.io/service-account-token   3      31h
kubevirt               kubevirt-operator-token-9f87j                            kubernetes.io/service-account-token   3      31h
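
Since the secrets exist, the "failed to sync secret cache" message points at kubelet's connection to the API server rather than at missing objects. Two checks worth running on node01 (port 6443 is the kubeadm default; substitute your controller address):

# Can node01 reach the API server on the controller?
curl -k https://<controller-ip>:6443/healthz              # expect: ok

# Kubelet-side errors around secret syncing
journalctl -u kubelet | grep -iE 'secret|timed out' | tail -20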

We are willing to move to bare metal, but before that we would prefer to verify whether the cause really is the inability to run VMs within VMs, because right now the event logs suggest issues with mounting secrets into the pods as volumes.

Thanks, Surajit

damiankopyto commented 4 years ago

At first look this looks like a connectivity issue; for some reason kubelet seems unable to connect to fetch the secret. Is this setup running behind any proxy? Which CNI is being used?
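
Generic commands that answer both questions (nothing OpenNESS-specific):

# Which CNI is deployed?
kubectl get pods -n kube-system -o wide | grep -iE 'ovn|calico|flannel|weave'

# Any proxy configured in the environment or for Docker?
env | grep -i proxy
systemctl show docker --property=Environment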

pavanats commented 4 years ago

Hi, We are running the setup on 2 VMs: one for the controller and another for the edge node. There is no proxy involved, and both VMs run on the same physical host. Thank you. Pavan



damiankopyto commented 4 years ago

Can you provide the full log for the operator: kubectl describe pod virt-operator-xxxxxxxxxx -n kubevirt
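
The crashed container's own log is often the most direct evidence as well; keeping the same placeholder pod name:

kubectl logs -n kubevirt virt-operator-xxxxxxxxxx --previous   # output of the last crashed container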

pavanats commented 4 years ago

Hi, please see the attached file for the requested logs. Pavan



damiankopyto commented 4 years ago

Hi @pavanats I cannot find the attachment.

pavanats commented 4 years ago

Hi Damian, I have pasted the CLI output below for the different commands. Pavan

[root@controller ~]# kubectl get pods -o wide -A
NAMESPACE     NAME                                                         READY   STATUS             RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
cdi           cdi-operator-76b6694845-vslzm                                0/1     CrashLoopBackOff   222        46h     10.16.0.9        node01
kube-system   coredns-66bff467f8-b9jjj                                     1/1     Running            0          47h     10.16.0.2        controller
kube-system   coredns-66bff467f8-q8dc5                                     1/1     Running            0          47h     10.16.0.3        controller
kube-system   descheduler-cronjob-1594306560-nrtkj                         0/1     Completed          0          5m45s   10.16.0.15       node01
kube-system   descheduler-cronjob-1594306680-xtnj4                         0/1     Completed          0          3m45s   10.16.0.8        node01
kube-system   descheduler-cronjob-1594306800-6wrcn                         0/1     Completed          0          105s    10.16.0.21       node01
kube-system   etcd-controller                                              1/1     Running            0          47h     192.168.122.41   controller
kube-system   kube-apiserver-controller                                    1/1     Running            0          47h     192.168.122.41   controller
kube-system   kube-controller-manager-controller                           1/1     Running            0          47h     192.168.122.41   controller
kube-system   kube-ovn-cni-gdts7                                           1/1     Running            6          21h     192.168.122.94   node01
kube-system   kube-ovn-cni-mxbz6                                           1/1     Running            6          47h     192.168.122.41   controller
kube-system   kube-ovn-controller-96f89c68b-swg9g                          1/1     Running            0          47h     192.168.122.41   controller
kube-system   kube-ovn-controller-96f89c68b-vmh2g                          1/1     Running            0          20h     192.168.122.94   node01
kube-system   kube-proxy-f2bcg                                             1/1     Running            0          47h     192.168.122.41   controller
kube-system   kube-proxy-h68tq                                             1/1     Running            0          21h     192.168.122.94   node01
kube-system   kube-scheduler-controller                                    1/1     Running            3          21h     192.168.122.41   controller
kube-system   ovn-central-74986486f9-4kzcn                                 1/1     Running            0          47h     192.168.122.41   controller
kube-system   ovs-ovn-fk8fn                                                1/1     Running            10         47h     192.168.122.41   controller
kube-system   ovs-ovn-mp7dm                                                1/1     Running            14         21h     192.168.122.94   node01
kubevirt      virt-operator-79c97797-9rfsh                                 0/1     CrashLoopBackOff   222        46h     10.16.0.7        node01
kubevirt      virt-operator-79c97797-qbzmk                                 0/1     CrashLoopBackOff   222        46h     10.16.0.6        node01
openness      docker-registry-deployment-54d5bb5c-ncf2v                    1/1     Running            0          21h     192.168.122.41   controller
openness      eaa-6f8b94c9d7-c8j6n                                         1/1     Running            0          20h     10.16.0.25       node01
openness      edgedns-qxphn                                                1/1     Running            0          20h     10.16.0.23       node01
openness      interfaceservice-rmtvf                                       1/1     Running            0          20h     10.16.0.26       node01
openness      nfd-release-node-feature-discovery-master-5f6c5bc9b7-92mbt   1/1     Running            0          21h     10.16.0.16       controller
openness      nfd-release-node-feature-discovery-worker-kxxdw              1/1     Running            206        20h     192.168.122.94   node01
openness      syslog-master-v9jsv                                          1/1     Running            0          47h     10.16.0.5        controller
openness      syslog-ng-7w5f6                                              1/1     Running            0          20h     10.16.0.18       node01
telemetry     cadvisor-m9h97                                               2/2     Running            0          20h     10.16.0.22       node01
telemetry     collectd-8mnsb                                               2/2     Running            0          20h     192.168.122.94   node01
telemetry     custom-metrics-apiserver-54699b845f-q7xgs                    1/1     Running            0          46h     10.16.0.13       controller
telemetry     grafana-6b79c984b-nz98t                                      2/2     Running            0          21h     10.16.0.17       controller
telemetry     otel-collector-7d5b75bbdf-2cjf9                              1/2     CrashLoopBackOff   243        46h     10.16.0.12       node01
telemetry     prometheus-node-exporter-vl6lz                               1/1     Running            0          20h     10.16.0.20       node01
telemetry     prometheus-server-76c96b9497-7p5ks                           3/3     Running            0          46h     10.16.0.10       controller
telemetry     telemetry-aware-scheduling-68467c4ccd-lp5fb                  2/2     Running            0          21h     10.16.0.14       controller
telemetry     telemetry-collector-certs-tzxjl                              0/1     Completed          0          46h     10.16.0.11       node01
telemetry     telemetry-node-certs-4xc4d                                   1/1     Running            0          20h     10.16.0.19       node01

[root@controller ~]# kubectl describe pod virt-operator-79c97797-qbzmk -n kubevirt
Name:          virt-operator-79c97797-qbzmk
Namespace:     kubevirt
Priority:      0
Node:          node01/192.168.122.94
Start Time:    Wed, 08 Jul 2020 14:25:48 -0400
Labels:        kubevirt.io=virt-operator
               pod-template-hash=79c97797
               prometheus.kubevirt.io=
Annotations:   ovn.kubernetes.io/allocated: true
               ovn.kubernetes.io/cidr: 10.16.0.0/16
               ovn.kubernetes.io/gateway: 10.16.0.1
               ovn.kubernetes.io/ip_address: 10.16.0.6
               ovn.kubernetes.io/logical_switch: ovn-default
               ovn.kubernetes.io/mac_address: 32:68:5e:10:00:07
               scheduler.alpha.kubernetes.io/critical-pod:
               scheduler.alpha.kubernetes.io/tolerations: [{"key":"CriticalAddonsOnly","operator":"Exists"}]
Status:        Running
IP:            10.16.0.6
IPs:
  IP:          10.16.0.6
Controlled By: ReplicaSet/virt-operator-79c97797
Containers:
  virt-operator:
    Container ID:  docker://8db17a1a831f5e88f66eb79a30b297cf992b99fe5053856563ea49545dcbb027
    Image:         index.docker.io/kubevirt/virt-operator@sha256:4537e45d8f09d52ce202d53b368f34ab6744c06c11519f5219457a339355259e
    Image ID:      docker-pullable://kubevirt/virt-operator@sha256:4537e45d8f09d52ce202d53b368f34ab6744c06c11519f5219457a339355259e
    Ports:         8443/TCP, 8444/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:       virt-operator --port 8443 -v 2
    State:         Waiting
      Reason:      CrashLoopBackOff
    Last State:    Terminated
      Reason:      Error
      Exit Code:   1
      Started:     Thu, 09 Jul 2020 11:00:17 -0400
      Finished:    Thu, 09 Jul 2020 11:00:48 -0400
    Ready:         False
    Restart Count: 222
    Readiness:     http-get https://:8443/metrics delay=5s timeout=10s period=10s #success=1 #failure=3
    Environment:
      OPERATOR_IMAGE:          index.docker.io/kubevirt/virt-operator@sha256:4537e45d8f09d52ce202d53b368f34ab6744c06c11519f5219457a339355259e
      WATCH_NAMESPACE:         (v1:metadata.annotations['olm.targetNamespaces'])
      KUBEVIRT_VERSION:        v0.26.0
      VIRT_API_SHASUM:         sha256:26f1d7c255eefa7fa56dec2923efcdafd522d15a8fee7dff956c9f96f2752f47
      VIRT_CONTROLLER_SHASUM:  sha256:1ab2afac91c890be4518bbc5cfa3d66526e2f08032648b4557b2abb86eb369a3
      VIRT_HANDLER_SHASUM:     sha256:0609eb3ea5711ae6290c178275c7d09116685851caa58a8f231277d11224e3d8
      VIRT_LAUNCHER_SHASUM:    sha256:66d6a5ce83d4340bb1c662198668081b3a1a37f39adc8ae4eb8f6c744fcae0fd
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kubevirt-operator-token-5dr4h (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kubevirt-operator-token-5dr4h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubevirt-operator-token-5dr4h
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     cmk:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From             Message
  Warning  Unhealthy  43m (x593 over 20h)    kubelet, node01  Readiness probe failed: Get https://10.16.0.6:8443/metrics: dial tcp 10.16.0.6:8443: connect: connection refused
  Warning  BackOff    3m5s (x5113 over 20h)  kubelet, node01  Back-off restarting failed container

[root@controller ~]# kubectl describe pod virt-operator-79c97797-9rfsh -n kubevirt
Name:          virt-operator-79c97797-9rfsh
Namespace:     kubevirt
Priority:      0
Node:          node01/192.168.122.94
Start Time:    Wed, 08 Jul 2020 14:25:59 -0400
Labels:        kubevirt.io=virt-operator
               pod-template-hash=79c97797
               prometheus.kubevirt.io=
Annotations:   ovn.kubernetes.io/allocated: true
               ovn.kubernetes.io/cidr: 10.16.0.0/16
               ovn.kubernetes.io/gateway: 10.16.0.1
               ovn.kubernetes.io/ip_address: 10.16.0.7
               ovn.kubernetes.io/logical_switch: ovn-default
               ovn.kubernetes.io/mac_address: 32:68:5e:10:00:08
               scheduler.alpha.kubernetes.io/critical-pod:
               scheduler.alpha.kubernetes.io/tolerations: [{"key":"CriticalAddonsOnly","operator":"Exists"}]
Status:        Running
IP:            10.16.0.7
IPs:
  IP:          10.16.0.7
Controlled By: ReplicaSet/virt-operator-79c97797
Containers:
  virt-operator:
    Container ID:  docker://0fdcb806515c8dd92bbaad4ab51b6d7a8580529815956591e6d0d1c7ab130182
    Image:         index.docker.io/kubevirt/virt-operator@sha256:4537e45d8f09d52ce202d53b368f34ab6744c06c11519f5219457a339355259e
    Image ID:      docker-pullable://kubevirt/virt-operator@sha256:4537e45d8f09d52ce202d53b368f34ab6744c06c11519f5219457a339355259e
    Ports:         8443/TCP, 8444/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:       virt-operator --port 8443 -v 2
    State:         Waiting
      Reason:      CrashLoopBackOff
    Last State:    Terminated
      Reason:      Error
      Exit Code:   1
      Started:     Thu, 09 Jul 2020 11:01:07 -0400
      Finished:    Thu, 09 Jul 2020 11:01:38 -0400
    Ready:         False
    Restart Count: 222
    Readiness:     http-get https://:8443/metrics delay=5s timeout=10s period=10s #success=1 #failure=3
    Environment:
      OPERATOR_IMAGE:          index.docker.io/kubevirt/virt-operator@sha256:4537e45d8f09d52ce202d53b368f34ab6744c06c11519f5219457a339355259e
      WATCH_NAMESPACE:         (v1:metadata.annotations['olm.targetNamespaces'])
      KUBEVIRT_VERSION:        v0.26.0
      VIRT_API_SHASUM:         sha256:26f1d7c255eefa7fa56dec2923efcdafd522d15a8fee7dff956c9f96f2752f47
      VIRT_CONTROLLER_SHASUM:  sha256:1ab2afac91c890be4518bbc5cfa3d66526e2f08032648b4557b2abb86eb369a3
      VIRT_HANDLER_SHASUM:     sha256:0609eb3ea5711ae6290c178275c7d09116685851caa58a8f231277d11224e3d8
      VIRT_LAUNCHER_SHASUM:    sha256:66d6a5ce83d4340bb1c662198668081b3a1a37f39adc8ae4eb8f6c744fcae0fd
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kubevirt-operator-token-5dr4h (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kubevirt-operator-token-5dr4h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubevirt-operator-token-5dr4h
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:     cmk:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                     From             Message
  Warning  BackOff    6m22s (x5099 over 20h)  kubelet, node01  Back-off restarting failed container
  Warning  Unhealthy  88s (x625 over 20h)     kubelet, node01  Readiness probe failed: Get https://10.16.0.7:8443/metrics: dial tcp 10.16.0.7:8443: connect: connection refused



tomaszwesolowski commented 4 years ago

Closing this issue as stale. If the problem still occurs, please open a new issue.