nre-learning / nrelabs-curriculum

Learn next-generation skills for network engineers, all in your browser.
https://nrelabs.io
Apache License 2.0

"container init exited prematurely" also showing up in vqfx snapshot #274

Open Mierdin opened 5 years ago

Mierdin commented 5 years ago

The "container init exited prematurely" error seemed to occur intermittently and only on the container-vqfx image, but it looks like it happens on the snapshot image too.

kubectl describe pods -n=15-5uj8zl2e2b2copns-ns vqfx2
Name:         vqfx2
Namespace:    15-5uj8zl2e2b2copns-ns
Priority:     0
Node:         antidote-worker-3
Start Time:   Fri, 04 Oct 2019 13:35:34 -0700
Labels:       lessonId=15
              podName=vqfx2
              syringeManaged=yes
Annotations:  k8s.v1.cni.cncf.io/networks: [{"name":"vqfx1-vqfx2-net"},{"name":"vqfx2-vqfx3-net"}]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "ips": [
                        "192.168.241.21"
                    ],
                    "default": true,
                    "dns": {}
                },{
                    "name": "15-5uj8zl2e2b2copns-ns-vqfx1-vqfx2-net",
                    "ips": [
                        "10.10.0.6"
                    ],
                    "dns": {}
                },{
                    "name": "15-5uj8zl2e2b2copns-ns-vqfx2-vqfx3-net",
                    "ips": [
                        "10.10.0.6"
                    ],
                    "dns": {}
                }]
Status:       Running
IP:           192.168.241.21
Init Containers:
  git-clone:
    Container ID:  docker://4a1841c61177c05096168735d9b87108beb3dd47c032b46ccfa7f4c144496832
    Image:         antidotelabs/githelper:v0.4.0
    Image ID:      docker-pullable://docker.io/antidotelabs/githelper@sha256:2edfc05da9e8ceca17bab6c37ced1a064f057c446e238500245d23ab295de1f1
    Port:          <none>
    Host Port:     <none>
    Args:
      https://github.com/nre-learning/nrelabs-curriculum.git
      master
      /antidote
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 04 Oct 2019 13:36:17 -0700
      Finished:     Fri, 04 Oct 2019 13:36:21 -0700
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /antidote from git-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mhh9d (ro)
Containers:
  vqfx2:
    Container ID:  docker://d63be1651d74a170a7a8e3e71d8c81335aa6976623712672146269f9bd81754d
    Image:         antidotelabs/vqfx-snap2:v1.0.0
    Image ID:      docker-pullable://docker.io/antidotelabs/vqfx-snap2@sha256:bc96ed79cf00b1dfe5958443af8033493796cf0c66d78b5d559063753d3e8ad5
    Port:          22/TCP
    Host Port:     0/TCP
    State:         Waiting
      Reason:      CrashLoopBackOff
    Last State:    Terminated
      Reason:      ContainerCannotRun
      Message:     oci runtime error: container_linux.go:235: starting container process caused "container init exited prematurely"

      Exit Code:    128
      Started:      Fri, 04 Oct 2019 13:47:22 -0700
      Finished:     Fri, 04 Oct 2019 13:47:22 -0700
    Ready:          False
    Restart Count:  7
    Environment:
      SYRINGE_FULL_REF:  15-5uj8zl2e2b2copns-ns-vqfx2
    Mounts:
      /antidote from git-volume (rw,path="lessons/tools/lesson-15-stackstorm")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mhh9d (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  git-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  default-token-mhh9d:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mhh9d
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason          Age                  From                        Message
  ----     ------          ----                 ----                        -------
  Normal   Scheduled       16m                  default-scheduler           Successfully assigned 15-5uj8zl2e2b2copns-ns/vqfx2 to antidote-worker-3
  Normal   Started         16m                  kubelet, antidote-worker-3  Started container vqfx2
  Normal   SandboxChanged  16m                  kubelet, antidote-worker-3  Pod sandbox changed, it will be killed and re-created.
  Normal   Killing         16m                  kubelet, antidote-worker-3  Stopping container vqfx2
  Normal   Pulled          16m (x2 over 16m)    kubelet, antidote-worker-3  Container image "antidotelabs/githelper:v0.4.0" already present on machine
  Normal   Created         16m (x2 over 16m)    kubelet, antidote-worker-3  Created container git-clone
  Normal   Started         16m (x2 over 16m)    kubelet, antidote-worker-3  Started container git-clone
  Normal   Pulling         15m (x4 over 16m)    kubelet, antidote-worker-3  Pulling image "antidotelabs/vqfx-snap2:v1.0.0"
  Normal   Pulled          15m (x4 over 16m)    kubelet, antidote-worker-3  Successfully pulled image "antidotelabs/vqfx-snap2:v1.0.0"
  Normal   Created         15m (x4 over 16m)    kubelet, antidote-worker-3  Created container vqfx2
  Warning  Failed          15m (x3 over 16m)    kubelet, antidote-worker-3  Error: failed to start container "vqfx2": Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused "container init exited prematurely"
  Warning  BackOff         113s (x63 over 15m)  kubelet, antidote-worker-3  Back-off restarting failed container
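
For pulling just the failure details out of a pod like the one above, here is a minimal sketch in Python. It assumes `kubectl` is on the PATH and reads the `status.containerStatuses[].lastState.terminated` fields of the pod JSON. The helper names `failed_containers` and `fetch_pod` are made up for illustration and are not part of Syringe or Antidote.

```python
import json
import subprocess


def failed_containers(pod):
    """Return (name, reason, exit_code, message) for each container whose
    previous run terminated with a non-zero exit code."""
    results = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            results.append((
                status.get("name"),
                terminated.get("reason"),
                terminated.get("exitCode"),
                terminated.get("message", "").strip(),
            ))
    return results


def fetch_pod(namespace, name):
    """Fetch one pod as a dict via `kubectl get pod -o json` (hypothetical helper)."""
    completed = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)
```

Calling `failed_containers(fetch_pod("15-5uj8zl2e2b2copns-ns", "vqfx2"))` against the pod above should surface the `ContainerCannotRun` reason and exit code 128 without the rest of the describe output.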

Ideas for fixing:

Possibly useful output from the kubelet (captured on the full vqfx image, not this one; it might be related though, or worst case we should get the same when testing this image):

kubelet[10873]: E0829 23:13:26.374746   10873 pod_workers.go:190] Error syncing pod 88310250-caaf-11e9-8781-0cc47ae547a8 ("vqfx2_12-l0yh5ozt7urx8ran-ns(88310250-caaf-11e9-8781-0cc47ae547a8)"), skipping: failed to "StartContainer" for "vqfx2" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=vqfx2 pod=vqfx2_12-l0yh5ozt7urx8ran-ns(88310250-caaf-11e9-8781-0cc47ae547a8)"
Aug 29 23:13:26 antidote-worker-1 dockerd-current[12144]: DEBU: 2019/08/29 23:13:26.395575 EVENT UpdatePod {"metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"[{\"name\":\"vqfx1-vqfx2-net\"},{\"name\":\"vqfx2-vqfx3-net\"}]","k8s.v1.cni.cncf.io/networks-status":"[{\n    \"name\": \"\",\n    \"ips\": [\n        \"192.168.67.69\"\n    ],\n    \"default\": true,\n    \"dns\": {}\n},{\n    \"name\": \"12-l0yh5ozt7urx8ran-ns-vqfx1-vqfx2-net\",\n    \"ips\": [\n        \"10.10.0.4\"\n    ],\n    \"dns\": {}\n},{\n    \"name\": \"12-l0yh5ozt7urx8ran-ns-vqfx2-vqfx3-net\",\n    \"ips\": [\n        \"10.10.0.4\"\n    ],\n    \"dns\": {}\n}]"},"creationTimestamp":"2019-08-29T22:51:26Z","labels":{"lessonId":"12","podName":"vqfx2","syringeManaged":"yes"},"name":"vqfx2","namespace":"12-l0yh5ozt7urx8ran-ns","resourceVersion":"4358037","selfLink":"/api/v1/namespaces/12-l0yh5ozt7urx8ran-ns/pods/vqfx2","uid":"88310250-caaf-11e9-8781-0cc47ae547a8"},"spec":{"affinity":{"podAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchLabels":{"lessonId":"12","syringeManaged":"yes"}},"namespaces":["12-l0yh5ozt7urx8ran-ns"],"topologyKey":"kubernetes.io/hostname"}]}},"containers":[{"image":"antidotelabs/vqfx-snap2:v1.0.0","imagePullPolicy":"Always","name":"vqfx2","ports":[{"containerPort":22,"protocol":"TCP"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","initContainers":[{"args":["https://github.com/nre-learning/nrelabs-curriculum.git","master","/antidote"],"image":"antidotelabs/githelper:v0.4.0","imagePullPolicy":"IfNotPresent","name":"git-clone","resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","volumeMounts":[{"mountPath":"/antidote","name":"git-volume"},{"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount","name":"default-token-2smvx","readOnly":true}]}],"nodeName":"antidote-worker-1","priority":0,"restartPolicy":"Always","schedulerName"
:"default-scheduler","securityContext":{},"serviceAccount":"default","serviceAccountName":"default","terminationGracePeriodSeconds":30},"status":{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2019-08-29T22:52:14Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2019-08-29T22:52:09Z","message":"containers with unready status: [vqfx2]","reason":"ContainersNotReady","status":"False","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":"2019-08-29T22:52:09Z","message":"containers with unready status: [vqfx2]","reason":"ContainersNotReady","status":"False","type":"ContainersReady"},{"lastProbeTime":null,"lastTransitionTime":"2019-08-29T22:51:26Z","status":"True","type":"PodScheduled"}],"hostIP":"147.75.88.205","initContainerStatuses":[{"containerID":"docker://21ede9b83b8ae058fb4f49ca26ced17110d980073fd7560ab16f91e7d81ba942","image":"docker.io/antidotelabs/githelper:v0.4.0","imageID":"docker-pullable://docker.io/antidotelabs/githelper@sha256:2edfc05da9e8ceca17bab6c37ced1a064f057c446e238500245d23ab295de1f1","lastState":{},"name":"git-clone","ready":true,"restartCount":1,"state":{"terminated":{"containerID":"docker://21ede9b83b8ae058fb4f49ca26ced17110d980073fd7560ab16f91e7d81ba942","exitCode":0,"finishedAt":"2019-08-29T22:52:14Z","reason":"Completed","startedAt":"2019-08-29T22:52:08Z"}}}],"phase":"Running","podIP":"192.168.67.69","qosClass":"BestEffort","startTime":"2019-08-29T22:51:26Z"}} {"metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"[{\"name\":\"vqfx1-vqfx2-net\"},{\"name\":\"vqfx2-vqfx3-net\"}]","k8s.v1.cni.cncf.io/networks-status":"[{\n    \"name\": \"\",\n    \"ips\": [\n        \"192.168.67.69\"\n    ],\n    \"default\": true,\n    \"dns\": {}\n},{\n    \"name\": \"12-l0yh5ozt7urx8ran-ns-vqfx1-vqfx2-net\",\n    \"ips\": [\n        \"10.10.0.4\"\n    ],\n    \"dns\": {}\n},{\n    \"name\": \"12-l0yh5ozt7urx8ran-ns-vqfx2-vqfx3-net\",\n    \"ips\": [\n        \"10.10.0.4\"\n    ],\n    \"dns\": {}\n}]"},"creationTimestamp":"2019-08-29T22:51:26Z","labels":{"lessonId":"12","podName":"vqfx2","syringeManaged":"yes"},"name":"vqfx2","namespace":"12-l0yh5ozt7u

cloudtoad commented 5 years ago

These are the combined recommendations for vSRX and vMX:

  1. Disable Transparent Huge Pages (THP)
  2. Disable Kernel Samepage Merging (KSM)
  3. Disable Page Modification Logging (PML)
  4. Disable APICv
  5. Enable nested virtualization
  6. Enable 1G of huge pages

I believe this particular problem is related to the 1G of huge pages, but I'm not 100% certain, since I configured all of these recommendations at once. I am not sure which of these (if any) must also be configured in the container. I can do some research on that and update later, but I think none of them need to be configured in individual containers. They are all kernel configuration options, so they only need to be configured on the host.
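
To see what a host currently has reserved, here is a rough sketch in Python that parses the huge page counters from `/proc/meminfo`. It assumes "1G" means 1 GiB of reserved huge pages in total rather than a 1 GiB page size, which the recommendation above does not spell out. The function names are illustrative only.

```python
def parse_hugepages(meminfo_text):
    """Extract the huge page counters from the text of /proc/meminfo.

    HugePages_Total and HugePages_Free are page counts; Hugepagesize is in kB.
    """
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted and rest.split():
            fields[key] = int(rest.split()[0])
    return fields


def reserved_hugepage_kib(fields):
    """Total memory reserved for huge pages, in KiB."""
    return fields.get("HugePages_Total", 0) * fields.get("Hugepagesize", 0)


def read_host_hugepages(path="/proc/meminfo"):
    """Read and parse the counters on the current host (Linux only)."""
    with open(path) as handle:
        return parse_hugepages(handle.read())
```

On a worker node, `reserved_hugepage_kib(read_host_hugepages())` returning 1048576 would correspond to 1 GiB reserved under the assumption above.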

Numbers 1-3 are memory-management optimization techniques: ways of organizing and deduplicating memory to reduce the memory footprint or speed up reads and writes.
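
As a sketch of how the THP and KSM state could be checked on a host, the Python below reads the standard sysfs files `/sys/kernel/mm/transparent_hugepage/enabled` and `/sys/kernel/mm/ksm/run`. The helper names are hypothetical, and the files may be absent on kernels built without these features, so a missing file is reported as `None`.

```python
import re
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")
KSM_PATH = Path("/sys/kernel/mm/ksm/run")


def active_thp_mode(text):
    """The THP file lists all modes and brackets the active one,
    e.g. 'always madvise [never]'. Return the bracketed mode."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def ksm_running(text):
    """KSM is actively merging pages only when the run file contains 1."""
    return text.strip() == "1"


def read_or_none(path):
    """Return the file contents, or None if the file cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return None


def memory_report():
    """Summarise THP and KSM state on the current host."""
    thp = read_or_none(THP_PATH)
    ksm = read_or_none(KSM_PATH)
    return {
        "thp_mode": active_thp_mode(thp) if thp is not None else None,
        "ksm_running": ksm_running(ksm) if ksm is not None else None,
    }
```

A host that follows recommendations 1 and 2 would be expected to report `thp_mode` as `never` and `ksm_running` as `False`.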

On number 5, while virtualization might be enabled in the BIOS, nested virtualization requires a further configuration step in Linux. Some Linux distros have this enabled already.
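
A quick way to check the KVM side of the list (nested virtualization, APICv, PML) is to read the module parameters under `/sys/module/kvm_intel/parameters`. The Python sketch below assumes an Intel host with the `kvm_intel` module loaded; AMD hosts use `kvm_amd` with different parameter names. The `WANTED` table and function names are illustrative, not taken from any Juniper or Antidote tooling.

```python
from pathlib import Path

# Assumption: Intel host. Maps each kvm_intel parameter to the desired state
# from the recommendations (True = enabled, False = disabled).
PARAM_DIR = Path("/sys/module/kvm_intel/parameters")
WANTED = {"nested": True, "enable_apicv": False, "pml": False}


def param_enabled(text):
    """Module parameters read back as Y/N (or 1/0 on some kernels)."""
    return text.strip().upper() in {"Y", "1"}


def check_params(param_dir=PARAM_DIR, wanted=WANTED):
    """Return {name: True/False/None}: True if the parameter matches the
    desired state, False if it does not, None if it cannot be read."""
    report = {}
    for name, want in wanted.items():
        try:
            current = param_enabled((param_dir / name).read_text())
        except OSError:
            report[name] = None
            continue
        report[name] = current == want
    return report
```

Any `False` in the result of `check_params()` points at a setting that differs from the list; `None` usually means the module is not loaded or the parameter does not exist on that kernel.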