pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

aws-node NetworkPlugin cni failures on AWS Hubs #356

Open scottyhq opened 5 years ago

scottyhq commented 5 years ago

For some time we've been noticing network-related errors on AWS hubs when scaling up from zero nodes. This hasn't been too much of a concern because things usually recover on their own. For example, when launching the first user notebook on the cluster we see the events below (the user also sees these messages while their jupyterhub session is loading). Note in particular FailedCreatePodSandBox followed by SandboxChanged:

Events:
  Type     Reason                  Age                     From                                    Message
  ----     ------                  ----                    ----                                    -------
  Normal   TriggeredScaleUp        9m26s                   cluster-autoscaler                      pod triggered scale-up: [{eksctl-pangeo-nasa-nodegroup-user-notebook-v2-NodeGroup-8VRX95CW3EOF 0->1 (max: 100)}]
  Warning  FailedScheduling        8m20s (x10 over 9m37s)  nasa-staging-user-scheduler             0/1 nodes are available: 1 node(s) didn't match node selector.
  Warning  FailedScheduling        8m (x7 over 8m18s)      nasa-staging-user-scheduler             0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) were not ready.
  Warning  FailedScheduling        7m58s                   nasa-staging-user-scheduler             0/2 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled               7m58s                   nasa-staging-user-scheduler             Successfully assigned nasa-staging/jupyter-scottyhq to ip-192-168-57-24.ec2.internal
  Warning  FailedCreatePodSandBox  7m56s                   kubelet, ip-192-168-57-24.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "8e556785fa45174bcc4e05721247f4c1cbecf8581b7763f1c34f522d3fac1266" network for pod "jupyter-scottyhq": NetworkPlugin cni failed to set up pod "jupyter-scottyhq_nasa-staging" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "8e556785fa45174bcc4e05721247f4c1cbecf8581b7763f1c34f522d3fac1266" network for pod "jupyter-scottyhq": NetworkPlugin cni failed to teardown pod "jupyter-scottyhq_nasa-staging" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
  Normal   SandboxChanged          6m34s (x8 over 7m55s)   kubelet, ip-192-168-57-24.ec2.internal  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 6m34s                   kubelet, ip-192-168-57-24.ec2.internal  pulling image "busybox"
  Normal   Pulled                  6m33s                   kubelet, ip-192-168-57-24.ec2.internal  Successfully pulled image "busybox"
  Normal   Created                 6m33s                   kubelet, ip-192-168-57-24.ec2.internal  Created container
  Normal   Started                 6m33s                   kubelet, ip-192-168-57-24.ec2.internal  Started container
  Normal   Pulling                 6m33s                   kubelet, ip-192-168-57-24.ec2.internal  pulling image "pangeo/base-notebook:latest"
  Normal   Pulled                  6m1s                    kubelet, ip-192-168-57-24.ec2.internal  Successfully pulled image "pangeo/base-notebook:latest"
  Normal   Created                 5m10s                   kubelet, ip-192-168-57-24.ec2.internal  Created container
  Normal   Started                 5m10s                   kubelet, ip-192-168-57-24.ec2.internal  Started container
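
A quick thing to check when this happens is whether the aws-node (Amazon VPC CNI) pod on the freshly launched node is Ready yet; the "connection refused" on 127.0.0.1:50051 looks like the CNI plugin failing to reach the node's local aws-node agent (ipamd) before it has finished starting. A rough check (k8s-app=aws-node is the default label on the CNI DaemonSet; the pod name is a placeholder):

# list the CNI pods and the nodes they run on
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
# if the pod on the new node is not yet Ready, look at its logs
kubectl logs -n kube-system <aws-node-pod-on-new-node>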

But when the first dask worker pods are launched, they get stuck in the Pending state and never recover from the error:

(aws) scott@pangeo-cloud-federation:kubectl describe pod dask-scottyhq-cf1fb8da-0tzpk5 -n nasa-staging

Events:
  Type     Reason                  Age                    From                                    Message
  ----     ------                  ----                   ----                                    -------
  Warning  FailedScheduling        5m1s (x2 over 5m1s)    default-scheduler                       0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) had taints that the pod didn't tolerate.
  Normal   TriggeredScaleUp        4m50s                  cluster-autoscaler                      pod triggered scale-up: [{eksctl-pangeo-nasa-nodegroup-dask-worker-v2-NodeGroup-7BQ1Z5FRVX6H 0->1 (max: 100)}]
  Warning  FailedScheduling        3m31s (x4 over 3m51s)  default-scheduler                       0/3 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 2 node(s) had taints that the pod didn't tolerate.
  Normal   Scheduled               3m31s                  default-scheduler                       Successfully assigned nasa-staging/dask-scottyhq-cf1fb8da-0tzpk5 to ip-192-168-55-40.ec2.internal
  Warning  FailedCreatePodSandBox  3m29s                  kubelet, ip-192-168-55-40.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "a12c84d5290fdfaeda86834eb552f9690ccfd24501e2e6d8091eeb596aae8b8b" network for pod "dask-scottyhq-cf1fb8da-0tzpk5": NetworkPlugin cni failed to set up pod "dask-scottyhq-cf1fb8da-0tzpk5_nasa-staging" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "a12c84d5290fdfaeda86834eb552f9690ccfd24501e2e6d8091eeb596aae8b8b" network for pod "dask-scottyhq-cf1fb8da-0tzpk5": NetworkPlugin cni failed to teardown pod "dask-scottyhq-cf1fb8da-0tzpk5_nasa-staging" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

A current, non-ideal workaround is to shut down the first cluster in the notebook and launch a new one. Since the node is already up at that point, things seem to work.
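
To confirm the new dask-worker node really is up and Ready before relaunching, a generic kubectl check works (not part of the workaround itself, just a sanity check):

# the freshly scaled-up node should show STATUS Ready
kubectl get nodes -o wide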

@jacobtomlinson - do you possibly have some insight here?

scottyhq commented 5 years ago

This might be related. Long startup times are partly due to continuous-image-puller being enabled: when new user or dask nodes are added, all of the images in profileList are pulled. I think image pulling happens after the cni errors described above though, so it isn't causing the SandBox errors.
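
If pre-pulling every image in profileList turns out not to be worth it, the continuous puller can be switched off in the chart config. A sketch only, using the standard z2jh value prePuller.continuous.enabled with placeholder release/namespace names (I think in this repo the equivalent setting would actually live under the pangeo chart's jupyterhub block in the deployment config, but the bare helm form shows the idea):

# disable the continuous image puller on an existing release
helm upgrade <release> jupyterhub/jupyterhub --namespace <namespace> --reuse-values \
  --set prePuller.continuous.enabled=false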

When a new user node launches you can see two image-puller pods simultaneously pulling all the images (one in each of the staging and prod namespaces), resulting in ~10 min startup times for nodes:

kubectl get pods --all-namespaces -o wide

nasa-prod      continuous-image-puller-mt28p         1/1     Running   0          24m   192.168.6.97     ip-192-168-20-77.ec2.internal    <none>           <none>
nasa-staging   continuous-image-puller-dmxr6         1/1     Running   0          25m   192.168.11.138   ip-192-168-20-77.ec2.internal    <none>           <none>

kubectl describe pod continuous-image-puller-mt28p -n nasa-prod

  Normal   Scheduled               19m                default-scheduler                       Successfully assigned nasa-prod/continuous-image-puller-mt28p to ip-192-168-20-77.ec2.internal
  Warning  FailedCreatePodSandBox  19m                kubelet, ip-192-168-20-77.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "23a039cde942063b50c495d258d9cfbc0e55d25dfc83f8a1b3ddf4588a9e9bbb" network for pod "continuous-image-puller-mt28p": NetworkPlugin cni failed to set up pod "continuous-image-puller-mt28p_nasa-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "23a039cde942063b50c495d258d9cfbc0e55d25dfc83f8a1b3ddf4588a9e9bbb" network for pod "continuous-image-puller-mt28p": NetworkPlugin cni failed to teardown pod "continuous-image-puller-mt28p_nasa-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
  Normal   SandboxChanged          17m (x9 over 19m)  kubelet, ip-192-168-20-77.ec2.internal  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 17m                kubelet, ip-192-168-20-77.ec2.internal  pulling image "783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa:e76d5bf"
  Normal   Pulled                  16m                kubelet, ip-192-168-20-77.ec2.internal  Successfully pulled image "783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa:e76d5bf"
  Normal   Started                 16m                kubelet, ip-192-168-20-77.ec2.internal  Started container
  Normal   Created                 16m                kubelet, ip-192-168-20-77.ec2.internal  Created container
  Normal   Pulling                 16m                kubelet, ip-192-168-20-77.ec2.internal  pulling image "scottyhq/geohackweek2019:latest"
  Normal   Pulled                  15m                kubelet, ip-192-168-20-77.ec2.internal  Successfully pulled image "scottyhq/geohackweek2019:latest"
  Normal   Started                 15m                kubelet, ip-192-168-20-77.ec2.internal  Started container
  Normal   Created                 15m                kubelet, ip-192-168-20-77.ec2.internal  Created container
  Normal   Pulling                 15m                kubelet, ip-192-168-20-77.ec2.internal  pulling image "scottyhq/pangeo-nasa-asp:2019.07.17"
  Normal   Pulled                  14m                kubelet, ip-192-168-20-77.ec2.internal  Successfully pulled image "scottyhq/pangeo-nasa-asp:2019.07.17"
  Normal   Created                 14m                kubelet, ip-192-168-20-77.ec2.internal  Created container
  Normal   Started                 14m                kubelet, ip-192-168-20-77.ec2.internal  Started container
  Normal   Pulling                 14m                kubelet, ip-192-168-20-77.ec2.internal  pulling image "pangeo/pangeo-notebook:latest"
  Normal   Pulled                  13m                kubelet, ip-192-168-20-77.ec2.internal  Successfully pulled image "pangeo/pangeo-notebook:latest"
  Normal   Created                 12m                kubelet, ip-192-168-20-77.ec2.internal  Created container
  Normal   Started                 12m                kubelet, ip-192-168-20-77.ec2.internal  Started container

On the node, the docker images output (note the large image sizes):

REPOSITORY                                                     TAG                 IMAGE ID            CREATED             SIZE
pangeo/pangeo-notebook                                         latest              36ff77c04f9c        2 days ago          5.09GB
783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa       e76d5bf             303561b3aba1        11 days ago         4.77GB
pangeo/base-notebook                                           latest              e46fad17850c        13 days ago         2.47GB
scottyhq/geohackweek2019                                       latest              557b13d2ca9f        5 weeks ago         4.28GB
scottyhq/pangeo-nasa-asp                                       2019.07.17          d4784b6d51f8        7 weeks ago         4.88GB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy    v1.13.7             76e6800b6822        3 months ago        80.2MB
602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni    v1.5.0              e9b29049f758        3 months ago        247MB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64   3.1                 9e462c010bf3        15 months ago       742kB
gcr.io/google_containers/pause                                 3.0                 99e59f495ffa        3 years ago         747kB
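
The five notebook images alone add up to roughly 21 GB, which is consistent with the ~10 min node startup when they are all pulled. Total image disk usage on a node can be summarized with a standard docker command (not from the original logs):

# summary of image/container/volume disk usage on the node
docker system df
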
jacobtomlinson commented 5 years ago

This does seem odd. I would be tempted to raise the continuous-image-puller issue upstream in z2jh.

Ideally the node shouldn't be schedulable until the image puller is done. The idea is to avoid 10 workers being created on the node at once and all pulling the image at the same time. Pods only check for the image at schedule time, and as far as I know there is no effort to avoid duplicate pulls.
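
A rough sketch of that idea with a manual taint (hypothetical taint key, and not how the z2jh pre-puller actually works; the puller pod itself would need a matching toleration to land on the node):

# keep user/dask pods off the node while images are pre-pulled
kubectl taint nodes <new-node-name> prepull=pending:NoSchedule
# once the image-puller pod on that node has finished, make it schedulable again
kubectl taint nodes <new-node-name> prepull=pending:NoSchedule-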