scottyhq opened this issue 5 years ago
This might be related... long startup times are partly due to continuous-image-puller being enabled. When new user or dask nodes are added, all of the images in profileList are pulled. I think image pulling happens after the cni errors described in this issue though, so it isn't causing the SandBox errors.
When a new user node launches, you can see two image-puller pods simultaneously pulling all the images (one in each of the staging and prod namespaces), resulting in ~10 minute startup times for nodes:
```
kubectl get pods --all-namespaces -o wide
NAMESPACE      NAME                            READY   STATUS    RESTARTS   AGE   IP               NODE                            NOMINATED NODE   READINESS GATES
nasa-prod      continuous-image-puller-mt28p   1/1     Running   0          24m   192.168.6.97     ip-192-168-20-77.ec2.internal   <none>           <none>
nasa-staging   continuous-image-puller-dmxr6   1/1     Running   0          25m   192.168.11.138   ip-192-168-20-77.ec2.internal   <none>           <none>
```
```
kubectl describe pod continuous-image-puller-mt28p -n nasa-prod
Normal Scheduled 19m default-scheduler Successfully assigned nasa-prod/continuous-image-puller-mt28p to ip-192-168-20-77.ec2.internal
Warning FailedCreatePodSandBox 19m kubelet, ip-192-168-20-77.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "23a039cde942063b50c495d258d9cfbc0e55d25dfc83f8a1b3ddf4588a9e9bbb" network for pod "continuous-image-puller-mt28p": NetworkPlugin cni failed to set up pod "continuous-image-puller-mt28p_nasa-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "23a039cde942063b50c495d258d9cfbc0e55d25dfc83f8a1b3ddf4588a9e9bbb" network for pod "continuous-image-puller-mt28p": NetworkPlugin cni failed to teardown pod "continuous-image-puller-mt28p_nasa-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
Normal SandboxChanged 17m (x9 over 19m) kubelet, ip-192-168-20-77.ec2.internal Pod sandbox changed, it will be killed and re-created.
Normal Pulling 17m kubelet, ip-192-168-20-77.ec2.internal pulling image "783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa:e76d5bf"
Normal Pulled 16m kubelet, ip-192-168-20-77.ec2.internal Successfully pulled image "783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa:e76d5bf"
Normal Started 16m kubelet, ip-192-168-20-77.ec2.internal Started container
Normal Created 16m kubelet, ip-192-168-20-77.ec2.internal Created container
Normal Pulling 16m kubelet, ip-192-168-20-77.ec2.internal pulling image "scottyhq/geohackweek2019:latest"
Normal Pulled 15m kubelet, ip-192-168-20-77.ec2.internal Successfully pulled image "scottyhq/geohackweek2019:latest"
Normal Started 15m kubelet, ip-192-168-20-77.ec2.internal Started container
Normal Created 15m kubelet, ip-192-168-20-77.ec2.internal Created container
Normal Pulling 15m kubelet, ip-192-168-20-77.ec2.internal pulling image "scottyhq/pangeo-nasa-asp:2019.07.17"
Normal Pulled 14m kubelet, ip-192-168-20-77.ec2.internal Successfully pulled image "scottyhq/pangeo-nasa-asp:2019.07.17"
Normal Created 14m kubelet, ip-192-168-20-77.ec2.internal Created container
Normal Started 14m kubelet, ip-192-168-20-77.ec2.internal Started container
Normal Pulling 14m kubelet, ip-192-168-20-77.ec2.internal pulling image "pangeo/pangeo-notebook:latest"
Normal Pulled 13m kubelet, ip-192-168-20-77.ec2.internal Successfully pulled image "pangeo/pangeo-notebook:latest"
Normal Created 12m kubelet, ip-192-168-20-77.ec2.internal Created container
Normal Started 12m kubelet, ip-192-168-20-77.ec2.internal Started container
```
On the node, `docker images` output (note the large image sizes):
```
REPOSITORY TAG IMAGE ID CREATED SIZE
pangeo/pangeo-notebook latest 36ff77c04f9c 2 days ago 5.09GB
783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa e76d5bf 303561b3aba1 11 days ago 4.77GB
pangeo/base-notebook latest e46fad17850c 13 days ago 2.47GB
scottyhq/geohackweek2019 latest 557b13d2ca9f 5 weeks ago 4.28GB
scottyhq/pangeo-nasa-asp 2019.07.17 d4784b6d51f8 7 weeks ago 4.88GB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy v1.13.7 76e6800b6822 3 months ago 80.2MB
602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni v1.5.0 e9b29049f758 3 months ago 247MB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64 3.1 9e462c010bf3 15 months ago 742kB
gcr.io/google_containers/pause 3.0 99e59f495ffa 3 years ago 747kB
```
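If pre-pulling every profileList image onto each new node turns out to be the main cost here, one thing to experiment with is turning off the continuous puller (or trimming profileList). A sketch of the relevant z2jh Helm values, assuming the `prePuller` keys haven't moved in the chart version we deploy, and noting they may need to be nested under a `jupyterhub:` block depending on how the chart is wrapped:

```yaml
# Hypothetical excerpt of the hub's Helm values (z2jh chart); verify key
# names against the deployed chart version before using.
prePuller:
  continuous:
    enabled: false   # don't pull every profileList image onto each new node
  hook:
    enabled: true    # still pre-pull images at helm upgrade time
```

The trade-off is that a user who lands on a fresh node waits for their own image pull rather than the node pulling everything up front.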
This does seem odd. I would be tempted to raise the continuous-image-puller issue upstream in z2jh.
Ideally the node shouldn't be schedulable until the image puller is done. The idea is to avoid 10 workers being created on the node at once and all pulling the image simultaneously. Pods only check for the image at schedule time, and as far as I know there is no effort to avoid duplicate pulls.
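For context on why the puller doesn't block scheduling: as far as I can tell it is just a DaemonSet whose init containers pull each image and whose main container is a pause image, roughly like the sketch below (illustrative names and commands, not the exact manifest z2jh renders):

```yaml
# Rough sketch of how a continuous image puller DaemonSet works: one init
# container per image forces the kubelet to pull it, then a tiny pause
# container keeps the pod alive so nothing gets re-pulled.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: continuous-image-puller
spec:
  selector:
    matchLabels:
      app: continuous-image-puller
  template:
    metadata:
      labels:
        app: continuous-image-puller
    spec:
      initContainers:
        - name: pull-pangeo-nasa
          image: 783380859522.dkr.ecr.us-east-1.amazonaws.com/pangeo-nasa:e76d5bf
          command: ["/bin/sh", "-c", "exit 0"]   # pulling the image is the point
        - name: pull-geohackweek2019
          image: scottyhq/geohackweek2019:latest
          command: ["/bin/sh", "-c", "exit 0"]
      containers:
        - name: pause
          image: gcr.io/google_containers/pause:3.0
```

Nothing in that shape prevents the scheduler from placing user or worker pods on the node while the init containers are still pulling, which is consistent with pods landing on a node that is still busy downloading images.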
For some time we've been noticing network-related errors on AWS hubs when scaling up from zero nodes. This hasn't been too much of a concern because things usually recover on their own. For example, when launching the first user notebook on the cluster we see these errors (and the user sees the same messages while their JupyterHub session is loading), in particular `FailedCreatePodSandBox` followed by `SandboxChanged`.

But when the first dask worker pods are launched, they get stuck in the `Pending` state and there is no recovery from the error. A current, non-ideal workaround is to shut down the first cluster in the notebook and launch a new one. Since the node is already up at that point, things seem to work.
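For reference, the workaround in notebook terms looks roughly like the sketch below, assuming the hub exposes dask clusters via dask-kubernetes' `KubeCluster` (adjust accordingly if it's something else):

```python
# Sketch of the current workaround, assuming dask_kubernetes is what the hub
# provides for launching dask clusters.
from dask_kubernetes import KubeCluster
from dask.distributed import Client

# First attempt: worker pods get stuck Pending while the new node comes up.
# KubeCluster() here relies on the hub's default worker pod template from dask config.
cluster = KubeCluster()
cluster.scale(10)

# Workaround: tear the cluster down and start a fresh one. By now the node
# (and its networking) is ready, so the new worker pods start normally.
cluster.close()

cluster = KubeCluster()
cluster.scale(10)
client = Client(cluster)
```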
@jacobtomlinson - do you possibly have some insight here?