Open jmoortgat opened 10 months ago
There isn't a status page.
Which environment type are you trying to get? The GPU environments are often up against our quota. I did just successfully start a Python environment.
Note, there was a warning:
2023-08-28T15:43:06Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-taugspurger-40microsoft-2ecom], unattached volumes=[volume-taugspurger-40microsoft-2ecom user-etc-singleuser dshm]: timed out waiting for the condition
I'll look into that when I have time, but that's saying the thing providing the storage wasn't ready when we asked for it. Kubernetes will retry, though, and eventually succeed.
Another common failure scenario is a full home directory, in which case jupyterlab isn't able to start (since it tries to write a file in the home directory). However, I doubt that you and all your colleagues hit that same error.
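If a full home directory is the suspect, a rough check from a hub terminal (once a server does start) might look like the sketch below. The paths assume the usual JupyterHub layout, with the user volume mounted at `~`, which I haven't verified on this particular deployment:

```shell
# Rough sketch: check how full the home directory is (paths are assumptions).
df -h "$HOME"                      # free space on the filesystem backing ~
du -sh "$HOME" 2>/dev/null         # total size of the home directory
# largest items, to find candidates for deletion
du -sh "$HOME"/* 2>/dev/null | sort -rh | head -n 10 || true
```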
I do actually need a GPU environment, but I can't even get a CPU one. Log from last attempt:
Your server is starting up.
You will be redirected automatically when it's ready for you.
100% Complete Spawn failed: Timeout
Event log
Server requested
2023-08-28T16:06:08.937460Z [Warning] 0/111 nodes are available: 108 Insufficient memory, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 76 Insufficient cpu.
2023-08-28T16:06:08.963353Z [Normal] Successfully assigned prod/jupyter-kingsmountainlion-40gmail-2ecom to aks-user-37927680-vmss0003yg
2023-08-28T16:12:46Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[user-etc-singleuser dshm volume-kingsmountainlion-40gmail-2ecom]: timed out waiting for the condition
2023-08-28T16:17:22Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser dshm]: timed out waiting for the condition
2023-08-28T16:19:36Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[dshm volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser]: timed out waiting for the condition
2023-08-28T16:20:40Z [Warning] AttachVolume.Attach failed for volume "pvc-6b12ef2d-0996-4dda-9510-55f045182cda" : timed out waiting for external-attacher of disk.csi.azure.com CSI driver to attach volume /subscriptions/9da7523a-cb61-4c3e-b1d4-afa5fc6d2da9/resourceGroups/mc_pcc-prod-2-rg_pcc-prod-2-cluster_westeurope/providers/Microsoft.Compute/disks/pvc-6b12ef2d-0996-4dda-9510-55f045182cda
Spawn failed: Timeout
I don't think I should be out of disk space.
It would even be helpful if there was a way to simply access the files on an account, i.e. even without any compute node attached.
Can you try with a non-GPU server, to see if that starts successfully? The GPU nodes do take longer to come up. Maybe I need to increase their timeout :/
> It would even be helpful if there was a way to simply access the files on an account, i.e. even without any compute node attached.
I'm not sure if JupyterHub (which is what this deployment uses) supports that. Do you know?
You might want to store your files somewhere more permanent, like a GitHub repository, and clone that when you start your notebook server.
The above is for a CPU server. That's what I was trying to say. Even though I need GPU for computations, I cannot get any type of node just to access files (the above was an attempt to get a CPU node).
Tried again for CPU node, with same result:
Your server is starting up.
You will be redirected automatically when it's ready for you.
100% Complete Spawn failed: Timeout
Event log
Server requested
2023-08-28T16:31:53.097284Z [Normal] Successfully assigned prod/jupyter-kingsmountainlion-40gmail-2ecom to aks-user-37927680-vmss0003gj
2023-08-28T16:36:11Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[user-etc-singleuser dshm volume-kingsmountainlion-40gmail-2ecom]: timed out waiting for the condition
2023-08-28T16:38:25Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser dshm]: timed out waiting for the condition
2023-08-28T16:39:20Z [Warning] AttachVolume.Attach failed for volume "pvc-6b12ef2d-0996-4dda-9510-55f045182cda" : timed out waiting for external-attacher of disk.csi.azure.com CSI driver to attach volume /subscriptions/9da7523a-cb61-4c3e-b1d4-afa5fc6d2da9/resourceGroups/mc_pcc-prod-2-rg_pcc-prod-2-cluster_westeurope/providers/Microsoft.Compute/disks/pvc-6b12ef2d-0996-4dda-9510-55f045182cda
2023-08-28T16:47:29Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[dshm volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser]: timed out waiting for the condition
Spawn failed: Timeout
One of those failures might have been my fault. I was checking the capacity of your volume at the same time, and only one Pod can attach to a volume at a time.
But it looks like it might have succeeded this time?
(Edited the issue title to reflect that this is just an issue with the PC Hub).
Thanks Tom. Yes, I was finally able to connect to a CPU node after the earlier couple of time-out failures. I'm afraid I may run into a similar issue more broadly, though, and don't quite understand what you mean by attaching pods to volumes.
As I mentioned, I really need a GPU node, so I disconnected the above CPU connection, hit the 'Stop Server' button, hit the log-out button, and then tried from scratch to get a GPU node again. But I'm getting similar errors: Your server is starting up.
You will be redirected automatically when it's ready for you.
88% Complete
2023-08-28T17:57:11Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[dshm volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser]: timed out waiting for the condition
Event log
Server requested
2023-08-28T17:42:30.635221Z [Warning] 0/110 nodes are available: 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
2023-08-28T17:42:30.665402Z [Warning] 0/110 nodes are available: 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
2023-08-28T17:42:43Z [Normal] pod triggered scale-up: [{aks-gpuuser-16178637-vmss 0->1 (max: 25)}]
2023-08-28T17:43:02.428799Z [Warning] 0/110 nodes are available: 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
2023-08-28T17:48:52.918466Z [Warning] 0/111 nodes are available: 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
2023-08-28T17:49:23.444818Z [Warning] 0/111 nodes are available: 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
2023-08-28T17:50:07Z [Normal] pod triggered scale-up: [{aks-gpuuser-16178637-vmss 1->2 (max: 25)}]
2023-08-28T17:55:08.245541Z [Normal] Successfully assigned prod/jupyter-kingsmountainlion-40gmail-2ecom to aks-gpuuser-16178637-vmss0008hu
2023-08-28T17:56:09Z [Warning] Multi-Attach error for volume "pvc-6b12ef2d-0996-4dda-9510-55f045182cda" Volume is already exclusively attached to one node and can't be attached to another
2023-08-28T17:57:11Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[dshm volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser]: timed out waiting for the condition
Why is it saying "Volume is already exclusively attached to one node and can't be attached to another"? Is it possible for a spawn to fail and time out, but somehow leave the volume attached, which then causes the next attempt to get a node to fail?
Yeah, the GPU thing I'll look into later when I get a chance. Most likely, it's just the GPU nodes taking longer to start, but I'd like to understand why that's happening.
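For what it's worth, Azure disk volumes are ReadWriteOnce, so if the node from a previous server hasn't released the disk yet, a new pod scheduled onto a different node will hit exactly that Multi-Attach error until the detach completes. A hedged sketch of how an admin might check (the PVC ID is copied from the event logs above; kubectl access to the cluster is assumed):

```shell
# Sketch of an admin-side check (assumes kubectl access to the cluster;
# the PVC ID below is copied from the event log in this thread).
PV="pvc-6b12ef2d-0996-4dda-9510-55f045182cda"
# VolumeAttachment objects record which node currently holds each disk.
kubectl get volumeattachment 2>/dev/null | grep "$PV" || true
# Events in the hub namespace repeat the Multi-Attach / FailedAttachVolume detail.
kubectl -n prod get events 2>/dev/null | grep -i "multi-attach" || true
```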
BTW, you can use your own compute in your own Azure subscription if this is blocking you.
Thanks Tom. I'd understand if the GPU nodes are over-subscribed, but I don't know how to interpret the logs to tell what the issue is. The main error below seems to be a 'multi-attach' issue, just like the one I hit for the CPU node before:
Spawn failed: Timeout
Event log
Server requested
2023-08-28T18:08:04.222833Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 100 Insufficient memory, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu.
2023-08-28T18:08:04.253862Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 100 Insufficient memory, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu.
2023-08-28T18:08:13Z [Normal] pod triggered scale-up: [{aks-gpuuser-16178637-vmss 1->2 (max: 25)}]
2023-08-28T18:08:42Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 100 Insufficient memory, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu.
2023-08-28T18:14:33.114875Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu, 98 Insufficient memory.
2023-08-28T18:14:39.176559Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu, 98 Insufficient memory.
2023-08-28T18:15:04.645980Z [Warning] 0/116 nodes are available: 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu, 98 Insufficient memory.
2023-08-28T18:15:06.650951Z [Warning] 0/117 nodes are available: 114 Insufficient nvidia.com/gpu, 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 70 Insufficient cpu, 98 Insufficient memory.
2023-08-28T18:15:09.628602Z [Warning] 0/117 nodes are available: 114 Insufficient nvidia.com/gpu, 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 70 Insufficient cpu, 98 Insufficient memory.
2023-08-28T18:20:47.355252Z [Normal] Successfully assigned prod/jupyter-kingsmountainlion-40gmail-2ecom to aks-gpuuser-16178637-vmss0008hx
2023-08-28T18:20:47Z [Warning] Multi-Attach error for volume "pvc-6b12ef2d-0996-4dda-9510-55f045182cda" Volume is already exclusively attached to one node and can't be attached to another
2023-08-28T18:22:50Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[user-etc-singleuser dshm volume-kingsmountainlion-40gmail-2ecom]: timed out waiting for the condition
Spawn failed: Timeout
I looked into this a bit over the past couple days, but don't have a ton to report.
The most common occurrence seems to be something like:
Normal Scheduled 2m28s dhub-prod-user-scheduler Successfully assigned prod/jupyter-taugspurger-40microsoft-2ecom to aks-user-37927680-vmss00040u
Warning FailedAttachVolume 28s attachdetach-controller AttachVolume.Attach failed for volume "pvc-c0ca6639-cb48-4bd6-8f15-50df54eef04b" : timed out waiting for external-attacher of disk.csi.azure.com CSI driver to attach volume /subscriptions/9da7523a-cb61-4c3e-b1d4-afa5fc6d2da9/resourceGroups/MC_pcc-prod-2-rg_pcc-prod-2-cluster_westeurope/providers/Microsoft.Compute/disks/restore-77f43900-b1f0-458f-a4c5-3201f9549675
Warning FailedMount 25s kubelet Unable to attach or mount volumes: unmounted volumes=[volume-taugspurger-40microsoft-2ecom], unattached volumes=[dshm volume-taugspurger-40microsoft-2ecom user-etc-singleuser]: timed out waiting for the condition
Normal SuccessfulAttachVolume 7s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-c0ca6639-cb48-4bd6-8f15-50df54eef04b"
In prose:
- `FailedAttachVolume` indicates that something timed out when attaching the volume with my home directory (I think it's being attached to a node, but not 100% sure; maybe a pod).
- `FailedMount` from the kubelet is, I think, just another log message indicating the same thing: we couldn't attach or mount the volume to the pod.
- `SuccessfulAttachVolume` indicates that when Kubernetes retried the volume attachment, things went fine.

And eventually my pod started.
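As a rough triage aid, the flattened spawn event log can be split back into per-event lines and scanned for the failure signatures above. This is just a sketch of log parsing; `parse_events` and `failed_events` are hypothetical helpers, not part of any Planetary Computer or JupyterHub tooling:

```python
import re

# Hypothetical helper: split a flattened spawner event log into
# (timestamp, level, message) tuples and flag known volume failures.
EVENT_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T[\d:.]+Z)\s+\[(Warning|Normal)\]\s+"
    r"(.*?)(?=\d{4}-\d{2}-\d{2}T|\Z)",
    re.S,
)
FAILURE_SIGNS = (
    "Multi-Attach error",
    "AttachVolume.Attach failed",
    "Unable to attach or mount volumes",
)

def parse_events(log: str) -> list[tuple[str, str, str]]:
    """Return (timestamp, level, message) for each event in the log."""
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in EVENT_RE.finditer(log)]

def failed_events(log: str) -> list[tuple[str, str, str]]:
    """Keep only events matching a known volume-failure signature."""
    return [ev for ev in parse_events(log)
            if any(sig in ev[2] for sig in FAILURE_SIGNS)]

sample = (
    '2023-08-28T17:55:08.245541Z [Normal] Successfully assigned '
    'prod/jupyter-example to aks-gpuuser-16178637-vmss0008hu '
    '2023-08-28T17:56:09Z [Warning] Multi-Attach error for volume '
    '"pvc-6b12ef2d-0996-4dda-9510-55f045182cda" Volume is already '
    "exclusively attached to one node and can't be attached to another"
)

for ts, level, msg in failed_events(sample):
    print(ts, level, msg[:60])
```

Pasting a whole event log into `failed_events` gives the failing events and their timestamps at a glance, which makes the attach-retry intervals easier to see.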
This isn't how things always go. Especially when you need a GPU, we're going to be autoscaling the cluster, and perhaps there's another opportunity for something to go wrong there. And I've seen cases where the retries don't succeed and the notebook spawn eventually times out. I don't (yet) know enough about Kubernetes & storage to say what's going on.
Next steps would be to get some more logs from the attachdetach-controller to see what's going on around the time the initial attach request timed out. But I haven't found which pod is generating those events.
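For reference, a hedged sketch of where those logs might live: the attach/detach controller itself is part of AKS's managed kube-controller-manager (so not directly reachable), but the external-attacher named in the error typically runs as a sidecar of the Azure Disk CSI controller pods in kube-system. The label and container names below are assumptions to verify, not something confirmed on this cluster:

```shell
# Sketch (assumes kubectl access; label/container names are assumptions).
NS=kube-system
kubectl -n "$NS" get pods 2>/dev/null | grep csi-azuredisk || true
# Logs from the external-attacher sidecar around the failed attach.
kubectl -n "$NS" logs -l app=csi-azuredisk-controller \
  -c csi-attacher --tail=200 2>/dev/null || true
```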
Myself and a dozen colleagues have tried non-stop to get any type of Hub node for over a week without success. Is there any website that shows the status of PC Hub nodes? Also, is there some way to still access files when it is impossible to get an interactive compute node?