microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

SpawnTimeout on Planetary Computer Hub #261

Open jmoortgat opened 10 months ago

jmoortgat commented 10 months ago

A dozen colleagues and I have tried non-stop to get any type of Hub node for over a week, without success. Is there a website that shows the status of PC Hub nodes? Also, is there some way to access files when it is impossible to get an interactive compute node?

TomAugspurger commented 10 months ago

(edited the issue title)

There isn't a status page.

Which environment type are you trying to get? The GPU environments are often up against our quota. I did just successfully start a Python environment.

Note, there was a warning:

2023-08-28T15:43:06Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-taugspurger-40microsoft-2ecom], unattached volumes=[volume-taugspurger-40microsoft-2ecom user-etc-singleuser dshm]: timed out waiting for the condition

I'll look into that when I have time, but it's saying that the thing providing the storage wasn't ready when we asked for it. Kubernetes will retry, though, and eventually succeed.

Another common failure scenario is a full home directory, in which case JupyterLab isn't able to start (since it tries to write a file in the home directory). However, I doubt that you and all your colleagues hit that same error.
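
If a server does eventually start, a quick way to check whether the home-directory volume is full is something like this (a minimal sketch; run it in a notebook cell or terminal on the Hub):

  # Minimal sketch: report usage of the filesystem backing the home directory.
  # Assumes the user volume is mounted at the home directory; adjust the path
  # if your setup differs.
  import shutil
  from pathlib import Path

  usage = shutil.disk_usage(Path.home())
  print(f"used {usage.used / 1e9:.2f} GB of {usage.total / 1e9:.2f} GB "
        f"({usage.free / 1e9:.2f} GB free)")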

jmoortgat commented 10 months ago

I do actually need a GPU environment, but I can't even get a CPU one. Log from last attempt:

Your server is starting up.

You will be redirected automatically when it's ready for you.

100% Complete
Spawn failed: Timeout

Event log
  Server requested
  2023-08-28T16:06:08.937460Z [Warning] 0/111 nodes are available: 108 Insufficient memory, 2 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 76 Insufficient cpu.
  2023-08-28T16:06:08.963353Z [Normal] Successfully assigned prod/jupyter-kingsmountainlion-40gmail-2ecom to aks-user-37927680-vmss0003yg
  2023-08-28T16:12:46Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[user-etc-singleuser dshm volume-kingsmountainlion-40gmail-2ecom]: timed out waiting for the condition
  2023-08-28T16:17:22Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser dshm]: timed out waiting for the condition
  2023-08-28T16:19:36Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[dshm volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser]: timed out waiting for the condition
  2023-08-28T16:20:40Z [Warning] AttachVolume.Attach failed for volume "pvc-6b12ef2d-0996-4dda-9510-55f045182cda" : timed out waiting for external-attacher of disk.csi.azure.com CSI driver to attach volume /subscriptions/9da7523a-cb61-4c3e-b1d4-afa5fc6d2da9/resourceGroups/mc_pcc-prod-2-rg_pcc-prod-2-cluster_westeurope/providers/Microsoft.Compute/disks/pvc-6b12ef2d-0996-4dda-9510-55f045182cda
  Spawn failed: Timeout

I don't think I should be out of disk space.

jmoortgat commented 10 months ago

It would even be helpful if there was a way to simply access the files on an account, i.e. even without any compute node attached.

TomAugspurger commented 10 months ago

Can you try with a non-GPU server, to see if that starts successfully? The GPU nodes do take longer to come up. Maybe I need to increase their timeout :/
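
(For reference, a minimal sketch of what bumping that timeout could look like in a jupyterhub_config.py; the 600-second value is an illustrative guess, not the Hub's actual setting.)

  # jupyterhub_config.py -- minimal sketch, not the Hub's real configuration.
  # start_timeout is how many seconds JupyterHub waits for a single-user
  # server to start before giving up; 600 is an illustrative value.
  c.Spawner.start_timeout = 600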

It would even be helpful if there was a way to simply access the files on an account, i.e. even without any compute node attached.

I'm not sure if JupyterHub (which is what this deployment uses) supports that. Do you know?

You might want to store your files somewhere more permanent, like a GitHub repository, and clone that when you start your notebook server.
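
For example, a minimal sketch of pulling a notebooks repository into the home directory when the server comes up (the repository URL below is a placeholder, not a real project):

  # Minimal sketch: clone (or update) a notebooks repo into the home directory.
  # The URL is a placeholder; swap in your own repository.
  import subprocess
  from pathlib import Path

  repo_url = "https://github.com/your-org/your-notebooks.git"  # placeholder
  dest = Path.home() / "your-notebooks"

  if dest.exists():
      subprocess.run(["git", "-C", str(dest), "pull"], check=True)
  else:
      subprocess.run(["git", "clone", repo_url, str(dest)], check=True)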

jmoortgat commented 10 months ago

The above is for a CPU server; that's what I was trying to say. Even though I need a GPU for computations, I cannot get any type of node, even just to access files.

jmoortgat commented 10 months ago

Tried again for a CPU node, with the same result:

Your server is starting up.

You will be redirected automatically when it's ready for you.

100% Complete
Spawn failed: Timeout

Event log
  Server requested
  2023-08-28T16:31:53.097284Z [Normal] Successfully assigned prod/jupyter-kingsmountainlion-40gmail-2ecom to aks-user-37927680-vmss0003gj
  2023-08-28T16:36:11Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[user-etc-singleuser dshm volume-kingsmountainlion-40gmail-2ecom]: timed out waiting for the condition
  2023-08-28T16:38:25Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser dshm]: timed out waiting for the condition
  2023-08-28T16:39:20Z [Warning] AttachVolume.Attach failed for volume "pvc-6b12ef2d-0996-4dda-9510-55f045182cda" : timed out waiting for external-attacher of disk.csi.azure.com CSI driver to attach volume /subscriptions/9da7523a-cb61-4c3e-b1d4-afa5fc6d2da9/resourceGroups/mc_pcc-prod-2-rg_pcc-prod-2-cluster_westeurope/providers/Microsoft.Compute/disks/pvc-6b12ef2d-0996-4dda-9510-55f045182cda
  2023-08-28T16:47:29Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[dshm volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser]: timed out waiting for the condition
  Spawn failed: Timeout

TomAugspurger commented 10 months ago

One of those failures might have been my fault. I was checking the capacity of your volume at the same time, and only one Pod can attach to a volume at a time.

But it looks like it might have succeeded this time?

TomAugspurger commented 10 months ago

(Edited the issue title to reflect that this is just an issue with the PC Hub).

jmoortgat commented 10 months ago

Thanks Tom. Yes, I was finally able to connect to a CPU node after the earlier couple of time-out failures. I'm afraid I may run into a similar issue more broadly, though, and don't quite understand what you mean by attaching pods to volumes.

As I mentioned, I really need a GPU node, so I disconnected the above CPU connection, hit the 'Stop Server' button, hit the log-out button, and then tried from scratch to get a GPU node again. But I'm getting similar errors:

Your server is starting up.

You will be redirected automatically when it's ready for you.

88% Complete
  2023-08-28T17:57:11Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[dshm volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser]: timed out waiting for the condition

Event log
  Server requested
  2023-08-28T17:42:30.635221Z [Warning] 0/110 nodes are available: 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T17:42:30.665402Z [Warning] 0/110 nodes are available: 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T17:42:43Z [Normal] pod triggered scale-up: [{aks-gpuuser-16178637-vmss 0->1 (max: 25)}]
  2023-08-28T17:43:02.428799Z [Warning] 0/110 nodes are available: 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T17:48:52.918466Z [Warning] 0/111 nodes are available: 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T17:49:23.444818Z [Warning] 0/111 nodes are available: 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 110 Insufficient nvidia.com/gpu, 72 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T17:50:07Z [Normal] pod triggered scale-up: [{aks-gpuuser-16178637-vmss 1->2 (max: 25)}]
  2023-08-28T17:55:08.245541Z [Normal] Successfully assigned prod/jupyter-kingsmountainlion-40gmail-2ecom to aks-gpuuser-16178637-vmss0008hu
  2023-08-28T17:56:09Z [Warning] Multi-Attach error for volume "pvc-6b12ef2d-0996-4dda-9510-55f045182cda" Volume is already exclusively attached to one node and can't be attached to another
  2023-08-28T17:57:11Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[dshm volume-kingsmountainlion-40gmail-2ecom user-etc-singleuser]: timed out waiting for the condition

Why is it saying "Volume is already exclusively attached to one node and can't be attached to another"? Is it possible for a spawn to fail and time out but somehow leave the volume attached, which then causes the next attempt to get a node to fail?

TomAugspurger commented 10 months ago

Yeah, the GPU thing I'll look into later when I get a chance. Most likely, it's just the GPU nodes taking longer to start, but I'd like to understand why that's happening.
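
On the multi-attach question: one thing I can check on our side (a minimal sketch using the Kubernetes Python client; it needs cluster credentials, so it's not something you can run from the Hub) is whether your home-directory PersistentVolume is still attached to the old node:

  # Minimal sketch (requires kubeconfig access to the cluster): check whether
  # a user's home-directory PersistentVolume is still attached to a node after
  # a failed spawn. The PV name is taken from the event log above.
  from kubernetes import client, config

  config.load_kube_config()
  storage = client.StorageV1Api()

  pv_name = "pvc-6b12ef2d-0996-4dda-9510-55f045182cda"

  for va in storage.list_volume_attachment().items:
      if va.spec.source.persistent_volume_name == pv_name:
          print(va.metadata.name, "node:", va.spec.node_name,
                "attached:", va.status.attached)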



TomAugspurger commented 10 months ago

BTW, you can use your own compute in your own Azure subscription if this is blocking you.
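
The data APIs don't require the Hub at all. A minimal sketch of searching and signing Planetary Computer assets from your own machine or VM, using the pystac-client and planetary-computer packages (the collection, bbox, and date range below are just illustrative):

  # Minimal sketch: query the Planetary Computer STAC API and sign asset URLs
  # from any compute environment, no Hub required. The collection, bbox, and
  # date range are illustrative only.
  import planetary_computer
  import pystac_client

  catalog = pystac_client.Client.open(
      "https://planetarycomputer.microsoft.com/api/stac/v1",
      modifier=planetary_computer.sign_inplace,
  )
  search = catalog.search(
      collections=["sentinel-2-l2a"],
      bbox=[-122.3, 47.5, -122.2, 47.6],
      datetime="2023-08-01/2023-08-28",
  )
  items = list(search.items())
  print(f"found {len(items)} items")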

jmoortgat commented 10 months ago

Thanks Tom. I'd understand if the GPU nodes are over-subscribed, but I don't know how to interpret the logs to know what the issue is. The main error below seems to be a 'multi-attach' issue, just like the one I had for the CPU node before:

Spawn failed: Timeout

Event log
  Server requested
  2023-08-28T18:08:04.222833Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 100 Insufficient memory, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu.
  2023-08-28T18:08:04.253862Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 100 Insufficient memory, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu.
  2023-08-28T18:08:13Z [Normal] pod triggered scale-up: [{aks-gpuuser-16178637-vmss 1->2 (max: 25)}]
  2023-08-28T18:08:42Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 100 Insufficient memory, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu.
  2023-08-28T18:14:33.114875Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T18:14:39.176559Z [Warning] 0/116 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1693245517}, that the pod didn't tolerate, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T18:15:04.645980Z [Warning] 0/116 nodes are available: 1 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 114 Insufficient nvidia.com/gpu, 70 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T18:15:06.650951Z [Warning] 0/117 nodes are available: 114 Insufficient nvidia.com/gpu, 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 70 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T18:15:09.628602Z [Warning] 0/117 nodes are available: 114 Insufficient nvidia.com/gpu, 2 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate, 70 Insufficient cpu, 98 Insufficient memory.
  2023-08-28T18:20:47.355252Z [Normal] Successfully assigned prod/jupyter-kingsmountainlion-40gmail-2ecom to aks-gpuuser-16178637-vmss0008hx
  2023-08-28T18:20:47Z [Warning] Multi-Attach error for volume "pvc-6b12ef2d-0996-4dda-9510-55f045182cda" Volume is already exclusively attached to one node and can't be attached to another
  2023-08-28T18:22:50Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-kingsmountainlion-40gmail-2ecom], unattached volumes=[user-etc-singleuser dshm volume-kingsmountainlion-40gmail-2ecom]: timed out waiting for the condition
  Spawn failed: Timeout

TomAugspurger commented 10 months ago

I looked into this a bit over the past couple days, but don't have a ton to report.

The most common occurrence seems to be something like:

  Normal   Scheduled               2m28s  dhub-prod-user-scheduler  Successfully assigned prod/jupyter-taugspurger-40microsoft-2ecom to aks-user-37927680-vmss00040u
  Warning  FailedAttachVolume      28s    attachdetach-controller   AttachVolume.Attach failed for volume "pvc-c0ca6639-cb48-4bd6-8f15-50df54eef04b" : timed out waiting for external-attacher of disk.csi.azure.com CSI driver to attach volume /subscriptions/9da7523a-cb61-4c3e-b1d4-afa5fc6d2da9/resourceGroups/MC_pcc-prod-2-rg_pcc-prod-2-cluster_westeurope/providers/Microsoft.Compute/disks/restore-77f43900-b1f0-458f-a4c5-3201f9549675
  Warning  FailedMount             25s    kubelet                   Unable to attach or mount volumes: unmounted volumes=[volume-taugspurger-40microsoft-2ecom], unattached volumes=[dshm volume-taugspurger-40microsoft-2ecom user-etc-singleuser]: timed out waiting for the condition
  Normal   SuccessfulAttachVolume  7s     attachdetach-controller   AttachVolume.Attach succeeded for volume "pvc-c0ca6639-cb48-4bd6-8f15-50df54eef04b"

In prose:

  1. Kubernetes schedules the user pod onto a node (there was an empty slot ready, so no need to scale the cluster)
  2. FailedAttachVolume indicates that something timed out when attaching the volume with my home directory (I think it's being attached to a node, but not 100% sure; maybe a pod)
  3. FailedMount from kubelet is, I think, just another log message indicating the same thing: we couldn't attach or mount the volume to the pod.
  4. SuccessfulAttachVolume indicates that when Kubernetes retried the volume mounting, things went fine.

And eventually my pod started.

This isn't how things always go. Especially when you need a GPU, we're going to be autoscaling the cluster, and perhaps there's another opportunity for something to go wrong there? And I've seen cases where the retries don't necessarily succeed, and eventually the notebook spawn times out. I don't (yet) know enough about Kubernetes & storage to say what's going on.

Next steps would be to get some more logs from the attachdetach-controller to see what's going on around the time the initial attach request timed out. But I haven't found which pod is generating those events.
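
If it helps, this is roughly what I have in mind for pulling those events (a minimal sketch using the Kubernetes Python client, requiring cluster access; the reasons filtered on are the ones from the output above):

  # Minimal sketch (requires kubeconfig access): list recent Warning events in
  # the prod namespace and keep the volume attach/mount failures reported by
  # the attachdetach-controller and kubelet.
  from kubernetes import client, config

  config.load_kube_config()
  core = client.CoreV1Api()

  events = core.list_namespaced_event("prod", field_selector="type=Warning")
  for ev in events.items:
      if ev.reason in ("FailedAttachVolume", "FailedMount"):
          component = ev.source.component if ev.source else "?"
          print(ev.last_timestamp, component, ev.reason, ev.message)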