microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

Unable to attach or mount volumes, Spawn failed: did not start in 900 seconds #171

Open scottyhq opened 1 year ago

scottyhq commented 1 year ago

The first log line (0/116 nodes are available) makes me think it's a scaling limit issue. But the status page (https://planetarycomputer-status.microsoft.com/) looks fine...

Event log
Server requested
2023-01-26T16:32:17.088705Z [Warning] 0/116 nodes are available: 51 node(s) had taint {kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, 63 Insufficient cpu, 63 Insufficient memory.
2023-01-26T16:32:17.114684Z [Normal] Successfully assigned prod/jupyter-scottyh-40uw-2eedu to aks-user-37927680-vmss00019p
2023-01-26T16:36:38Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-scottyh-40uw-2eedu], unattached volumes=[user-etc-singleuser dshm volume-scottyh-40uw-2eedu]: timed out waiting for the condition
2023-01-26T16:41:12Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-scottyh-40uw-2eedu], unattached volumes=[volume-scottyh-40uw-2eedu user-etc-singleuser dshm]: timed out waiting for the condition
2023-01-26T16:42:25Z [Warning] AttachVolume.Attach failed for volume "pvc-7c839cb1-304b-4c05-8a08-6da914f50791" : timed out waiting for external-attacher of disk.csi.azure.com CSI driver to attach volume /subscriptions/9da7523a-cb61-4c3e-b1d4-afa5fc6d2da9/resourceGroups/MC_pcc-prod-2-rg_pcc-prod-2-cluster_westeurope/providers/Microsoft.Compute/disks/restore-b66b4a4b-b3a3-4f17-ac08-7e70b5c7c670
2023-01-26T16:43:27Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-scottyh-40uw-2eedu], unattached volumes=[dshm volume-scottyh-40uw-2eedu user-etc-singleuser]: timed out waiting for the condition
Spawn failed: pod prod/jupyter-scottyh-40uw-2eedu did not start in 900 seconds!

Seems separate from #117

scottyhq commented 1 year ago

For what it's worth, after the failed start, hitting relaunch worked, with the usual log messages:

Event log
Server requested
2023-01-26T16:51:29.874186Z [Normal] Successfully assigned prod/jupyter-scottyh-40uw-2eedu to aks-user-37927680-vmss00019a
2023-01-26T16:51:47Z [Normal] AttachVolume.Attach succeeded for volume "pvc-7c839cb1-304b-4c05-8a08-6da914f50791"
2023-01-26T16:51:57Z [Normal] Container image "jupyterhub/k8s-network-tools:1.2.0" already present on machine
2023-01-26T16:51:57Z [Normal] Created container block-cloud-metadata
2023-01-26T16:51:58Z [Normal] Started container block-cloud-metadata
2023-01-26T16:51:58Z [Normal] Container image "pcccr.azurecr.io/public/planetary-computer/python:2022.9.16.0" already present on machine
2023-01-26T16:51:58Z [Normal] Created container notebook
2023-01-26T16:51:58Z [Normal] Started container notebook
TomAugspurger commented 1 year ago

Thanks for the report. I think the first line about the nodes is somewhat expected. Kubernetes will emit that before the autoscaler adds more nodes.
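To make that first log line easier to read, here is a minimal sketch that splits a scheduler message of this shape into per-reason counts. The function name `scheduler_reasons` is made up for illustration, and the regular expressions assume the exact wording seen in the event log above; other Kubernetes versions may phrase the message differently.

```python
import re

def scheduler_reasons(message):
    """Parse '<available>/<total> nodes are available: <count> <reason>, ...'."""
    # Split at the first ": ", which follows "nodes are available".
    head, _, tail = message.partition(": ")
    counts = re.match(r"(\d+)/(\d+) nodes are available", head)
    if counts is None:
        raise ValueError("not a node-availability message")
    available, total = int(counts[1]), int(counts[2])
    # Each reason starts with a count and runs until the next ", <count> " or the end.
    reasons = {
        m[2]: int(m[1])
        for m in re.finditer(r"(\d+) (.+?)(?=, \d+ |\.?$)", tail)
    }
    return available, total, reasons

# Message copied from the first event in the report above.
message = (
    "0/116 nodes are available: 51 node(s) had taint "
    "{kubernetes.azure.com/scalesetpriority: spot}, that the pod didn't tolerate, "
    "63 Insufficient cpu, 63 Insufficient memory."
)
available, total, reasons = scheduler_reasons(message)
print(f"{available}/{total} nodes available")
for reason, count in reasons.items():
    print(f"{count:4d}  {reason}")
```

The output only restates what the scheduler reported at that instant; it says nothing about whether the autoscaler added nodes afterwards.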

The line at

2023-01-26T16:42:25Z [Warning] AttachVolume.Attach failed for volume "pvc-7c839cb1-304b-4c05-8a08-6da914f50791" : timed out waiting for external-attacher of disk.csi.azure.com CSI driver to attach volume /subscriptions/9da7523a-cb61-4c3e-b1d4-afa5fc6d2da9/resourceGroups/MC_pcc-prod-2-rg_pcc-prod-2-cluster_westeurope/providers/Microsoft.Compute/disks/restore-b66b4a4b-b3a3-4f17-ac08-7e70b5c7c670

is an error we used to see pretty often, but it seemed to be mostly fixed by our migration to a newer Kubernetes cluster.

As you saw, the volume attach seems to always succeed on subsequent attempts.

I'll keep an eye out to see if this continues to happen.
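For anyone who wants to check their own event log for the same pattern, here is a rough sketch that pulls the volume attach and mount warnings out of pasted log text and measures how long they span. The helper names (`parse_events`, `attach_problems`) are hypothetical, the line format is assumed from the logs in this thread, and the sample is a subset of the events from the original report.

```python
import re
from datetime import datetime, timezone

# Matches lines such as "2023-01-26T16:36:38Z [Warning] Unable to attach ...".
# Fractional seconds are optional and ignored.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z "
    r"\[(?P<level>\w+)\] (?P<msg>.*)$"
)

def parse_events(log_text):
    """Return (timestamp, level, message) tuples for lines that look like events."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S")
            events.append((ts.replace(tzinfo=timezone.utc), m["level"], m["msg"]))
    return events

def attach_problems(events):
    """Keep only warnings that mention volume attach or mount failures."""
    needles = ("AttachVolume.Attach failed", "Unable to attach or mount volumes")
    return [e for e in events if e[1] == "Warning" and any(n in e[2] for n in needles)]

# A subset of the event log from the original report.
sample = """\
Server requested
2023-01-26T16:32:17.114684Z [Normal] Successfully assigned prod/jupyter-scottyh-40uw-2eedu to aks-user-37927680-vmss00019p
2023-01-26T16:36:38Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-scottyh-40uw-2eedu], unattached volumes=[user-etc-singleuser dshm volume-scottyh-40uw-2eedu]: timed out waiting for the condition
2023-01-26T16:41:12Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-scottyh-40uw-2eedu], unattached volumes=[volume-scottyh-40uw-2eedu user-etc-singleuser dshm]: timed out waiting for the condition
2023-01-26T16:43:27Z [Warning] Unable to attach or mount volumes: unmounted volumes=[volume-scottyh-40uw-2eedu], unattached volumes=[dshm volume-scottyh-40uw-2eedu user-etc-singleuser]: timed out waiting for the condition
"""

events = parse_events(sample)
problems = attach_problems(events)
# Seconds between the pod being assigned and the last attach/mount warning.
elapsed = (problems[-1][0] - events[0][0]).total_seconds()
print(f"{len(problems)} attach/mount warnings, last one {elapsed:.0f} s after assignment")
```

Lines without a timestamp, such as "Server requested" or the final "Spawn failed" message, are skipped by the parser rather than treated as errors.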

cxyth commented 6 months ago

@TomAugspurger It looks like my home directory is full, but there seems to be no way to delete files without logging into the server?

Spawn failed: Server at http://10.244.224.162:8888/compute/user/cxyth@live.com/ didn't respond in 30 seconds

Event log
Server requested
2024-03-28T06:35:27.260261Z [Normal] Successfully assigned prod/jupyter-cxyth-40live-2ecom to aks-user-17077795-vmss0000x1
2024-03-28T06:35:36Z [Normal] AttachVolume.Attach succeeded for volume "pvc-1ebcdaf5-21f2-40c9-bdd8-d96e49e974a5"
2024-03-28T06:35:40Z [Normal] Container image "jupyterhub/k8s-network-tools:1.2.0" already present on machine
2024-03-28T06:35:40Z [Normal] Created container block-cloud-metadata
2024-03-28T06:35:41Z [Normal] Started container block-cloud-metadata
2024-03-28T06:35:41Z [Normal] Container image "pcccr.azurecr.io/planetary-computer/python:2024.3.20.1" already present on machine
2024-03-28T06:35:41Z [Normal] Created container notebook
2024-03-28T06:35:41Z [Normal] Started container notebook
Spawn failed: Server at http://10.244.224.162:8888/compute/user/cxyth@live.com/ didn't respond in 30 seconds
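
As a side note for readers who can still start a server and want to avoid reaching this state, here is a minimal sketch for finding the largest files under a directory using only the Python standard library. The function names `largest_files` and `usage_summary` are made up for illustration, and nothing here is specific to the Planetary Computer Hub. It does not help once the server can no longer start.

```python
import os
import shutil

def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # Skip broken symlinks and files that vanish mid-scan.
                continue
    return sorted(sizes, reverse=True)[:limit]

def usage_summary(root):
    """Describe how full the volume holding `root` is."""
    usage = shutil.disk_usage(root)
    return f"{usage.used / usage.total:.0%} of the volume holding {root} is in use"
```

From a notebook or terminal, `print(usage_summary(os.path.expanduser("~")))` followed by `largest_files(os.path.expanduser("~"))` gives a quick picture of what is taking up space.
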
TomAugspurger commented 6 months ago

Could you send us an email at @.*** with the address you signed up with and we'll take a look?

