microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

Data download & Dask both failing #157

Closed zherbz closed 1 year ago

zherbz commented 1 year ago

Hello,

I am trying to subset CMIP6 data using the MPC and save it to my local machine.

I have been using the same code for months to subset and download data from the MPC, but now I can't save or download anything. There isn't anything unusual going on in the one function I have written, and nothing has changed on my end since it was last working.

My guess is there is an issue persisting the data from blob storage to local storage within my Hub environment. The persist step throws no errors, but the subsequent load() and save to NetCDF is where everything fails and runs until it times out.
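Roughly, the workflow looks like this (the collection, model, variable, and bounding box below are placeholders, not my exact values):

```python
import planetary_computer
import pystac_client
import xarray as xr
import fsspec

# Connect to the Planetary Computer STAC API, signing asset URLs for blob access
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

# Placeholder CMIP6 search -- not the exact collection/model I am using
search = catalog.search(
    collections=["nasa-nex-gddp-cmip6"],
    query={"cmip6:model": {"eq": "ACCESS-CM2"}},
)
item = next(search.items())

# Open one NetCDF asset lazily with Dask-backed chunks
ds = xr.open_dataset(
    fsspec.open(item.assets["tasmax"].href).open(),  # placeholder variable
    engine="h5netcdf",
    chunks={"time": 365},
)

# Subset, persist on the cluster, then load and save locally
subset = ds.sel(lat=slice(35.0, 45.0), lon=slice(250.0, 260.0))
subset = subset.persist()             # no error raised here
subset.load().to_netcdf("subset.nc")  # this is where it hangs / times out
```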

Also, my Dask dashboard is lifeless when I try to see what is going on, showing no cores at all even though I have requested a minimum of 4 cores to be scaled adaptively up to 24 as needed. Dask shows tasks being added to the progress panel, but no progress actually develops. I have also tested this by following along with some of the provided MPC tutorials and I still run into the same issue: Dask adds jobs to the queue in the progress bar, but there are no workers to actually distribute the work to.
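The cluster setup is roughly the standard dask-gateway pattern from the tutorials (sketched here from memory, not copied verbatim):

```python
import dask_gateway

# Create a cluster through the Hub's Dask Gateway and scale it adaptively
gateway = dask_gateway.Gateway()
cluster = gateway.new_cluster()
cluster.adapt(minimum=4, maximum=24)  # minimum of 4, up to 24 as needed

client = cluster.get_client()
print(client.dashboard_link)  # dashboard shows queued tasks but no workers
```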

Any help is greatly appreciated!

Thanks

(screenshots attached: code, Dask dashboard)

zherbz commented 1 year ago

This seems to be a server issue when the session is initially set up. For clarity, I was just using the standard 4-core Python environment. All day yesterday I had the same problem described above, but today, with a new server instance created, the problem is fixed. I also tried signing in and out a bunch of times yesterday, but to no avail.

Is there a way to instead cancel the server instance and start a new one when issues like this occur?

TomAugspurger commented 1 year ago

It doesn't look like you have any workers ready yet, so Dask isn't able to make any progress on the computation.
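You can confirm that from the client before kicking off the computation, e.g. something along these lines:

```python
# Check how many workers the scheduler actually has right now
n_workers = len(client.scheduler_info()["workers"])
print(f"{n_workers} workers connected")

# Or block until at least one worker has joined before calling compute/load
client.wait_for_workers(n_workers=1)
```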

Is there a way to instead cancel the server instance and start a new one when issues like this occur?

This issue is unrelated to the singleuser notebook server instance, so restarting that won't help (or hurt); the worker pools are separate. The total number of worker nodes (across all users on the Hub) is capped, and it's possible we were at that limit when you requested workers.

Your options are to 1) wait until workers become available and try again later, or 2) use one of the other compute options, including deploying Dask workers on compute in your own Azure subscription.
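For the second option, dask-cloudprovider is one way to run the workers on VMs in your own subscription (the resource names, location, and VM size below are placeholders):

```python
from dask_cloudprovider.azure import AzureVMCluster

# Dask workers on Azure VMs in your own subscription.
# Resource group, vnet, and security group names are placeholders.
cluster = AzureVMCluster(
    resource_group="my-resource-group",
    vnet="my-vnet",
    security_group="my-security-group",
    location="eastus",
    vm_size="Standard_D4s_v3",
    n_workers=4,
)
client = cluster.get_client()
```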

zherbz commented 1 year ago

Thanks!