jbusecke opened this issue 3 years ago
client.wait_for_workers? https://distributed.dask.org/en/latest/api.html#distributed.Client.wait_for_workers

I'll give that a try...
It seems like this error is already raised during the cluster = coiled.Cluster(...) step, so even before the client object is created. I did not see any relevant options in the Cluster API besides timeout=None, but the error is triggered quite promptly regardless.
client.wait_for_workers will be useful for blocking until a certain number of workers have spun up. However, from the traceback @jbusecke linked to, it appears coiled is raising an error here when the initial request for workers is made, because your CPU limit has already been reached. This means the request for workers itself isn't successful, so we don't even have a cluster object we can connect a client to.
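To illustrate the wait_for_workers pattern for the case where cluster creation does succeed, here is a minimal sketch. It uses a LocalCluster as a stand-in, since a real coiled.Cluster needs an account; with Coiled, the Client and wait_for_workers calls would be the same.

```python
from dask.distributed import Client, LocalCluster

# Local stand-in for coiled.Cluster(...); with Coiled the same two
# calls below (Client(cluster), client.wait_for_workers(...)) apply.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

client.wait_for_workers(2)  # blocks until 2 workers have connected
n_workers = len(client.scheduler_info()["workers"])

client.close()
cluster.close()
```

Note that this only helps after the cluster request itself has been accepted; it does not address the CPU-limit error above.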
@jbusecke could you try using coiled.list_clusters and coiled.delete_cluster at the beginning of your script to ensure any previous cmip6_derived_cloud_datasets clusters have been shut down?
My current remedy is to run this with fewer workers per cluster, but since I am planning to run this over many different models in the end, it would be amazing for those runs not to fail.

Another approach could be to limit the number of concurrent jobs launched by GitHub Actions and then adjust the Coiled workers accordingly. I'll look into that now.
> @jbusecke could you try using coiled.list_clusters and coiled.delete_cluster to ensure any previous cmip6_derived_cloud_datasets clusters have been shut down?
But this would shut down other clusters that run in a parallel job, right? I definitely do not want to nuke those if they are still running.
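One way to avoid nuking parallel jobs' clusters would be to filter by name before deleting. A sketch, assuming coiled.list_clusters() yields records with a "name" field (the exact schema may differ between coiled versions, so check yours); the actual Coiled calls are shown in comments since they need an account:

```python
def stale_cluster_names(clusters, prefix="cmip6_derived_cloud_datasets"):
    """Return names of clusters that belong to this workflow's prefix."""
    return [c["name"] for c in clusters if c["name"].startswith(prefix)]

# With a Coiled account, the cleanup step would look roughly like:
#   import coiled
#   for name in stale_cluster_names(coiled.list_clusters()):
#       coiled.delete_cluster(name=name)

# Quick check against fake records:
fake = [{"name": "cmip6_derived_cloud_datasets-abc"}, {"name": "other-team-job"}]
print(stale_cluster_names(fake))  # only the cmip6 cluster is selected
```

This still wouldn't distinguish a stale cmip6 cluster from one a parallel job is actively using, so some per-job naming scheme would be needed on top.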
A quick look resulted in this concurrency feature. But I think that will actually limit the number of jobs to 1, and I am not sure if there is a way to run this with another fixed number.
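For what it's worth, GitHub Actions' concurrency groups do cap runs at 1, but a matrix strategy with max-parallel allows any fixed number of simultaneous jobs. A sketch (the job name, model list, and script are hypothetical placeholders):

```yaml
# Hypothetical workflow fragment: at most 3 model jobs run at once,
# so peak Coiled CPU usage stays at 3 * cpus_per_job.
jobs:
  process:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 3
      matrix:
        model: [model-a, model-b, model-c, model-d]
    steps:
      - run: python process.py --model ${{ matrix.model }}
```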
Maybe the scenario I envision is just too core-hungry? But I would love to at least process 1-3 models in parallel to reduce the time needed...
What is your current core limit? Let's just bump that way up on the Coiled side?
Yeah, that would work as a short-term workaround. @jbusecke, what's the name of the account you're currently using?
"jbusecke"
Just bumped the jbusecke core limit to 500
Thanks so much @jrbourbeau. I will leave this one open, since this is more of a short-term fix?
Sounds good -- just double checking: did the increased core limit work as a workaround?
My trial AWS account ran out of credits shortly after this conversation, so I was not able to fully run the calculation.
Once I have that sorted via Columbia (might take a few days...) I will try to run a calculation over a lot more models. I think as long as I keep cpu_per_job smaller than the Coiled CPU limit divided by the number of GH jobs allowed to run concurrently, I should be fine. But I will follow up on this once I can actually try it.
It seems that currently the processing will fail if a certain number of CPUs is exceeded (see here). I wonder if there is a possibility to tell coiled to wait until resources are freed up, instead of failing the process. That way we could 'line up' many more computation jobs without increasing the CPU limit.
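Until something like that exists server-side, the wait-for-capacity behavior could be approximated client-side by retrying cluster creation. A minimal sketch: make_cluster stands in for a coiled.Cluster(...) call, and RuntimeError is a stand-in for whatever exception Coiled actually raises when the CPU limit is hit (that would need to be checked against a real traceback).

```python
import time

def create_with_retry(make_cluster, retries=5, wait=60):
    """Call make_cluster(), retrying when the capacity limit is hit.

    make_cluster: zero-argument callable, e.g. lambda: coiled.Cluster(...)
    RuntimeError below is a placeholder for Coiled's actual limit error.
    """
    for attempt in range(retries):
        try:
            return make_cluster()
        except RuntimeError:
            if attempt == retries - 1:
                raise  # out of retries; surface the original error
            time.sleep(wait)  # wait for other jobs to free up CPUs
```

With wait set to a minute or two, queued GitHub Actions jobs would effectively line up for capacity instead of failing immediately.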