jbusecke / cmip6_derived_cloud_datasets

Prototype for derived cloud data pipeline using CMIP6 data.

Processing fails when coiled CPU number is exceeded #4

Open jbusecke opened 3 years ago

jbusecke commented 3 years ago

It seems that the processing currently fails once a certain number of CPUs is exceeded (see here). I wonder if there is a way to tell coiled to wait until resources are freed up, instead of failing the process.

That way we could 'line up' many more computation jobs without increasing the CPU limit.

rabernat commented 3 years ago

client.wait_for_workers? https://distributed.dask.org/en/latest/api.html#distributed.Client.wait_for_workers
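
For reference, a minimal sketch of that pattern (the cluster arguments and worker count here are placeholders, not taken from the repo's script):

```python
import coiled
from distributed import Client

# Request a coiled cluster; n_workers is an illustrative value.
cluster = coiled.Cluster(n_workers=10)
client = Client(cluster)

# Block until at least 10 workers have actually connected before submitting work.
client.wait_for_workers(n_workers=10)
```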

jbusecke commented 3 years ago

I'll give that a try...

jbusecke commented 3 years ago

It seems like this error is already raised during the cluster = coiled.Cluster(...) step, so even before the client object is created. I did not see any relevant options in the Cluster API besides timeout=None, and the error seems to be triggered quite promptly.
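
One fallback I could imagine (not a coiled feature, just a generic retry around the cluster creation; the exception type and wait times are guesses):

```python
import time
import coiled

def create_cluster_with_retry(n_workers, retries=10, wait=300):
    """Retry coiled.Cluster creation until capacity frees up or retries run out."""
    for attempt in range(retries):
        try:
            return coiled.Cluster(n_workers=n_workers)
        except Exception as e:
            # The CPU-limit error is raised promptly, so just back off and retry.
            print(f"Cluster creation failed ({e}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError("Could not create cluster within the retry budget")
```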

jrbourbeau commented 3 years ago

client.wait_for_workers will be useful for blocking until a certain number of workers have spun up. However, from the traceback @jbusecke linked to, it appears coiled is raising an error here

https://github.com/jbusecke/cmip6_derived_cloud_datasets/blob/7c22ba97d163ddab562b9d8ff46fc669665c41e9/production_test_env.py#L18-L23

when the initial request for workers is made because your CPU limit has already been reached. This means we're in a place where the actual request for workers isn't successful so we don't even have a cluster object we can connect a client to.

@jbusecke could you try using coiled.list_clusters and coiled.delete_cluster at the beginning of your script to ensure any previous cmip6_derived_cloud_datasets clusters have been shut down?
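
Something along these lines might work (a rough sketch: the exact return shape of coiled.list_clusters can differ between coiled versions, and the name filter is only an illustration):

```python
import coiled

# Shut down any leftover clusters from previous runs of this pipeline.
for name in coiled.list_clusters():
    # Only tear down clusters belonging to this project, leaving others alone.
    if name.startswith("cmip6_derived_cloud_datasets"):
        coiled.delete_cluster(name=name)
```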

jbusecke commented 3 years ago

My current remedy is to run this with fewer workers per cluster, but since I am planning to run this over many different models in the end, it would be amazing if those runs did not fail.

Another approach could be to limit the number of concurrent jobs launched by GitHub Actions and then adjust the coiled workers accordingly. I'll look into that now.

jbusecke commented 3 years ago

@jbusecke could you try using coiled.list_clusters and coiled.delete_cluster to ensure any previous cmip6_derived_cloud_datasets clusters have been shut down?

But this would shut down other clusters that are running in parallel jobs, right? I definitely do not want to nuke those if they are still running.

jbusecke commented 3 years ago

A quick look turned up this concurrency feature. But I think that will actually limit the number of concurrent jobs to 1, and I am not sure whether there is a way to allow some other fixed number.

Maybe the scenario I envision is just too core-hungry? But I would love to process at least 1-3 models in parallel to reduce the time needed...

mrocklin commented 3 years ago

What is your current core limit? Let's just bump that way up on the Coiled side?

jrbourbeau commented 3 years ago

Yeah, that would work as a short-term workaround. @jbusecke, what is the name of the account you're currently using?

jbusecke commented 3 years ago

"jbusecke"

jrbourbeau commented 3 years ago

Just bumped the jbusecke core limit to 500

jbusecke commented 3 years ago

Thanks so much @jrbourbeau. I will leave this one open, since this is more of a short-term fix?

jrbourbeau commented 3 years ago

Sounds good -- just double checking, did the increased core limit work as a workaround?

jbusecke commented 3 years ago

My trial AWS account ran out of credits shortly after this conversation, so I was not able to fully ramp up the calculation.

Once I have that sorted via Columbia (might take a few days...) I will try to run a calculation over a lot more models. I think as long as I keep the cpu_per_job smaller than the coiled CPU limit divided by the number of GH Actions jobs allowed to run concurrently, I should be fine. But I will follow up on this once I can actually try it.
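
To make that concrete, a back-of-the-envelope sketch (all numbers hypothetical):

```python
# Illustrative arithmetic only; none of these values come from the actual setup.
coiled_cpu_limit = 500      # account-wide CPU limit on the coiled side
concurrent_gh_jobs = 3      # models processed in parallel via GitHub Actions
cpus_per_worker = 4         # CPUs per dask worker

max_cpus_per_job = coiled_cpu_limit // concurrent_gh_jobs        # 166
max_workers_per_cluster = max_cpus_per_job // cpus_per_worker    # 41
print(max_cpus_per_job, max_workers_per_cluster)
```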