related-sciences / ukb-gwas-pipeline-nealelab

Pipeline for reproduction of NealeLab 2018 UKB GWAS

Figure out how to get dependencies into VM images for Dask CP #22

Closed by eric-czech 3 years ago

eric-czech commented 3 years ago

With GCP support in Dask Cloud Provider (see https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/19), next we need to determine how to modify the worker images to contain necessary dependencies.

This is probably the right approach: https://cloudprovider.dask.org/en/latest/packer.html. We probably want to bake the dask docker image, along with any extra software, into the VM image before switching the Cloud Provider cluster configs over to custom VM and docker images. A sketch of what that config might look like is below.
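For reference, here's a minimal sketch of what that cluster config might look like once a Packer-built image exists, using dask_cloudprovider's `GCPCluster` with its `source_image` and `docker_image` parameters. The project, zone, and image names are placeholders, not real resources from this repo:

```python
from dask_cloudprovider.gcp import GCPCluster

cluster = GCPCluster(
    projectid="my-gcp-project",  # placeholder project id
    zone="us-east1-c",           # placeholder zone
    machine_type="n1-standard-8",
    # Packer-built VM image with docker installed and the dask image pre-pulled
    source_image="projects/my-gcp-project/global/images/dask-ukb-worker",
    # docker image layered with sgkit, xarray, and other dependencies
    docker_image="daskdev/dask:latest",  # would be swapped for a custom image
    n_workers=4,
)
```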

eric-czech commented 3 years ago

FYI, this is certainly necessary as these errors occur in worker logs when trying to use sgkit:

```
distributed.worker - ERROR - No module named 'xarray'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 915, in handle_scheduler
    await self.handle_stream(
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 579, in handle_stream
    msgs = await comm.read()
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 204, in read
    msg = await from_frames(
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/utils.py", line 87, in from_frames
    res = _from_frames()
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/utils.py", line 65, in _from_frames
    return protocol.loads(
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/core.py", line 151, in loads
    value = _deserialize(head, fs, deserializers=deserializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 335, in deserialize
    return loads(header, frames)
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 71, in pickle_loads
    return pickle.loads(x, buffers=buffers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 75, in loads
    return pickle.loads(x)
ModuleNotFoundError: No module named 'xarray'
```

distributed.worker - INFO - Connection to scheduler broken. Reconnecting...

It's unclear to me how much of sgkit is usable on a vanilla dask cluster at the moment. I suspect not much, so we'll probably need library-specific cluster configuration instructions.
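For anyone debugging the same thing: a quick, generic way to check which modules the workers can actually import is distributed's `Client.run`. This is a diagnostic sketch, not code from this repo:

```python
from distributed import Client

def module_version(name):
    # Returns the module's version if it is importable on the worker, else None
    try:
        import importlib
        return importlib.import_module(name).__version__
    except ImportError:
        return None

client = Client(cluster)  # assumes a cluster object like the ones above
# Runs on every worker and returns {worker_address: version or None}
print(client.run(module_version, "xarray"))
```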

eric-czech commented 3 years ago

This can also be accomplished by passing variables like those in this env_vars.json to the env_vars parameter of Dask Cloud Provider VMCluster instances. See here for a larger discussion on that.

Using env vars for this adds a few minutes to cluster start times, since the packages get installed when each worker container boots (see the sketch below). That hasn't been annoying enough yet to justify building custom VM images, so I'll stick with it for now.
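A minimal sketch of the env-var approach, assuming the stock daskdev/dask image, whose entrypoint installs EXTRA_CONDA_PACKAGES / EXTRA_PIP_PACKAGES at container startup (which is where the extra startup minutes come from). The package list and project/zone names are illustrative, not this repo's actual configuration:

```python
from dask_cloudprovider.gcp import GCPCluster

env_vars = {
    "EXTRA_CONDA_PACKAGES": "xarray zarr",  # installed via conda at container boot
    "EXTRA_PIP_PACKAGES": "sgkit gcsfs",    # installed via pip at container boot
}

cluster = GCPCluster(
    projectid="my-gcp-project",  # placeholder
    zone="us-east1-c",           # placeholder
    env_vars=env_vars,
    n_workers=4,
)
```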

hammer commented 3 years ago

@eric-czech having followed the linked cloud provider discussion, I still don't quite understand why you're holding out on building custom VM images. If you've got a workflow that works for you though, that's fine.

eric-czech commented 3 years ago

It's not worth the effort to make the process 20% faster right now. It would be eventually, but I don't care enough yet while I'm still struggling to do fairly basic things w/ distributed dask + sgkit.