Distributed implementation using mirrored strategy

juglab / cryoCARE_pip

PIP package of cryoCARE

BSD 3-Clause "New" or "Revised" License

25 stars 14 forks source link

Distributed implementation using mirrored strategy #35

Closed rdrighetto closed 1 year ago

rdrighetto commented 1 year ago

A simple multi-GPU implementation for training as described in #11

thorstenwagner commented 1 year ago

If @rdrighetto is happy with it, I think we can merge it :-) Could you maybe change README and add information how to use it?

rdrighetto commented 1 year ago

I updated README.md as instructed by @thorstenwagner but only now I realized that my instructions for multi-GPU usage are somewhat conflicting with @EuanPyle implementation of choosing the GPU ID. I've always assumed that setting CUDA_VISIBLE_DEVICES was the standard way of doing this.

Any suggestions on how to settle this?

EuanPyle commented 1 year ago

Hi Ricardo, I think you should be able to add multiple GPUs to the GPU ID field in the json, so it shouldn't conflict with any multiGPU implementation. I actually added the GPU ID function so external programs (in my case the RELION 5.0 tomography module) can easily specify GPU ID without running extra commands in the shell i.e. the RELION wrapper command would simply be able to specify GPU ID 0,1 and not run 1 line specifying GPU ID and then another running cryoCARE.

In any case, can both options co-exist? If users want to run the CUDA_VISIBLE_DEVICES way, it should still work, and GPU_ID field in the json is optional so that can just be ignored in the README. Hope I am making sense, not sure I am!

rdrighetto commented 1 year ago

Hi @EuanPyle , thanks a lot for the clarification!

I ran a few tests here and found the following:

The correct way of specifying multiple GPU IDs in the train_config.json file is the following: "gpu_id": [1,2] (this was not obvious to me, I had to try a few different ways)
CUDA_VISIBLE_DEVICES always takes precedence over "gpu_id". For a single GPU this is fine, but I get some funny behaviors when both specifiers are set with multiple GPU IDs. So I think it's safe to say that while both options are valid, they are mutually exclusive, and will document it as such.

I will update the README.md for my pull request later today.

Cheers, Ricardo

thorstenwagner commented 1 year ago

@rdrighetto : I see that you are quite active. Tell me when you are ready and I gonna merge it :-)

rdrighetto commented 1 year ago

OK, I did a few more tests. It runs fine on our cluster even when both options are set (with different numbers of GPUs specified in each, just for fun) and actually "gpu_id" takes precedence over CUDA_VISIBLE_DEVICES, not the other way around. So I think it should be good even if users specify both options by mistake (as in: it doesn't crash, and considers the "gpu_id" option correctly). I've also updated my branch to reflect the latest changes in the master branch upstream, so merging should be easy now. I think it's ready :-)

EuanPyle commented 1 year ago

Amazing, thanks Ricardo!

thorstenwagner commented 1 year ago

Thank you a lot! I merge it now.

thorstenwagner commented 1 year ago

I also put it pypi:

https://pypi.org/project/cryoCARE/0.3.0/