ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu
GNU General Public License v3.0

GPU parallelization #3

Open zhonge opened 4 years ago

heejongkim commented 4 years ago

Hi,

I just wanted to check whether you have made any progress on this. I saw some commits where you added DataParallel lines.

Thanks for developing such a nice tool for the cryo-EM field.

best, heejong

zhonge commented 4 years ago

Hi Heejong,

Thanks for asking! The current top of tree has GPU parallelization (commit 3ba2439db6fef20922dd3c60c2a7ab1508475d76) and mixed precision training. Feel free to give it a shot -- I've been meaning to reorganize the documentation before an official release.
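If it helps while the docs are still being reorganized, the parallelization follows the standard PyTorch pattern; roughly something like this sketch (placeholder model and variable names, not the actual train_vae code):

```python
import torch
import torch.nn as nn
from apex import amp  # NVIDIA apex provides the mixed precision used by --amp

# Placeholder model and optimizer; in cryodrgn these would be the VAE and its optimizer.
model = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 128)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Mixed precision: patch the model/optimizer before replicating across GPUs.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Multi-GPU: DataParallel splits each minibatch across all visible GPUs.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

x = torch.randn(64, 128).cuda()        # dummy batch
loss = ((model(x) - x) ** 2).mean()    # dummy reconstruction loss

# Scale the loss so fp16 gradients don't underflow, then step as usual.
optimizer.zero_grad()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```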

A few mini docs:

Thanks, Ellen

heejongkim commented 4 years ago

Yes. I briefly tested with multiple GPUs and it improved performance substantially, which helped me quickly try things out and settle on at least minimally working parameters for my cases.

One thing I couldn't get to work is the --amp flag. With that flag I'm getting the following error:

```
Traceback (most recent call last):
  File "/home/XXXX/miniconda3/envs/cryodrgn/bin/cryodrgn", line 11, in <module>
    load_entry_point('cryodrgn==0.2.1b0', 'console_scripts', 'cryodrgn')()
  File "/home/XXXX/miniconda3/envs/cryodrgn/lib/python3.7/site-packages/cryodrgn-0.2.1b0-py3.7.egg/cryodrgn/main.py", line 50, in main
    args.func(args)
  File "/home/XXXX/miniconda3/envs/cryodrgn/lib/python3.7/site-packages/cryodrgn-0.2.1b0-py3.7.egg/cryodrgn/commands/train_vae.py", line 356, in main
    assert (D-1) % 8 == 0
AssertionError
```

If you have any idea what might have caused this, it would be tremendously helpful. Once I get it working, I will get back to you with a direct speed comparison.

Thanks.

zhonge commented 4 years ago

Great to hear! I added some assertion messages for the assert that you ran into (commit f1de270a565592adc88602dfee313ed861afebb5).

It's checking that your image size is a multiple of 8. Mixed precision training only gives dramatic speedups when the tensor dimensions are multiples of 8, so I added a few asserts to ensure this.
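For reference, the check is essentially the following (paraphrased, not the exact source; D is the lattice size used internally, i.e. image size + 1):

```python
def check_amp_image_size(D: int) -> None:
    # Paraphrase of the check behind the AssertionError above: D - 1 is the
    # particle image size, and mixed precision only hits the fast tensor-core
    # paths when dimensions are multiples of 8.
    assert (D - 1) % 8 == 0, (
        f"Image size {D - 1} must be a multiple of 8 to train with --amp; "
        "downsample to a compatible box size first."
    )

check_amp_image_size(129)    # image size 128 -> passes
# check_amp_image_size(131)  # image size 130 -> would raise AssertionError
```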

heejongkim commented 4 years ago

Fantastic! I just ran over 1 million particles at D=128 with --amp --lazy --zdim 10 -n 50 --qdim 1024 --qlayers 3 --pdim 1024 --players 3, and it took only a little over one day.

Also, thanks for adding the argument for specifying the K. It's really helpful.

kimdn commented 4 years ago

It's checking that your image size is a multiple of 8. Mixed precision training leads to dramatic speed ups only if your tensor dimensions are even multiples of 8, so I added a few asserts to ensure this.

When I compared amp vs. no amp with a common command,

```
cryodrgn train_vae cryosparc_P4_J251_009_particles_cs_abs_w_mrcs_star_06_25.256.mrcs --poses pose_256.pkl --ctf ctf.pkl --zdim 8 -n 100 -o vae256_z8_e100 --lazy --batch-size 64 --beta 4
```

adding --amp made it run ~17 times faster.

Although I didn't run a formal benchmark (i.e. multiple runs to rule out fluke data/conditions), I kept every other setting the same (same number of GPUs, same partition/hardware, same command). Therefore, I plan to add --amp from now on.

zhonge commented 4 years ago

Wow, 17x! Great! I haven't noticed any accuracy degradation when using mixed precision training (admittedly with limited benchmarking), so I usually leave it on by default as well. For smaller architectures, the overhead can sometimes make amp slightly slower than full precision training, so keep that in mind too.

Just as a quick note, I would caution against increasing the batch size too much, since it may negatively affect the training dynamics. We're definitely not GPU-memory limited with the default batch size (-b 8), so increasing the batch size can lead to dramatic speedups in time per epoch... except that it also means fewer model updates per epoch, so you can actually end up training slower in wall-clock time to reach the same quality. I've noticed this in some initial tests, but it's something else to explore/benchmark before officially releasing the GPU parallelization version.
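As a rough illustration of that trade-off (made-up numbers):

```python
# Made-up numbers: a larger batch means fewer optimizer updates per pass over
# the data, even though each epoch finishes faster.
n_particles = 1_000_000

for batch_size in (8, 64, 512):
    updates_per_epoch = n_particles // batch_size
    print(f"batch size {batch_size:>3}: {updates_per_epoch:>6} updates per epoch")

# batch size   8: 125000 updates per epoch
# batch size  64:  15625 updates per epoch
# batch size 512:   1953 updates per epoch
```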

kimdn commented 3 years ago

Wow, 17x! Great! I haven't noticed any accuracy degradation when using mixed precision training (admittedly with limited benchmarking), so I usually leave it on by default as well. For smaller architectures, the overhead can sometimes make amp slightly slower than full precision training, so keep that in mind too.

Just as a quick note, I would caution against increasing the batch size too much, since it may negatively affect the training dynamics. We're definitely not GPU-memory limited with the default batch size (-b 8), so increasing the batch size can lead to dramatic speedups in time per epoch... except that it also means fewer model updates per epoch, so you can actually end up training slower in wall-clock time to reach the same quality. I've noticed this in some initial tests, but it's something else to explore/benchmark before officially releasing the GPU parallelization version.

Thanks for your comment.

With the default architecture (--enc-layers QLAYERS = 3 hidden layers, --enc-dim QDIM = 256 nodes, --dec-layers PLAYERS = 3 hidden layers, --dec-dim PDIM = 256 nodes) and an NVIDIA A100, I see a 2.5x wall-clock speedup with amp (with apex installed via the Python-only build).

donghuachensu commented 3 years ago

@kimdn How did you install apex? I installed it in a separate apex folder alongside the cryodrgn folder, but the --amp option doesn't work; it fails with the following error: NameError: name 'amp' is not defined

Any suggestion would be greatly appreciated.

kimdn commented 3 years ago

@donghuachensu

I ran pip install -v --disable-pip-version-check --no-cache-dir ./ according to https://github.com/NVIDIA/apex#quick-start

The C++/CUDA extension install (e.g. pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./) never worked on my system (it always failed with an error during installation, over many months of trying).
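If --amp still fails with NameError: name 'amp' is not defined, it usually means apex isn't importable from the environment cryodrgn runs in. A quick sanity check (plain Python, nothing cryodrgn-specific):

```python
# Run inside the same conda environment that cryodrgn runs in; if this import
# fails, cryodrgn has no 'amp' object available and you get the
# "NameError: name 'amp' is not defined" error above.
from apex import amp
print("apex.amp imported OK:", amp)
```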