juglab / cryoCARE_pip

PIP package of cryoCARE
BSD 3-Clause "New" or "Revised" License

parallelization over multiple GPUs #11

Open jychoi0616 opened 2 years ago

jychoi0616 commented 2 years ago

Hi Tim-Oliver,

Many thanks for developing cryoCARE! Could you implement parallelization over multiple GPUs and/or an option to choose which GPU to use for the calculation, please (when you have some spare time)?

Many thanks, Joy

thorstenwagner commented 2 years ago

You can always choose the GPU with CUDA_VISIBLE_DEVICES=X like

CUDA_VISIBLE_DEVICES=1 cryoCARE_predict.py --conf my_conf.json

I do not plan to add multi-GPU support. Do you, @tibuch?

tibuch commented 2 years ago

I have not planned for it. I am more hoping for TensorFlow to support multi-GPU out of the box :smile_cat:

rdrighetto commented 2 years ago

tl;dr: first attempt at multi-GPU training gives a speedup but not much (~2x at best)

Hi,

I attempted a naive multi-GPU implementation of CRYO-CARE training following @tibuch's tip here: https://twitter.com/tibuch_/status/1559161551386513409. All I did was add two lines of code to run model.train() inside a tf.distribute.MirroredStrategy() scope (sketched below).
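For reference, here is a minimal, self-contained sketch of the MirroredStrategy pattern. This is not the actual cryoCARE code; the model, data, and shapes are placeholders for illustration only (see the linked commit for the real change).

# Minimal sketch of tf.distribute.MirroredStrategy (placeholder model/data,
# not the actual cryoCARE code).
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Everything that creates variables (model, optimizer) must be built
# inside the strategy scope so it is mirrored across the GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(72, 72, 72, 1)),
        tf.keras.layers.Conv3D(8, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv3D(1, 3, padding="same"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(4e-4), loss="mse")

# Dummy even/odd patch pairs standing in for the real training data.
x = np.random.rand(16, 72, 72, 72, 1).astype("float32")
y = np.random.rand(16, 72, 72, 72, 1).astype("float32")

# The global batch size is split across the visible GPUs automatically.
model.fit(x, y, batch_size=8, epochs=1)

In cryoCARE the equivalent change is just wrapping the existing model construction and model.train() call in with strategy.scope():, which is what the two added lines in the commit do.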

Below are the training data and training settings I used for benchmarking. Tomograms are 928x928x464 voxels.

train_data_config.json

{
  "even": [
    "/scicore/home/engel0006/diogori/hdcr/201022_Jans_cells_2/9/tomo2_L1G1_even-dose_filt.rec"
  ],
  "odd": [
    "/scicore/home/engel0006/diogori/hdcr/201022_Jans_cells_2/9/tomo2_L1G1_odd-dose_filt.rec"
  ],
  "patch_shape": [
    72,
    72,
    72
  ],
  "num_slices": 2000,
  "split": 0.9,
  "tilt_axis": "Y",
  "n_normalization_samples": 500,
  "path": "./"
}

train_config.json

{
  "train_data": "./",
  "epochs": 100,
  "steps_per_epoch": 200,
  "batch_size": 16,
  "unet_kern_size": 3,
  "unet_n_depth": 3,
  "unet_n_first": 16,
  "learning_rate": 0.0004,
  "model_name": "hdcr_box72_depth3-tomo2_L1G1-distributed_a100",
  "path": "./"
}

Here are the timings using different GPU nodes on our cluster (1x means the vanilla CRYO-CARE v0.2.0 code):

1x TITAN X Pascal: 04:57:46 (17866 s)
6x TITAN X Pascal: 02:18:39 (8319 s)
Speedup: 2.15x

1x RTX-8000: 02:49:40 (10180 s)
2x RTX-8000: 01:47:57 (6477 s)
Speedup: 1.57x

1x A100: 01:14:00 (4440 s)
4x A100: 00:33:55 (2035 s)
Speedup: 2.18x

So, while it is clear that multi-GPU training with this strategy gives some speedup, it is far from scaling linearly with the number of available devices. It seems it does not make sense to use more than two or maybe three GPUs, and I am not sure where the overhead lies. I did not use an SSD to store the data; it sits on the cluster's parallel filesystem, but I am not sure whether that matters for CRYO-CARE. Of course, this is a very crude attempt at parallelization, so any input on how we can improve it would be appreciated. If anyone wants to have a look, here it is: https://github.com/rdrighetto/cryoCARE_pip/commit/99f0a6190c4a229d4e37aca43b46d28c93d086f2

Thanks!

Best wishes, Ricardo

thorstenwagner commented 2 years ago

I could imagine it's I/O bound; it depends on how cryoCARE reads the data. If it does not use multiple processes/threads, then the parallel filesystem does not help either. I will check it out, but maybe @tibuch knows off the top of his head.
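For what it's worth, a parallel, prefetching input pipeline is the usual way to keep GPUs fed in TensorFlow. The sketch below is generic and not cryoCARE's actual loader; the patch reader, count, and shapes are placeholders.

# Generic tf.data sketch (not cryoCARE's loader): read patches with several
# parallel workers and prefetch batches so the GPUs do not wait on I/O.
import numpy as np
import tensorflow as tf

def load_patch_pair(i):
    # Placeholder for reading one even/odd patch pair from disk.
    x = np.random.rand(72, 72, 72, 1).astype(np.float32)
    y = np.random.rand(72, 72, 72, 1).astype(np.float32)
    return x, y

ds = (tf.data.Dataset.range(2000)
      .map(lambda i: tf.numpy_function(load_patch_pair, [i],
                                       (tf.float32, tf.float32)),
           num_parallel_calls=tf.data.AUTOTUNE)   # parallel reads
      .batch(16)
      .prefetch(tf.data.AUTOTUNE))                # overlap I/O with training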

But anyway, having 2x speed is great ;-)

Thanks!!

rdrighetto commented 2 years ago

Some additional info: the speedup gains seem to increase with batch size (nice tip from @LorenzLamm!)

Using the same settings as before, just changing the batch size:

Batch size: 16 (previous experiment)
1x A100: 01:14:00 (4440 s)
4x A100: 00:33:55 (2035 s)
Speedup: 2.18x

Batch size: 32
1x A100: 02:18:19 (8299 s)
4x A100: 00:57:14 (3434 s)
Speedup: 2.42x

Batch size: 64
1x A100: 05:24:39 (19479 s)
4x A100: 01:43:43 (6223 s)
Speedup: 3.13x

Is it correct to assume that with a larger batch size the training should converge faster? And if so, how can I monitor that?

Thanks!

thorstenwagner commented 2 years ago

The problem is that there is actually no loss that we expect to decrease a lot, so I think it will not be possible to stop the training early by checking some validation loss. I think you need to find out empirically how many epochs you need for a given batch size.
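One generic way to compare runs empirically (not specific to cryoCARE; the model, data, and file name below are placeholders) is to log the per-epoch training and validation losses, e.g. with a Keras CSVLogger callback, and plot them for the different batch sizes afterwards.

# Sketch: log per-epoch training/validation loss to a CSV so that runs with
# different batch sizes can be compared afterwards (placeholder model/data).
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x, y = np.random.rand(256, 10), np.random.rand(256, 1)
history = model.fit(
    x, y,
    validation_split=0.1,
    epochs=5,
    batch_size=16,
    callbacks=[tf.keras.callbacks.CSVLogger("loss_batch16.csv")],
)
print(history.history["loss"], history.history["val_loss"])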

asarnow commented 1 year ago

When I try two GPUs, it hangs after printing Epoch 1/200. With 1 GPU it runs normally. (Two A6000 cards).