Open jychoi0616 opened 2 years ago
You can always choose the GPU with `CUDA_VISIBLE_DEVICES=X`, e.g.:

```
CUDA_VISIBLE_DEVICES=1 cryoCARE_predict.py --conf my_conf.json
```
I do not plan to add multi-gpu support. Do you @tibuch ?
I have not planned for it. I am more hoping for tensorflow to support multi-gpu out of the box :smile_cat:
Hi,
I attempted a naive multi-GPU implementation of CRYO-CARE training following @tibuch's tip here: https://twitter.com/tibuch_/status/1559161551386513409
All I did was add two lines of code to run `model.train()` inside a `tf.distribute.MirroredStrategy()` scope.
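The change can be sketched roughly like this (a minimal, hypothetical Keras example with a toy stand-in model, not the actual cryoCARE training code):

```python
import numpy as np
import tensorflow as tf

# Sketch of the idea: build and train the model inside a
# tf.distribute.MirroredStrategy() scope, so variables are mirrored across
# all visible GPUs and gradients are all-reduced after each step.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Toy stand-in model; in cryoCARE this would be the U-Net whose
    # model.train() call is moved inside this scope.
    model = tf.keras.Sequential([tf.keras.Input(shape=(8,)),
                                 tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# fit() automatically shards each global batch across the replicas.
x = np.random.rand(32, 8).astype("float32")
y = np.random.rand(32, 1).astype("float32")
model.fit(x, y, epochs=1, verbose=0)
```

With no GPUs visible, `MirroredStrategy` falls back to a single CPU replica, so the same code runs unchanged on any machine.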
Below are the training data and training settings I used for benchmarking. Tomograms are 928x928x464 voxels.
train_data_config.json:

```json
{
  "even": [
    "/scicore/home/engel0006/diogori/hdcr/201022_Jans_cells_2/9/tomo2_L1G1_even-dose_filt.rec"
  ],
  "odd": [
    "/scicore/home/engel0006/diogori/hdcr/201022_Jans_cells_2/9/tomo2_L1G1_odd-dose_filt.rec"
  ],
  "patch_shape": [72, 72, 72],
  "num_slices": 2000,
  "split": 0.9,
  "tilt_axis": "Y",
  "n_normalization_samples": 500,
  "path": "./"
}
```
train_config.json:

```json
{
  "train_data": "./",
  "epochs": 100,
  "steps_per_epoch": 200,
  "batch_size": 16,
  "unet_kern_size": 3,
  "unet_n_depth": 3,
  "unet_n_first": 16,
  "learning_rate": 0.0004,
  "model_name": "hdcr_box72_depth3-tomo2_L1G1-distributed_a100",
  "path": "./"
}
```
Here are the timings using different GPU nodes on our cluster (1x means vanilla CRYO-CARE v0.2.0 code):

| GPUs | Time | Speedup |
|---|---|---|
| 1x TITAN X Pascal | 04:57:46 (17866 s) | – |
| 6x TITAN X Pascal | 02:18:39 (8319 s) | 2.15x |
| 1x RTX-8000 | 02:49:40 (10180 s) | – |
| 2x RTX-8000 | 01:47:57 (6477 s) | 1.57x |
| 1x A100 | 01:14:00 (4440 s) | – |
| 4x A100 | 00:33:55 (2035 s) | 2.18x |
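The timings above can also be expressed as parallel efficiency (speedup divided by GPU count), which makes the sub-linear scaling explicit; a quick calculation from the reported numbers:

```python
# (1x time in seconds, Nx time in seconds, number of GPUs) per node type,
# taken from the benchmark timings reported above.
timings = {
    "TITAN X Pascal": (17866, 8319, 6),
    "RTX-8000": (10180, 6477, 2),
    "A100": (4440, 2035, 4),
}

for gpu, (t1, tn, n) in timings.items():
    speedup = t1 / tn
    efficiency = speedup / n  # 1.0 would be perfect linear scaling
    print(f"{gpu}: {speedup:.2f}x speedup on {n} GPUs, efficiency {efficiency:.0%}")
```

The 2-GPU RTX-8000 run has the best per-device efficiency, consistent with the observation that going beyond 2–3 GPUs pays off less and less.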
So, while it is clear that multi-GPU training with this strategy gives some speedup, it is far from scaling linearly with the number of available devices; it seems it does not make sense to use more than 2, or maybe 3, GPUs. I am not sure where the overhead lies. I did not use an SSD to store the data; it sits on the cluster's parallel filesystem, but I am not sure that matters for CRYO-CARE. Of course, this is a very crude attempt at parallelization, so any input on how we can improve it would be appreciated. If anyone wants to have a look, here it is: https://github.com/rdrighetto/cryoCARE_pip/commit/99f0a6190c4a229d4e37aca43b46d28c93d086f2

Thanks!
Best wishes, Ricardo
I could imagine it's I/O bound; it depends on how cryoCARE reads the data. If it does not use multiple processes/threads, then the parallel filesystem does not help either. I will check it out, but maybe @tibuch knows off the top of his head.

But anyway, having a 2x speedup is great ;-)
Thanks!!
Some additional info: the speedup gains seem to increase with batch size (nice tip from @LorenzLamm!). Using the same settings as before, just changing the batch size:

| Batch size | 1x A100 | 4x A100 | Speedup |
|---|---|---|---|
| 16 (previous experiment) | 01:14:00 (4440 s) | 00:33:55 (2035 s) | 2.18x |
| 32 | 02:18:19 (8299 s) | 00:57:14 (3434 s) | 2.42x |
| 64 | 05:24:39 (19479 s) | 01:43:43 (6223 s) | 3.13x |
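One way to compare these runs on equal footing is samples-per-second throughput: since `epochs` (100) and `steps_per_epoch` (200) are fixed in train_config.json, each run sees `100 * 200 * batch_size` training samples, so dividing by the wall time gives throughput directly:

```python
# (batch_size, 1x A100 seconds, 4x A100 seconds) from the table above.
runs = [
    (16, 4440, 2035),
    (32, 8299, 3434),
    (64, 19479, 6223),
]

EPOCHS, STEPS = 100, 200  # fixed in train_config.json

for batch, t1, t4 in runs:
    samples = EPOCHS * STEPS * batch  # total training samples processed
    print(f"batch {batch}: 1x A100 {samples / t1:.0f} samples/s, "
          f"4x A100 {samples / t4:.0f} samples/s")
```

Note that because `steps_per_epoch` is fixed, a larger batch size also means more total samples per run, which is why the absolute run times grow with batch size even as multi-GPU efficiency improves.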
Is it correct to assume that with a larger batch size the training should converge faster? And if so, how can one monitor that?
Thanks!
The problem is that there is no loss we actually expect to decrease a lot, so I don't think it will be possible to early-stop the training by checking some validation loss. You will need to find out empirically how many epochs you need for a given batch size.
When I try two GPUs, it hangs after printing `Epoch 1/200`. With 1 GPU it runs normally. (Two A6000 cards.)
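Not from this thread, but one workaround sometimes suggested for multi-GPU hangs at the start of the first epoch is to swap `MirroredStrategy`'s default NCCL all-reduce for hierarchical copy; whether this applies to the A6000 setup here is an assumption:

```python
import tensorflow as tf

# Hypothetical workaround sketch: hangs at "Epoch 1/..." with multiple GPUs
# are sometimes caused by the NCCL all-reduce; HierarchicalCopyAllReduce
# performs the cross-device reduction via host copies instead.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)
print("Replicas in sync:", strategy.num_replicas_in_sync)
```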
Hi Tim-Oliver,
Many thanks for developing cryoCARE! Could you implement parallelization over multiple GPUs, and/or an option to choose which GPU to use for the computation, please (when you have some spare time)?
Many thanks, Joy