juglab / n2v

This is the implementation of Noise2Void training.

Multi gpu training #69

Open psteinb opened 4 years ago

psteinb commented 4 years ago

This needs a bit more testing, but I think going multi-GPU is fairly straightforward. Or did you try that already?
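For context, the mechanism Keras 2.2.x offers for this is presumably keras.utils.multi_gpu_model, which replicates a model across GPUs and splits each batch between the replicas. A minimal sketch with a stand-in model (the tiny dense net below is just a placeholder, not n2v's U-Net):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Stand-in model; n2v would pass its U-Net here instead.
base_model = Sequential([Dense(16, activation='relu', input_shape=(8,)),
                         Dense(1)])

# Replicate across 2 GPUs; the sub-batch results are merged after each step.
parallel_model = multi_gpu_model(base_model, gpus=2)
parallel_model.compile(optimizer='adam', loss='mse')

# Keras splits each batch across the replicas, so batch_size=128 on
# 2 GPUs means each GPU processes sub-batches of 64 samples.
X = np.random.rand(256, 8).astype('float32')
y = np.random.rand(256, 1).astype('float32')
parallel_model.fit(X, y, batch_size=128, epochs=1)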

psteinb commented 4 years ago

Almost there. Apparently there is a problematic interplay between tf and keras: https://github.com/tensorflow/tensorflow/issues/30728 https://github.com/keras-team/keras/issues/13057 https://github.com/keras-team/keras/pull/13255 I need to check how to fix this.

psteinb commented 4 years ago

Done implementing multi-GPU training. I hope putting it into the constructor of N2V was the right choice. I also added an example notebook, examples/2D/denoising2D_BSD68/BSD68_reproducibility_multi_gpu.ipynb, derived from the existing BSD68 example.
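With the change in place, training would be configured roughly as sketched below (assuming the train_num_gpus option discussed later in this thread; the model name and dummy arrays are placeholders, not the notebook's actual data):

import numpy as np
from n2v.models import N2VConfig, N2V

# Dummy patches so the snippet is self-contained; the notebook uses
# real BSD68 training patches instead.
X_train = np.random.rand(64, 128, 128, 1).astype('float32')

# train_num_gpus is the new option from this PR branch (not in a
# released n2v version at the time of this thread).
config = N2VConfig(X_train, train_batch_size=128, train_num_gpus=2)
model = N2V(config, 'n2v_2D_multi_gpu_demo', basedir='models')
model.train(X_train, X_train[:8])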

I'll supply more extensive numbers later; my current estimate for training n2v from this notebook is:

I'll provide 4-GPU numbers later. Note that this "improvement" is expected to be non-linear, as Keras internally parallelizes the batches: a batch size of 128 will be split into 2 sub-batches of 64 images. As discussed earlier, this approach is currently not supported with tf 1.14 and keras 2.2.{4,5} due to the bugs mentioned above.

Would love to hear your feedback on this.

tibuch commented 4 years ago

Thank you for this PR!

I have this on my to-do list, but wasn't able to get my hands on a multi-GPU system. I guess the cluster should work for testing.

Although I am very confident that it just works, I would like to test it as well :)

psteinb commented 4 years ago

Thanks for having a look. Last time I checked, all GPU configs with >=3 GPUs fail to run due to some problems with the Keras data augmentations. Maybe this could be alleviated by looking into bringing n2v 100% over to tf.keras?
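For reference, the tf.keras route would presumably go through tf.distribute.MirroredStrategy (already available in TF 1.14), which replaces keras.utils.multi_gpu_model entirely. A minimal sketch with a stand-in model, not n2v's actual network:

import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# splits each batch between the replicas.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

X = np.random.rand(256, 8).astype('float32')
y = np.random.rand(256, 1).astype('float32')
model.fit(X, y, batch_size=128, epochs=1)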

snehashis-roy commented 4 years ago

Hi, I want to use 2 GPUs for training. As explained in the notebook, I used the following config:

config = N2VConfig(X_train, unet_kern_size=3, unet_n_depth=3, unet_n_first=64,
                   train_steps_per_epoch=int(dim[0] / 128), train_epochs=50,
                   train_loss='mse', batch_norm=True, train_num_gpus=2,
                   train_batch_size=64, n2v_perc_pix=1.0, n2v_patch_shape=(128, 128),
                   n2v_manipulator='uniform_withCP', n2v_neighborhood_radius=5)

I set CUDA_VISIBLE_DEVICES to 1,2 before running the training and installed N2V with pip install n2v. My versions are TF-GPU 1.14.1, keras 2.2.5, and numpy 1.19.1.

The training still uses 1 GPU. Please let me know what I am missing.
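As a generic aside worth checking in situations like this: when CUDA_VISIBLE_DEVICES is set from inside Python rather than exported in the shell, it only takes effect if it is assigned before TensorFlow or Keras is imported, e.g.:

import os

# Must run before `import tensorflow` / `import keras`; otherwise the
# CUDA devices may already be initialized and the setting is ignored.
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'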

tibuch commented 4 years ago

Hi @piby2,

This functionality is not part of the official N2V release yet.

If you would like to test it, you have to clone the fork psteinb/n2v and check out the branch multi_gpu_training. Then you can run pip install . from inside the git repo and this version will be installed.