rosinality / vq-vae-2-pytorch

Implementation of Generating Diverse High-Fidelity Images with VQ-VAE-2 in PyTorch

[train_vqvae] multiple gpu seems not work as expected #45

Open kimdn opened 4 years ago

kimdn commented 4 years ago

Thank you for sharing this great code.

I have used InfoVAE as a substitute for beta-VAE and the traditional VAE (beta=1). However, I think your VQ-VAE-2 reconstructs images better.

Unfortunately, when I used multiple GPUs,

#SBATCH --gres=gpu:4

python /people/kimd999/script/python/cryoEM/vq-vae-2-pytorch/train_vqvae.py /people/kimd999/MARScryo/dn/data/full/PDX/coexp/input --size 256 --n_gpu 4

it reconstructed images poorly (blank images) and didn't reduce the MSE much (mse: 0.01311 after 32 epochs).

while using a single GPU

#SBATCH --gres=gpu:1

python /people/kimd999/script/python/cryoEM/vq-vae-2-pytorch/train_vqvae.py /people/kimd999/MARScryo/dn/data/full/PDX/coexp/input --size 256

it reconstructed images better (almost identical to the input images) and reduced the MSE further (mse: 0.00583 after 12 epochs).

Consequently, using 1 GPU effectively "trains faster" in terms of quality, even though it took more wall-clock time per epoch (4 hr/epoch) than the 4-GPU run (1.4 hr/epoch).
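For what it's worth, one thing that differs between the two runs is the effective batch size: under data parallelism, each optimizer step consumes n_gpu times as many samples, so the 4-GPU run may need a learning-rate adjustment to match the 1-GPU dynamics. A minimal sketch of that bookkeeping (the batch size and base LR below are hypothetical, not taken from train_vqvae.py):

```python
# Sketch: with data-parallel training, the global batch size per optimizer
# step grows with the number of GPUs, which is one common reason multi-GPU
# runs converge differently from single-GPU runs at the same base LR.
# The per-GPU batch and base LR here are hypothetical placeholders.

def effective_batch(per_gpu_batch: int, n_gpu: int) -> int:
    """Global batch size consumed by one optimizer step under data parallelism."""
    return per_gpu_batch * n_gpu

def linearly_scaled_lr(base_lr: float, n_gpu: int) -> float:
    """Linear LR scaling rule: grow the learning rate with the batch size."""
    return base_lr * n_gpu

per_gpu_batch = 128   # hypothetical per-GPU batch size
base_lr = 3e-4        # hypothetical base learning rate tuned for 1 GPU

for n_gpu in (1, 4):
    print(f"gpus={n_gpu} "
          f"global_batch={effective_batch(per_gpu_batch, n_gpu)} "
          f"scaled_lr={linearly_scaled_lr(base_lr, n_gpu):.1e}")
```

This doesn't explain blank reconstructions on its own, but checking whether the 4-GPU run used the same base LR as the 1-GPU run is a cheap first diagnostic.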

I wonder whether you have experienced something like this as well?

rosinality commented 4 years ago

I didn't see that kind of problem. I think distributed and single-GPU training give similar results.