mehdidc / feed_forward_vqgan_clip

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt
MIT License

Slow Training Speed #21

Open · s13kman opened 2 years ago

s13kman commented 2 years ago

Hi, first of all, great work! I really loved it. To replicate the results, I tried training on the Conceptual 12M dataset with the same depth and dims as the pretrained models, but training was too slow: even after 4 days it was still going through the first (0th) epoch. I'm training on an NVIDIA Quadro RTX A6000, which I don't think is that slow. Any suggestions to improve the training speed? I have multi-GPU access, but it seems that isn't supported right now. Thanks!

mehdidc commented 2 years ago

Hi @s13kman, thanks for your interest! I would suggest using multi-GPU training to speed things up, since you have access to multiple GPUs. Multi-GPU is actually already supported through Horovod (https://github.com/horovod/horovod). Once you install Horovod, you basically don't need to change much, something like:

horovodrun -np number_of_gpus python main.py your_config_file.yaml
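For reference, adding Horovod data parallelism to a PyTorch training script typically only takes a few lines. Below is a minimal sketch of the usual Horovod/PyTorch pattern; the toy model and the learning-rate scaling by world size are illustrative assumptions, not taken from this repo's main.py:

```python
# Minimal sketch of Horovod data-parallel training in PyTorch.
# Launched with: horovodrun -np <number_of_gpus> python train.py
import torch
import horovod.torch as hvd

hvd.init()                               # one process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin this process to its GPU

# Toy model as a placeholder for the actual network
model = torch.nn.Linear(512, 512).cuda()

# Common Horovod convention: scale the learning rate by the number of workers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers each step
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)

# Ensure all workers start from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Each worker should also read a distinct shard of the data, e.g. with
# torch.utils.data.distributed.DistributedSampler(
#     dataset, num_replicas=hvd.size(), rank=hvd.rank())
```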

Given that the dataset is relatively big, I usually train the models for only a single epoch.

CrossLee1 commented 2 years ago

How long did it take you to train only a single epoch?

mehdidc commented 2 years ago

Hi @CrossLee1, sorry for the late reply. It takes around 6 hours, but I train the models on 64 A100 GPUs (data parallel with Horovod) to speed up the process. I am quite sure there are a lot of things to optimize here in terms of hardware usage; I was mostly going for fast experiments (short walltime) to figure out what works best (in terms of architecture, data augmentation, losses, etc.) rather than optimizing the training speed.