Closed: VladimirYugay closed this issue 3 years ago.
Hey there,
Thanks for your work. I'm trying to train the model on Slurm, and I think I'm missing something.
When I set it to train with 3 GPUs on one node:

```
#SBATCH --nodes=1
#SBATCH --gres=gpu:rtx_8000:3
```
I get the exact same speed as using 1 GPU:
```
#SBATCH --nodes=1
#SBATCH --gres=gpu:rtx_8000:1
```
either with `--distributed` or without it. The training itself is also very slow: it completes only around 20 steps (not epochs) in 8 hours.
I have around 300k images in the training set and 200k in the validation set. All the other parameters are the same.
Is there anything I can do about it?
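For context, here is a minimal sketch of the kind of single-node, 3-GPU submission I have in mind. `train.py`, the CPU count, and the `torchrun` launcher are placeholders/assumptions on my part, not necessarily how this repo launches training; only `--distributed` is the flag from this project:

```bash
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1      # one launcher task; the launcher spawns one worker per GPU
#SBATCH --gres=gpu:rtx_8000:3
#SBATCH --cpus-per-task=12       # enough CPUs for DataLoader workers, often the real bottleneck

# Hypothetical entry point: "train.py" and the torchrun launcher are my
# assumptions. If the script spawns its own worker processes when
# --distributed is set, a plain `python train.py --distributed` inside
# this same job would be the equivalent.
torchrun --standalone --nproc_per_node=3 train.py --distributed
```

If only a single process ever starts, the extra GPUs would sit idle, which might explain why I see identical speed with 1 and 3 GPUs.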