shariqfarooq123 / AdaBins

Official implementation of AdaBins: Depth Estimation using Adaptive Bins
GNU General Public License v3.0

Training on SLURM #38

Closed · VladimirYugay closed this issue 3 years ago

VladimirYugay commented 3 years ago

Hey there,

Thanks for your work. I'm trying to train the model on SLURM, and I think I'm missing something.

When I set the job to train with 3 GPUs on one node:

#SBATCH --nodes=1
#SBATCH --gres=gpu:rtx_8000:3

I get exactly the same speed as with 1 GPU:

#SBATCH --nodes=1
#SBATCH --gres=gpu:rtx_8000:1

This happens both with --distributed and without it. The training itself is also very slow: it performs about 20 steps (not epochs) in 8 hours.
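
To rule out an allocation problem, I also check from inside the job whether all three GPUs are actually visible, roughly like this (a standalone snippet, not part of AdaBins):

```python
import os
import torch

# SLURM exposes the allocated GPUs through CUDA_VISIBLE_DEVICES;
# if device_count() reports 1 here, the allocation (not the training code) is the problem.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```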

I have around 300k images in the training set and 200k in the validation set; all the other parameters are unchanged.
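
To see whether those 8 hours go into data loading or into the GPU itself, I time a few steps with something like this (a rough sketch; the batch key and the model are placeholders for the AdaBins loader and network):

```python
import time
import torch

def time_batches(loader, model, device="cuda", n_steps=20):
    # Rough check of where the time goes: `data_time` is the wait on the
    # dataloader workers, `total_time` also includes the forward pass.
    model.eval()
    end = time.time()
    for step, batch in enumerate(loader):
        data_time = time.time() - end
        image = batch["image"].to(device, non_blocking=True)  # key name assumed
        with torch.no_grad():
            model(image)
        torch.cuda.synchronize()  # flush queued GPU work so the timing is real
        total_time = time.time() - end
        print(f"step {step}: {data_time:.2f}s loading / {total_time:.2f}s total")
        end = time.time()
        if step + 1 >= n_steps:
            break
```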

Is there anything I can do about it?