shariqfarooq123 / AdaBins

Official implementation of AdaBins: Depth Estimation using Adaptive Bins
GNU General Public License v3.0

Training on SLURM #38

Closed · VladimirYugay closed this issue 3 years ago

VladimirYugay commented 3 years ago

Hey there,

Thanks for your work. I'm trying to train the model on SLURM, and I think I'm missing something.

When I set the job to train with 3 GPUs on one node:

#SBATCH --nodes=1
#SBATCH --gres=gpu:rtx_8000:3

I get exactly the same speed as with 1 GPU:

#SBATCH --nodes=1
#SBATCH --gres=gpu:rtx_8000:1

This happens both with --distributed and without it. The training itself is also very slow: it performs about 20 steps (not epochs) in 8 hours.
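
To rule out an allocation problem, I also check from inside the job whether all three GPUs are actually visible, roughly like this (a standalone snippet, not part of AdaBins):

```python
import os
import torch

# SLURM exposes the allocated GPUs through CUDA_VISIBLE_DEVICES;
# if device_count() reports 1 here, the allocation (not the training code) is the problem.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```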

I have around 300k images in the training set and 200k in the validation set; all the other parameters are unchanged.
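
To see whether those 8 hours go into data loading or into the GPU itself, I time a few steps with something like this (a rough sketch; the batch key and the model are placeholders for the AdaBins loader and network):

```python
import time
import torch

def time_batches(loader, model, device="cuda", n_steps=20):
    # Rough check of where the time goes: `data_time` is the wait on the
    # dataloader workers, `total_time` also includes the forward pass.
    model.eval()
    end = time.time()
    for step, batch in enumerate(loader):
        data_time = time.time() - end
        image = batch["image"].to(device, non_blocking=True)  # key name assumed
        with torch.no_grad():
            model(image)
        torch.cuda.synchronize()  # flush queued GPU work so the timing is real
        total_time = time.time() - end
        print(f"step {step}: {data_time:.2f}s loading / {total_time:.2f}s total")
        end = time.time()
        if step + 1 >= n_steps:
            break
```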

Is there anything I can do about it?