shanice-l / gdrnpp_bop2022

PyTorch Implementation of GDRNPP, winner (most of the awards) of the BOP Challenge 2022 at ECCV'22
Apache License 2.0
214 stars 49 forks

Distributed Training Slower #21

Open akshay-bapat-magna opened 1 year ago

akshay-bapat-magna commented 1 year ago

Hi,

I have two RTX A6000 GPUs available for training (device IDs 0 and 1). I run the GDRN training as "./core/gdrn_modeling/train_gdrn.sh 0,1". Training starts as usual but is much slower (it takes almost twice as long) than with a single GPU. The terminal also shows this warning: "[W reducer.cpp:313] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance." Note that there are no errors in the output; training is just far too slow. Can anyone help me with this issue?
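That warning usually means DDP's flat gradient buckets and the model's parameters disagree on memory layout (for example after a channels_last conversion somewhere in the pipeline). A minimal sketch of one common mitigation, using a stand-in model rather than GDRN itself: normalize everything to one memory format before wrapping the model in DistributedDataParallel.

```python
import torch
import torch.nn as nn

# Stand-in model; GDRN's actual backbone is much larger. This is only an
# illustrative sketch, not the repo's training code.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
)

# The "Grad strides do not match bucket view strides" warning often appears
# when gradients are produced in a different memory format than the
# parameters. Forcing a single format before constructing DDP keeps grad
# strides consistent with the bucket views DDP allocates.
model = model.to(memory_format=torch.contiguous_format)

# In the real training script the model would then be wrapped as e.g.:
#   ddp = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# (omitted here, since DDP requires an initialized process group).

assert all(p.is_contiguous() for p in model.parameters())
```

Whether this removes the slowdown depends on where the non-contiguous gradients actually come from; it only addresses the layout mismatch the warning describes.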

shanice-l commented 1 year ago

It's weird since we've trained the network using two 2080Ti GPUs, and the speed is 2x faster than training on a single 2080Ti.

akshay-bapat-magna commented 1 year ago

Are there any other changes required to run distributed training, apart from specifying multiple device IDs? For example, in the config file or somewhere else?

shanice-l commented 1 year ago

No extra requirement.

akshay-bapat-magna commented 1 year ago

Here is some more information: if I train a YOLO model on two GPUs, I see a big jump in speed. It is only when I train GDRNPP on two GPUs that the speed drops.

shanice-l commented 1 year ago

I suspect the problem lies with the EGL renderer. You can try generating the XYZ coordinate map offline.
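For reference, an XYZ coordinate map stores, per pixel, the 3D model-space coordinates of the visible object surface. The sketch below is a generic back-projection from a depth map, not the repo's actual offline-generation script; the function name `xyz_from_depth` and its arguments are hypothetical.

```python
import numpy as np

def xyz_from_depth(depth, K, R, t, extents):
    """Back-project a depth map to normalized model-space XYZ coordinates.

    Illustrative sketch only: depth is an (h, w) map in meters, K the 3x3
    camera intrinsics, (R, t) the object pose (model -> camera), and
    extents the model's (dx, dy, dz) bounding-box sizes used to normalize
    coordinates. Background pixels (depth == 0) are zeroed out.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Rays through each pixel, scaled by depth -> camera-frame points.
    rays = np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    pts_cam = (np.linalg.inv(K) @ rays) * depth.reshape(1, -1)
    # Invert the pose to move points from the camera frame to the model frame.
    pts_model = R.T @ (pts_cam - t.reshape(3, 1))
    xyz = (pts_model / np.asarray(extents).reshape(3, 1)).T.reshape(h, w, 3)
    xyz[depth == 0] = 0
    return xyz
```

Precomputing these maps once and loading them from disk avoids rendering them on the fly during training.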

CHDrago commented 1 year ago

Hi, could you explain how to generate the XYZ coordinate map offline?

ustblogistics87 commented 1 year ago

@shanice-l Hello, may I piggyback on this issue with a question? I'm training the ycbv dataset on two RTX 3090s with the config set to IMS_PER_BATCH=48 and TOTAL_EPOCHS=40, and the displayed training time is 14 days. Likewise, training the tless dataset on two 3090s with TOTAL_EPOCHS reduced to 8 still takes 2 days to finish, and, probably because the number of epochs is too small, the test results differ greatly from the officially provided ones. Is this kind of training time normal? Even training a single object with the tlessSO config takes a long time. Following https://github.com/shanice-l/gdrnpp_bop2022/issues/23, how much of a speedup can generating the xyz map offline provide? Looking forward to your reply.

CHDrago commented 1 year ago

Hi, I have the same problem as you: training takes a long time. Would training a separate model for each individual class shorten the time, with each per-class model then just fine-tuned?

ustblogistics87 commented 1 year ago

Hi, I have the same problem as you: training takes a long time. Would training a separate model for each individual class shorten the time, with each per-class model then just fine-tuned?

Training one model per class, taking tlesspbrSO on two 3090s as an example, shows an estimated time of about 20 hours. Presumably every object class would need its own training run.

shanice-l commented 1 year ago

The speed bottleneck is the CPU, not the GPU: organizing the data in the dataloader takes a long time, while the GPU inference time is comparatively short. You could try a better CPU or increasing num_of_workers.
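That suggestion can be sketched as follows, with a toy dataset standing in for the real GDRN dataset (all names here are illustrative): more workers let several samples be prepared in parallel on the CPU so data loading overlaps with GPU compute.

```python
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in for the GDRN dataset; the real __getitem__ does heavy
    CPU work (augmentation, coordinate-map preparation, etc.)."""
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return idx

# num_workers > 0 prepares batches in parallel worker processes;
# pin_memory speeds up host-to-GPU copies; persistent_workers avoids
# re-forking the workers at every epoch.
loader = DataLoader(
    ToyDataset(),
    batch_size=16,
    num_workers=2,
    pin_memory=True,
    persistent_workers=True,
)

n_batches = sum(1 for _ in loader)
```

The right num_workers value depends on CPU core count and per-sample cost; it is usually tuned empirically by watching GPU utilization.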

shanice-l commented 1 year ago

@shanice-l Hello, may I piggyback on this issue with a question? I'm training the ycbv dataset on two RTX 3090s with the config set to IMS_PER_BATCH=48 and TOTAL_EPOCHS=40, and the displayed training time is 14 days. Likewise, training the tless dataset on two 3090s with TOTAL_EPOCHS reduced to 8 still takes 2 days to finish, and, probably because the number of epochs is too small, the test results differ greatly from the officially provided ones. Is this kind of training time normal? Even training a single object with the tlessSO config takes a long time. Following #23, how much of a speedup can generating the xyz map offline provide? Looking forward to your reply.

Please don't piggyback questions onto an existing issue; I don't receive email notifications that way.