Training takes a long time?

wasidennis / AdaptSegNet

Learning to Adapt Structured Output Space for Semantic Segmentation, CVPR 2018 (spotlight)


Training takes a long time? #76

Closed (defqoon closed 4 years ago)

defqoon commented 4 years ago

Hi,

I've been trying to replicate the GTA -> Cityscapes results on an AWS instance (p2.xlarge). I am using Docker with CUDA 8.0 and PyTorch 0.4.1, running the following:

python train_gta2cityscapes_multi.py --snapshot-dir ./snapshots/GTA2Cityscapes_single_lsgan \
                                     --lambda-seg 0.0 \
                                     --lambda-adv-target1 0.0 --lambda-adv-target2 0.01 \
                                     --gan LS

It takes about 5.7 seconds per iteration. Given that the model converges at 120k iterations, it's gonna take me more than a week to train it, which sounds insane. Is there something wrong here or are those the expected times?
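For reference, the estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: the 5.7 s/iteration and 120k iterations are the figures quoted above, the helper name `estimate_training_days` is made up for illustration, and it assumes a constant time per iteration with no checkpointing or evaluation overhead.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_training_days(seconds_per_iter, num_iters):
    """Return wall-clock days, assuming a constant time per iteration."""
    return seconds_per_iter * num_iters / SECONDS_PER_DAY


# Figures quoted in the comment above: 5.7 s/iteration, 120k iterations.
days_k80 = estimate_training_days(5.7, 120_000)
print(f"{days_k80:.1f} days")  # 7.9 days
```

So the "more than a week" figure is consistent with the measured per-iteration time, if only by a small margin.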

wasidennis commented 4 years ago

Which GPU are you using on AWS? We used a Titan X, and training should finish in about 3 days.

defqoon commented 4 years ago

Tesla K80. I can't use more recent GPUs on AWS (e.g., Tesla V100) because of the CUDA requirements.

defqoon commented 4 years ago

Have you experimented with more recent versions of PyTorch / CUDA? I am willing to update the code to PyTorch 1.4 and CUDA 10.1 if you'd like.

wasidennis commented 4 years ago

The issue is not the PyTorch and CUDA versions; the K80 is simply much slower. One option is to use multi-GPU training to reduce the time (this requires slightly modifying the code).

hfslyc commented 4 years ago

One potential issue with cloud training could be data loading. Ideally, with PyTorch's parallel data loader, this line: https://github.com/wasidennis/AdaptSegNet/blob/025a7f54dca681e30fe02327bac46c19dfd8c27c/train_gta2cityscapes_multi.py#L295 should take almost no time. Could you also check how long this line (and the other data loader) takes?
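One way to run that check is to time only the wait for each batch, separately from the training step. The sketch below is an assumption-laden illustration, not code from the repository: `time_batches` and `slow_loader` are hypothetical names, and `slow_loader` is a stand-in generator so the example runs without PyTorch or the datasets. In practice you would pass in whatever iterable the training script pulls batches from.

```python
import time


def time_batches(loader, num_batches=20):
    """Return the seconds spent waiting for each of the first `num_batches` batches."""
    waits = []
    batch_iter = iter(loader)
    for _ in range(num_batches):
        start = time.perf_counter()
        try:
            next(batch_iter)  # only the fetch is timed, not any training work
        except StopIteration:
            break  # the loader ran out before num_batches was reached
        waits.append(time.perf_counter() - start)
    return waits


def slow_loader(n=5, delay=0.01):
    """Hypothetical stand-in for a data loader: each batch takes `delay` seconds."""
    for i in range(n):
        time.sleep(delay)
        yield i


waits = time_batches(slow_loader(), num_batches=5)
print(f"mean wait per batch: {sum(waits) / len(waits):.4f} s")
```

If the mean wait is a large fraction of the per-iteration time, data loading is the likely bottleneck. If it is close to zero, the time is being spent elsewhere, most likely on the GPU. The first batch is often slower than the rest when worker processes are starting up, so it may be worth looking at the later measurements separately.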

defqoon commented 4 years ago

Yeah, that's definitely not the bottleneck. I moved to a p3.2xlarge, which has a Tesla V100 GPU, and with PyTorch 1.0 and CUDA 9.0 I am getting 0.8 seconds per iteration, so about 7x faster than with the K80. Training should only take a day or so.