Closed: defqoon closed this issue 4 years ago.
What GPU are you using on AWS? We used a Titan X, and training should finish in about 3 days.
Tesla K80. I can't use more recent GPUs on AWS (e.g., Tesla V100) because of the CUDA requirements.
Have you experimented with more recent versions of PyTorch / CUDA? I am willing to update the code to PyTorch 1.4 and CUDA 10.1 if you'd like.
The issue is not the PyTorch and CUDA versions; the K80 itself is just much slower. One way to reduce the training time is multi-GPU training (the code needs to be slightly modified), along the lines of the sketch below.
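A minimal sketch of that modification, assuming `torch.nn.DataParallel` is acceptable; the placeholder network and batch size below are illustrative stand-ins, not the actual DeeplabMulti model or the script's defaults:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for the segmentation network built in
# train_gta2cityscapes_multi.py (not the real DeeplabMulti model).
model = nn.Conv2d(3, 19, kernel_size=3, padding=1)

# Split each batch across all visible GPUs. This only pays off if the batch
# size is raised so that every GPU receives at least one sample.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

# The training step itself is unchanged: DataParallel scatters the inputs and
# gathers the per-GPU predictions automatically.
images = torch.randn(4, 3, 512, 512).cuda()
labels = torch.zeros(4, 512, 512, dtype=torch.long).cuda()
loss = F.cross_entropy(model(images), labels)
loss.backward()
```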
One potential issue with cloud training could be the data loading. Ideally, with PyTorch's parallel data loader, this line should take almost no time: https://github.com/wasidennis/AdaptSegNet/blob/025a7f54dca681e30fe02327bac46c19dfd8c27c/train_gta2cityscapes_multi.py#L295 Could you also check how much time that line (and the other data loader) takes, for example with something like the sketch below?
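A quick way to measure it, sketched here with a toy loader; in the actual script the same timing would wrap the `trainloader_iter` / `targetloader_iter` calls in the main loop (those iterator names come from the training script, the dataset below is just a stand-in):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the GTA5 / Cityscapes loaders; in the real
# script, wrap the data-loading call at the linked line instead.
loader = DataLoader(TensorDataset(torch.randn(16, 3, 64, 64)),
                    batch_size=1, num_workers=4)
loader_iter = enumerate(loader)

for step in range(8):
    start = time.time()
    _, batch = next(loader_iter)   # the data-loading call being measured
    print('step {}: data loading took {:.4f}s'.format(step, time.time() - start))
```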
Yeah, that's definitely not the bottleneck. I moved to a p3.2xlarge, which has a Tesla V100 GPU, and with PyTorch 1.0 and CUDA 9.0 I am getting 0.8 seconds per iteration, about 7x faster than with the K80. Training should only take a day or so.
Hi,
I have been trying to replicate the GTA -> Cityscapes results on an AWS instance (p2.xlarge). I am using Docker with CUDA 8.0 and PyTorch 0.4.1, and running the multi-level training script along the following lines:
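Roughly the repo's default multi-level training command; the data paths here are placeholders and the exact flag values are only indicative, not necessarily the ones used:

```bash
python train_gta2cityscapes_multi.py \
    --data-dir /data/GTA5 \
    --data-dir-target /data/Cityscapes \
    --snapshot-dir ./snapshots/GTA2Cityscapes_multi \
    --lambda-seg 0.1 \
    --lambda-adv-target1 0.0002 \
    --lambda-adv-target2 0.001
```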
It takes about 5.7 seconds per iteration. Given that the model converges at around 120k iterations, it is going to take me more than a week to train, which sounds excessive. Is there something wrong here, or are these the expected times?