hotpotqa / hotpot


Very slow training? #7

Closed valsworthen closed 5 years ago

valsworthen commented 5 years ago

Hello,

I am trying to run your code on a Tesla P100 and it takes more than an hour to compute 1000 steps of the first epoch. I noticed that the 16 GB of GPU memory are completely used, but the GPU utilization reported by nvidia-smi is only around 20%, which suggests a serious optimization problem. Is that normal, or am I missing something?

Thanks.
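(For anyone trying to reproduce this measurement: one quick way is to poll nvidia-smi periodically while training runs in another process, so you can see whether utilization is consistently low or just dips momentarily. This is just a sketch, not part of the hotpot codebase, and it assumes nvidia-smi is on the PATH.)

```python
import subprocess
import time

# Standard nvidia-smi query flags; adjust the sampling interval as needed.
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

for _ in range(60):                            # sample for about 5 minutes
    reading = subprocess.check_output(QUERY, text=True).strip()
    print(time.strftime("%H:%M:%S"), reading)  # e.g. "20, 16280" = 20 %, 16280 MiB
    time.sleep(5)
```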

jiangkun1994 commented 5 years ago

I also use a Tesla P100 with 16 GB of memory, but my GPU utilization is 66%.

Vimos commented 5 years ago

I am using a GTX 1080 Ti; GPU utilization is 86%, at about 780 ms/batch.

qipeng commented 5 years ago

@valsworthen given the other two reports here, I would suggest looking into issues other than the GPU: the PyTorch version, I/O delays (especially file system delays, which can vary a lot depending on your specific setup), among others :)
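One way to tell whether a run is I/O-bound or GPU-bound is to time the data-loading wait separately from the forward/backward pass. The sketch below is generic PyTorch, not the hotpot training loop; the dataset and model are placeholders you would swap for the real ones. If the "data wait" number dominates, the bottleneck is the loader or the file system rather than the GPU.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data and model; substitute the real DataLoader and model here.
dataset = TensorDataset(torch.randn(4000, 512), torch.randint(0, 2, (4000,)))
loader = DataLoader(dataset, batch_size=40, num_workers=4, pin_memory=True)
model = torch.nn.Linear(512, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

data_time = compute_time = 0.0
end = time.time()
for step, (x, y) in enumerate(loader):
    t0 = time.time()
    data_time += t0 - end                      # time spent waiting on the loader
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()               # make the GPU timing accurate
    compute_time += time.time() - t0           # forward/backward/step time
    end = time.time()

n = step + 1
print(f"avg data wait: {data_time / n * 1000:.0f} ms/batch")
print(f"avg compute:   {compute_time / n * 1000:.0f} ms/batch")
```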

IrisQY commented 5 years ago

May I ask how long it takes in total to train the baseline? I'd like to get an estimate before I start. Thank you!

qipeng commented 5 years ago

Training speed can vary greatly depending on your specific hardware/infrastructure, but here's another data point: on a Titan Xp, I'm able to reach a GPU utilization of 60-70% on average, and training speed is about 1500 ms/batch (minibatch size of 40).

This means roughly 1-2 hours per checkpoint, and depending on your training schedule, the number of actual checkpoints can vary. For instance, the default hyperparameters will take at least a few checkpoints before training stops (at least 3-4 hrs).
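For a sense of scale, the 1500 ms/batch figure quoted above works out to roughly 25 minutes per 1000 steps, versus the more-than-an-hour per 1000 steps in the original report. A quick back-of-the-envelope check, using only the numbers quoted in this thread:

```python
# Rough conversion of the figures quoted in this thread; these are the
# commenters' reported numbers, not fresh measurements.
ms_per_batch = 1500      # Titan Xp speed reported above
batch_size = 40          # minibatch size reported above
steps = 1000             # the unit the original report used

minutes_per_1000_steps = ms_per_batch * steps / 1000 / 60
examples_per_second = batch_size / (ms_per_batch / 1000)
print(f"~{minutes_per_1000_steps:.0f} min per 1000 steps")  # ~25 min
print(f"~{examples_per_second:.1f} examples/s")             # ~26.7 examples/s
```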