Closed valsworthen closed 5 years ago
I also use a Tesla P100 with 16GB memory, but my GPU-Utilization is 66%.
I am using GTX1080TI, GPU-Utilization is 86%, about 780ms/batch.
@valsworthen given the other two reports here, I would suggest looking into other issues than the GPU: pytorch version, I/O delays (especially file system delays, which could vary a lot depending on your specific setup), among others :)
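One quick way to tell whether I/O (rather than the GPU) is the bottleneck is to time data loading separately from the training step and look at the ratio. Here is a minimal, framework-free sketch of that idea; `load_batch` and `train_step` are placeholder stand-ins for your actual data pipeline and forward/backward pass:

```python
import time

def load_batch():
    # Placeholder for your data pipeline (file reads, preprocessing, ...).
    time.sleep(0.01)
    return [0] * 40  # dummy minibatch

def train_step(batch):
    # Placeholder for the forward/backward pass on the GPU.
    time.sleep(0.002)

data_time = 0.0
step_time = 0.0
for _ in range(20):
    t0 = time.perf_counter()
    batch = load_batch()
    t1 = time.perf_counter()
    train_step(batch)
    t2 = time.perf_counter()
    data_time += t1 - t0
    step_time += t2 - t1

frac = data_time / (data_time + step_time)
print(f"fraction of time spent waiting on data: {frac:.0%}")
```

If that fraction is high, the GPU is starved and the fix is on the input side (faster storage, more loader workers, prefetching), not in the model code.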
May I ask for the total time you take to train the baseline? Hope to get an estimation before I start. Thank you!
Training speed could vary greatly given the specific setup of your hardware/infrastructure, but here's another data point: on a Titan Xp, I'm able to achieve a GPU utilization of 60-70% on average, and training speed is about 1500ms/batch (minibatch size of 40).
This works out to roughly 1-2 hours per checkpoint, and depending on your training schedule, the number of actual checkpoints could vary. For instance, the default hyperparameters will take at least several checkpoints to stop training (at least 3-4 hrs).
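For a rough back-of-the-envelope estimate, the per-batch speed above converts to wall-clock time per checkpoint like this; `steps_per_checkpoint` is a hypothetical example value, so substitute your own checkpoint interval:

```python
ms_per_batch = 1500          # measured speed on a Titan Xp (from above)
steps_per_checkpoint = 4000  # hypothetical interval; substitute your own config's value

# ms -> seconds per step, times steps, converted to hours
hours = ms_per_batch / 1000 * steps_per_checkpoint / 3600
print(f"~{hours:.1f} h per checkpoint")  # ~1.7 h with these numbers
```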
Hello,
I am trying to run your code on a Tesla P100, and it takes more than an hour to compute 1000 steps of the first epoch. I noticed that the 16 GB of GPU memory are completely used, but the "GPU-Utilization" reported by `nvidia-smi` is at only 20%, which suggests a serious optimization problem. Is that normal, or am I missing something? Thanks.