Hm. It seems to have stabilized at between 5 and 6 it/s.
Completed at 5 it/s.
If you see GPU utilization dropping to 0% at times, you are probably bottlenecked by the hard disk or CPU, which may not be able to feed data to the GPU as fast as the GPU processes it. With a GPU like that you should be getting more iterations per second. I suggest:
I also believe you can run the code with PyTorch 2 and newer CUDA versions without changing it, which may give you a slight additional speedup.
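Not knowing the exact structure of your training script, here is a minimal sketch of how one could confirm an input-pipeline bottleneck: time how long each iteration waits on the DataLoader versus how long the GPU step takes. The names `loader`, `model`, `criterion`, and `optimizer` are placeholders for whatever the script actually defines:

```python
import time
import torch

def timed_epoch(loader, model, criterion, optimizer, device="cuda"):
    """Report time spent waiting on the DataLoader vs. computing on the GPU."""
    data_time, step_time = 0.0, 0.0
    end = time.perf_counter()
    for images, targets in loader:
        t0 = time.perf_counter()
        data_time += t0 - end                  # waiting on the DataLoader
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()               # flush async CUDA work before timing
        end = time.perf_counter()
        step_time += end - t0                  # host-to-device copy + forward/backward
    print(f"data wait: {data_time:.1f}s, compute: {step_time:.1f}s")
```

If the data-wait total dominates, the disk/CPU pipeline is the culprit rather than the GPU.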
I'm using an NVIDIA A100 (40 GB VRAM) instance -- a powerful GPU. I'm running CUDA 11.3 with PyTorch 1.13 on Python 3.10, as you did. I have the following data config:
The only thing changed here is `num_workers_train|teste`, because PyTorch's DataLoader warns that 20 concurrent workers will jam processing and dynamically recommends 12 workers based on my compute specs.
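For what it's worth, the usual way to respect that warning is to cap the worker count at the machine's CPU count. A sketch, with `dataset` and the batch size as placeholders for the config's actual values:

```python
import os
from torch.utils.data import DataLoader

# Cap workers at 12 (the suggested max) or the CPU count, whichever is lower.
num_workers = min(12, os.cpu_count() or 1)

loader = DataLoader(
    dataset,                   # placeholder for the repo's dataset object
    batch_size=32,             # placeholder; use the config's value
    shuffle=True,
    num_workers=num_workers,
    pin_memory=True,           # faster host-to-GPU copies
    persistent_workers=True,   # avoid re-forking workers every epoch (PyTorch >= 1.7)
)
```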
Initially I was getting an epoch throughput of 10 it/s; now, from epoch 20 onward, I'm getting 5 it/s. I've done some reading on this, and there seem to be two distinct possibilities. The first is that some data structure is being appended to and/or scanned on every iteration, gradually grinding the program to a halt. The second is that a custom loss or network function becomes more expensive later in training. Do you have an opinion here? Did you notice any slowdown during training?
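Without seeing the codebase I can only illustrate the first possibility, but the classic instance of it in PyTorch is logging loss tensors instead of plain floats: each stored tensor keeps its whole autograd graph alive, so memory grows and per-iteration bookkeeping gets slower as training proceeds. A self-contained toy example:

```python
import torch

w = torch.randn(512, 512, requires_grad=True)

losses_bad, losses_ok = [], []
for step in range(100):
    loss = (w @ torch.randn(512, 512)).pow(2).mean()

    # Anti-pattern: storing the tensor itself retains its entire autograd
    # graph (all intermediate tensors), so memory grows every iteration.
    losses_bad.append(loss)

    # Fix: .item() (or .detach()) drops the graph and stores a plain float.
    losses_ok.append(loss.item())
```

If the training loop stores `loss` (or any tensor that requires grad) in a list for logging, switching to `loss.item()` is often the whole fix.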
I have not changed anything at all except `ROOT_DIR`, the file paths for training, and `num_workers_train|teste`, as mentioned above. Here's my `nvidia-smi` output:

Here's my `free -m` output:

Here's my `top` output:
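As an aside, the same utilization numbers `nvidia-smi` reports can be polled from Python via the `nvidia-ml-py` bindings, which makes it easy to log them alongside the epoch counter and see exactly when the drop sets in. A small sketch:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
for _ in range(60):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {util.gpu}% | mem {mem.used / 2**20:.0f} MiB")
    time.sleep(1.0)
pynvml.nvmlShutdown()
```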