videturfortuna / vehicle_reid_itsc2023


decrease in iteration speed #14

Closed jsteinberg-rbi closed 6 months ago

jsteinberg-rbi commented 6 months ago

I'm using an Nvidia A100 instance with 40 GB of VRAM -- a powerful GPU. I'm running CUDA 11.3 with PyTorch 1.13 on Python 3.10, as you did. I have the following data config:

##DATA
ROOT_DIR: /home/jsteinberg/vehicle_reid_itsc2023/VeRI-Wild/images/images
BATCH_SIZE: 128 #48
NUM_INSTANCES: 4 #8
num_workers_train: 12 #20 #8
num_workers_teste: 12 #20

The only things I changed here are num_workers_train and num_workers_teste, because PyTorch's DataLoader warns that 20 concurrent workers would jam processing on this machine and it dynamically recommends 12 workers based on my compute specs.
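
For reference, this is roughly how I understand BATCH_SIZE and num_workers_train to be consumed on the PyTorch side (a self-contained sketch with a dummy dataset, not the repo's actual loader code):

# Minimal sketch of how BATCH_SIZE / num_workers_train typically reach a
# PyTorch DataLoader. The dataset below is a dummy stand-in for VeRI-Wild,
# not the repo's real dataset class.
import torch
from torch.utils.data import DataLoader, TensorDataset

dummy_images = torch.randn(1024, 3, 224, 224)
dummy_labels = torch.randint(0, 10, (1024,))
dataset = TensorDataset(dummy_images, dummy_labels)

loader = DataLoader(
    dataset,
    batch_size=128,    # BATCH_SIZE
    num_workers=12,    # num_workers_train; PyTorch warns when this exceeds its suggested max
    pin_memory=True,   # faster host-to-GPU copies
    shuffle=True,
)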

Initially I was getting a throughput of about 10 it/s. Now, on epoch 20 and later, I'm getting 5 it/s. I've been doing some reading on this and there seem to be two distinct possibilities. The first is that some data structure is being appended to and/or scanned on every iteration, which gradually grinds the program to a halt. The second is that a custom loss or network function gets more expensive later in training. Do you have any opinion here? Did you notice a slowdown during training?
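
To narrow it down on my side, I've been thinking of timing the data-loading wait separately from the full step, roughly like this (a sketch around an assumed training-loop shape, not the repo's actual trainer):

# Rough timing sketch (assumed loop shape, not the repo's trainer): if data_time
# grows while the rest of the step stays flat, the DataLoader / disk is the culprit;
# if total step time grows with data_time flat, the loss/model side got slower.
import time
import torch

data_time, step_time, LOG_EVERY = 0.0, 0.0, 100
end = time.time()
for i, (images, labels) in enumerate(loader):   # `loader` as in the sketch above
    data_time += time.time() - end              # time spent waiting on the DataLoader
    images = images.cuda(non_blocking=True)

    # ... forward pass, loss, backward, optimizer step would go here ...

    torch.cuda.synchronize()                    # so GPU work is included in the timing
    step_time += time.time() - end
    end = time.time()

    if (i + 1) % LOG_EVERY == 0:
        print(f"iter {i + 1}: data {data_time / LOG_EVERY:.3f}s/it, "
              f"step {step_time / LOG_EVERY:.3f}s/it")
        data_time, step_time = 0.0, 0.0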

I have not changed anything at all except ROOT_DIR, the file paths for training, and num_workers_train|teste, as mentioned above. Here's my nvidia-smi output:

Fri Mar 15 18:35:32 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0             208W / 400W |  11303MiB / 40960MiB |     86%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     44123      C   python                                    11284MiB |
+---------------------------------------------------------------------------------------+

Here's my free -m

free -m
               total        used        free      shared  buff/cache   available
Mem:           85486        8767       19195        3219       57523       72618
Swap:              0           0           0

Here's my top:

top - 18:37:55 up  4:48,  3 users,  load average: 5.23, 5.17, 5.10
Tasks: 203 total,   5 running, 198 sleeping,   0 stopped,   0 zombie
%Cpu(s): 39.1 us,  4.0 sy,  0.0 ni, 56.4 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
MiB Mem :  85486.2 total,  19262.6 free,   8640.0 used,  57583.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  72685.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                             
  44123 jsteinb+  20   0   28.3g   7.3g   3.9g R 120.9   8.8 182:04.23 python                                                                                              
  78900 jsteinb+  20   0   28.2g   3.5g 111136 S  47.5   4.2   0:20.83 python                                                                                              
  78904 jsteinb+  20   0   28.2g   3.5g 111136 S  47.2   4.2   0:20.69 python                                                                                              
  78901 jsteinb+  20   0   28.2g   3.5g 111136 S  42.9   4.2   0:21.11 python                                                                                              
  78908 jsteinb+  20   0   28.2g   3.5g 111136 R  42.2   4.2   0:20.98 python                                                                                              
  78909 jsteinb+  20   0   28.2g   3.5g 111136 R  34.9   4.2   0:19.73 python                                                                                              
  78899 jsteinb+  20   0   28.2g   3.5g 111136 S  34.6   4.2   0:21.07 python                                                                                              
  78898 jsteinb+  20   0   28.2g   3.5g 111132 S  26.6   4.2   0:20.63 python                                                                                              
  78910 jsteinb+  20   0   28.2g   3.5g 111132 R  26.6   4.2   0:20.06 python                                                                                              
  78891 jsteinb+  20   0   28.2g   3.5g 110912 S  23.6   4.2   0:20.59 python                                                                                              
  78888 jsteinb+  20   0   28.2g   3.5g 110888 S  22.9   4.2   0:21.00 python                                                                                              
  78890 jsteinb+  20   0   28.2g   3.5g 111104 S  22.9   4.2   0:21.14 python                                                                                              
  78889 jsteinb+  20   0   28.2g   3.5g 111108 S  22.6   4.2   0:20.63 python                                                                                              
   2373 jupyter   20   0 2084416 232240  74036 S   1.0   0.3   0:19.80 python3        
jsteinberg-rbi commented 6 months ago

Hm. It seems to have stabilized at between 5 and 6 it/s.

jsteinberg-rbi commented 6 months ago

Training completed at 5 it/s.

videturfortuna commented 6 months ago

If you see the GPU utilization dropping to 0% at times, you are probably being bottlenecked by the hard disk or CPU, which cannot load your data to the GPU as fast as the GPU processes it. I believe you should be getting more iterations per second with such a GPU, so I suggest checking whether data loading is the limiting factor.
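
For example, something like this quick poll (just a sketch calling nvidia-smi from Python), run alongside training, can show whether utilization dips to 0% while the workers catch up:

# Sketch: poll GPU utilization and memory once per second while training runs,
# to catch the 0% dips that indicate a data-loading (disk/CPU) bottleneck.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    print(time.strftime("%H:%M:%S"), out.stdout.strip())
    time.sleep(1)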

I also believe you can run the code with PyTorch 2 and newer CUDA versions without changing it; this can also give you a slight increase in speed.
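
If you do try PyTorch 2, a minimal way to opt into its compiler would be something like the sketch below (not verified against this repo):

# Hypothetical sketch: wrapping the model with torch.compile after upgrading to
# PyTorch 2.x. The Sequential below is only a stand-in for the actual re-id
# model built by the training script.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU())

if hasattr(torch, "compile"):        # torch.compile only exists in PyTorch 2.0+
    model = torch.compile(model)     # may give a modest speedup without code changes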