libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Additional GPU memory usage in the first GPU #114

Closed — chengxuz closed this issue 2 years ago

chengxuz commented 2 years ago

When training one network on multiple GPUs, I find that the first GPU also has memory used by the processes running on the other GPUs. Is there some way to avoid this? It is a problem because the first GPU always ends up with more memory in use than the others, so the other GPUs have to leave memory unused to make room for that.

GuillaumeLeclerc commented 2 years ago

Hello,

This is definitely not normal behavior, and I am investigating a similar report from someone else. Are you sure you call ch.cuda.set_device appropriately in your code? If not, this is known to cause what you are describing.
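For context, a common way to guarantee a worker process can never open a stray CUDA context on GPU 0 is to restrict what GPUs the process can see before any CUDA library initializes. This is a minimal stdlib-only sketch (the function name and the single-GPU-per-process policy are assumptions, not FFCV's API); calling `torch.cuda.set_device(local_rank)` early, as suggested above, is the other standard approach.

```python
import os

def pin_worker_to_gpu(local_rank: int) -> str:
    """Restrict this process to a single physical GPU.

    Setting CUDA_VISIBLE_DEVICES means that, inside this process,
    device 0 maps to the intended physical GPU, so an accidental
    allocation on "device 0" cannot land on the machine's real GPU 0.
    This must run before torch (or any CUDA library) initializes a context.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)
    # Inside this process, "cuda:0" now refers to physical GPU `local_rank`.
    return "cuda:0"
```

With this policy each distributed worker is launched with its own `local_rank`, and all of its tensors go to `"cuda:0"` from its own point of view.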

GuillaumeLeclerc commented 2 years ago

@chengxuz Can you show me the output of nvidia-smi while this is running?

chengxuz commented 2 years ago

Here is the output from nvidia-smi. I have also just confirmed that I call torch.cuda.set_device correctly in my code.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:1A:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN Xp     Off  | 00000000:1B:00.0 Off |                  N/A |
| 23%   26C    P8     8W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN Xp     Off  | 00000000:1C:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN Xp     Off  | 00000000:1D:00.0 Off |                  N/A |
| 23%   27C    P8    10W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA TITAN Xp     Off  | 00000000:1E:00.0 Off |                  N/A |
| 23%   28C    P8     9W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA TITAN Xp     Off  | 00000000:3D:00.0 Off |                  N/A |
| 47%   75C    P2   160W / 250W |   8948MiB / 12196MiB |     35%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA TITAN Xp     Off  | 00000000:3E:00.0 Off |                  N/A |
| 49%   79C    P2   286W / 250W |   6549MiB / 12196MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA TITAN Xp     Off  | 00000000:3F:00.0 Off |                  N/A |
| 52%   83C    P2   306W / 250W |   6549MiB / 12196MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   8  NVIDIA TITAN Xp     Off  | 00000000:40:00.0 Off |                  N/A |
| 52%   83C    P2   182W / 250W |   6529MiB / 12196MiB |     83%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   9  NVIDIA TITAN Xp     Off  | 00000000:41:00.0 Off |                  N/A |
| 23%   31C    P8     9W / 250W |      0MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    5   N/A  N/A     31955      C   ...nda2/envs/ffcv/bin/python     6543MiB |
|    5   N/A  N/A     31956      C   ...nda2/envs/ffcv/bin/python      799MiB |
|    5   N/A  N/A     31957      C   ...nda2/envs/ffcv/bin/python      799MiB |
|    5   N/A  N/A     31958      C   ...nda2/envs/ffcv/bin/python      799MiB |
|    6   N/A  N/A     31956      C   ...nda2/envs/ffcv/bin/python     6547MiB |
|    7   N/A  N/A     31957      C   ...nda2/envs/ffcv/bin/python     6547MiB |
|    8   N/A  N/A     31958      C   ...nda2/envs/ffcv/bin/python     6527MiB |
+-----------------------------------------------------------------------------+
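The Processes table above shows the symptom directly: PIDs 31956–31958 each hold ~799 MiB on GPU 5 (the first GPU of the job) in addition to the ~6.5 GiB on their own GPU, i.e. each worker has opened an extra CUDA context on the first device. A small stdlib sketch of that check, flagging any PID that holds memory on more than one GPU (the data is copied from the table above):

```python
from collections import defaultdict

# (gpu_index, pid) pairs, copied from the nvidia-smi Processes table above.
rows = [(5, 31955), (5, 31956), (5, 31957), (5, 31958),
        (6, 31956), (7, 31957), (8, 31958)]

def stray_contexts(rows):
    """Return {pid: [gpus]} for every PID holding memory on >1 GPU."""
    gpus = defaultdict(set)
    for gpu, pid in rows:
        gpus[pid].add(gpu)
    return {pid: sorted(g) for pid, g in gpus.items() if len(g) > 1}
```

Here `stray_contexts(rows)` reports PIDs 31956, 31957, and 31958 as each appearing on GPU 5 plus their own GPU, which matches the memory imbalance described in the report.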
GuillaumeLeclerc commented 2 years ago

This is definitely not normal. I can reproduce it right now with 2 GPUs, although for me it is GPU 1 that has two processes associated with it.

It shouldn't take long to fix now that I can reproduce. Thank you for confirming what I suspected!

GuillaumeLeclerc commented 2 years ago

Hello! Thanks for the report. The fix should land in v0.0.4; I might deploy a release candidate tonight. Otherwise, you can install directly from GitHub (branch v0.0.4).
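Pending the release, installing from that branch might look like the following (the repo path is taken from the header of this page; the exact command is an assumption, not from the comment above):

```shell
# Install FFCV directly from the v0.0.4 branch on GitHub
# (repo path libffcv/ffcv assumed from the project header).
pip install "git+https://github.com/libffcv/ffcv.git@v0.0.4"
```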