open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

Distributed training with argument --gpus #929

Closed (makecent closed this issue 3 years ago)

makecent commented 3 years ago

In your documentation you mentioned that: "--gpus ${GPU_NUM}: Number of gpus to use, which is only applicable to non-distributed training." I am a bit confused by this description: if this argument cannot be used for distributed training, how should I control the number of GPUs for distributed training? I used to use dist_train.sh with the argument --gpus for distributed training and it worked properly.

irvingzhang0512 commented 3 years ago

CUDA_VISIBLE_DEVICES is exactly what you need,

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4

For more information, see here.

makecent commented 3 years ago

> CUDA_VISIBLE_DEVICES is exactly what you need,
>
> CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
> CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
>
> For more information, see here.

But with nvidia-smi, I found that I actually CAN use the --gpus argument to control the number of GPUs used by dist_train.sh.

irvingzhang0512 commented 3 years ago

Interesting. --gpus is only used to initialize cfg.gpu_ids:

https://github.com/open-mmlab/mmaction2/blob/aa26715d5c6a0991af4ead5b5ea03c46dd65dcee/tools/train.py#L103

However, in distributed mode, cfg.gpu_ids will be overwritten by

https://github.com/open-mmlab/mmaction2/blob/aa26715d5c6a0991af4ead5b5ea03c46dd65dcee/tools/train.py#L112

So args.gpus is useless in distributed mode.
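
For example (an illustrative sketch: dist_train.sh forwards any extra arguments to tools/train.py, so --gpus is accepted but has no effect once the launcher is set):

# In distributed mode the GPU count comes from the number of launched processes
# (the second positional argument of dist_train.sh), not from --gpus.
CUDA_VISIBLE_DEVICES=0,1 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 2            # uses 2 GPUs
CUDA_VISIBLE_DEVICES=0,1 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 2 --gpus 1   # still 2 GPUs; --gpus is ignored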

makecent commented 3 years ago

> Interesting. --gpus is only used to initialize cfg.gpu_ids:
>
> https://github.com/open-mmlab/mmaction2/blob/aa26715d5c6a0991af4ead5b5ea03c46dd65dcee/tools/train.py#L103
>
> However, in distributed mode, cfg.gpu_ids will be overwritten by
>
> https://github.com/open-mmlab/mmaction2/blob/aa26715d5c6a0991af4ead5b5ea03c46dd65dcee/tools/train.py#L112
>
> So args.gpus is useless in distributed mode.

With the training command PYTHONPATH=$PWD:$PYTHONPATH mim train mmaction configs/localization/apn/apn_coral_r3dsony_32x5_10e_activitynet5fps_rgb.py --validate --gpus 1 --launcher pytorch, I got:

(base) louis@louis-4:~$ nvidia-smi
Tue Jun 15 11:36:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 30%   48C    P2   230W / 250W |   3487MiB / 11019MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| 27%   29C    P8    22W / 250W |     13MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1269      G   /usr/lib/xorg/Xorg                 45MiB |
|    0   N/A  N/A      2031      G   /usr/lib/xorg/Xorg                194MiB |
|    0   N/A  N/A      2274      G   /usr/bin/gnome-shell               93MiB |
|    0   N/A  N/A      2304      G   ...mviewer/tv_bin/TeamViewer        2MiB |
|    0   N/A  N/A      4300      G   ...AAAAAAAAA= --shared-files       65MiB |
|    0   N/A  N/A     32138      G   ...f_6908.log --shared-files        3MiB |
|    0   N/A  N/A     62156      C   ...nvs/open-mmlab/bin/python     3065MiB |
|    1   N/A  N/A      1269      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2031      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

If I change the command to PYTHONPATH=$PWD:$PYTHONPATH mim train mmaction configs/localization/apn/apn_coral_r3dsony_32x5_10e_activitynet5fps_rgb.py --validate --gpus 2 --launcher pytorch, I got:

(base) louis@louis-4:~$ nvidia-smi
Tue Jun 15 11:39:31 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 36%   53C    P2    65W / 250W |   1446MiB / 11019MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:02:00.0 Off |                  N/A |
| 27%   32C    P2    70W / 250W |   1056MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1269      G   /usr/lib/xorg/Xorg                 45MiB |
|    0   N/A  N/A      2031      G   /usr/lib/xorg/Xorg                200MiB |
|    0   N/A  N/A      2274      G   /usr/bin/gnome-shell               62MiB |
|    0   N/A  N/A      2304      G   ...mviewer/tv_bin/TeamViewer        2MiB |
|    0   N/A  N/A      4300      G   ...AAAAAAAAA= --shared-files       71MiB |
|    0   N/A  N/A     32138      G   ...f_6908.log --shared-files        3MiB |
|    0   N/A  N/A     62525      C   ...nvs/open-mmlab/bin/python     1043MiB |
|    1   N/A  N/A      1269      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2031      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A     62526      C   ...nvs/open-mmlab/bin/python     1043MiB |
+-----------------------------------------------------------------------------+

irvingzhang0512 commented 3 years ago

mim train mmaction and python tools/train.py are quite different

After a quick look at the mim train code, I would recommend you ignore the docs for tools/train.py and refer to here instead.

makecent commented 3 years ago

> mim train mmaction and python tools/train.py are quite different
>
> After a quick look at the mim train code, I would recommend you ignore the docs for tools/train.py and refer to here instead.

I only refactored my code to use mim in the last two days. I'd also like to correct my earlier statement: I used to directly put a number (not --gpus) after the config file to specify the number of GPUs used by dist_train.sh, and that works. I think it's because that number is caught by

GPUS=$2
...
--nproc_per_node=$GPUS

in dist_train.sh. Anyway, it works and is still very simple. I just think a clear explanation of this in the docs would be great.
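
For reference, the relevant part of dist_train.sh looks roughly like this (a simplified sketch, not the verbatim script):

#!/usr/bin/env bash
# Sketch of tools/dist_train.sh: the config is the first positional argument,
# the GPU count is the second, and any remaining arguments are forwarded to train.py.
CONFIG=$1
GPUS=$2
PORT=${PORT:-29500}

python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}

So the second positional argument directly becomes --nproc_per_node, which is why it controls how many GPUs are used.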

kennymckormick commented 3 years ago

If you use the mmaction2 scripts, you can use dist_train.sh to launch distributed training. If you use MIM, please refer to the MIM documentation on how to set the number of GPUs (via --gpus 4 --launcher pytorch).
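
For example (illustrative commands; adjust the config and GPU count to your setup):

# MMAction2 script: the GPU count is the second positional argument
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh ${CONFIG_FILE} 4

# MIM: pass the GPU count with --gpus together with --launcher pytorch
mim train mmaction ${CONFIG_FILE} --validate --gpus 4 --launcher pytorch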

makecent commented 3 years ago

> If you use the mmaction2 scripts, you can use dist_train.sh to launch distributed training. If you use MIM, please refer to the MIM documentation on how to set the number of GPUs (via --gpus 4 --launcher pytorch).

Thanks.