CUDA_VISIBLE_DEVICES is exactly what you need, e.g.:

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4

More information can be found here.
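As a quick sanity check (my own minimal snippet, not something from this thread), you can confirm that the mask is respected before launching training: with CUDA_VISIBLE_DEVICES=0,1,2,3 exported, PyTorch should only enumerate four devices.

# Hypothetical check; run as e.g. CUDA_VISIBLE_DEVICES=0,1,2,3 python check_gpus.py
# torch only sees the devices left visible by the mask, so this prints 4.
import torch
print(torch.cuda.device_count())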
But with nvidia-smi, I found that I actually CAN use the --gpus argument to control the number of GPUs used by dist_train.sh.
Interesting. --gpus is only used to initialize cfg.gpu_ids. However, in distributed mode, args.gpu_ids will be overwritten, so args.gpus is useless in distributed mode.
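For reference, the relevant logic in tools/train.py is roughly the following (a paraphrased sketch of the usual MMCV-style pattern, not the verbatim source):

# Sketch only: how --gpus is consumed and then discarded in distributed mode.
if args.launcher == 'none':
    distributed = False  # cfg.gpu_ids (initialized from --gpus / --gpu-ids) is actually used
else:
    distributed = True
    init_dist(args.launcher, **cfg.dist_params)  # from mmcv.runner
    # The GPU ids are re-derived from the launcher's world size,
    # so whatever --gpus set earlier has no effect here.
    _, world_size = get_dist_info()
    cfg.gpu_ids = range(world_size)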
With the training command: PYTHONPATH=$PWD:$PYTHONPATH mim train mmaction configs/localization/apn/apn_coral_r3dsony_32x5_10e_activitynet5fps_rgb.py --validate --gpus 1 --launcher pytorch
I got:
(base) louis@louis-4:~$ nvidia-smi
Tue Jun 15 11:36:47 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| 30% 48C P2 230W / 250W | 3487MiB / 11019MiB | 99% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:02:00.0 Off | N/A |
| 27% 29C P8 22W / 250W | 13MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1269 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 2031 G /usr/lib/xorg/Xorg 194MiB |
| 0 N/A N/A 2274 G /usr/bin/gnome-shell 93MiB |
| 0 N/A N/A 2304 G ...mviewer/tv_bin/TeamViewer 2MiB |
| 0 N/A N/A 4300 G ...AAAAAAAAA= --shared-files 65MiB |
| 0 N/A N/A 32138 G ...f_6908.log --shared-files 3MiB |
| 0 N/A N/A 62156 C ...nvs/open-mmlab/bin/python 3065MiB |
| 1 N/A N/A 1269 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2031 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
If I change the command to: PYTHONPATH=$PWD:$PYTHONPATH mim train mmaction configs/localization/apn/apn_coral_r3dsony_32x5_10e_activitynet5fps_rgb.py --validate --gpus 2 --launcher pytorch
I got:
(base) louis@louis-4:~$ nvidia-smi
Tue Jun 15 11:39:31 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| 36% 53C P2 65W / 250W | 1446MiB / 11019MiB | 8% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:02:00.0 Off | N/A |
| 27% 32C P2 70W / 250W | 1056MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1269 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 2031 G /usr/lib/xorg/Xorg 200MiB |
| 0 N/A N/A 2274 G /usr/bin/gnome-shell 62MiB |
| 0 N/A N/A 2304 G ...mviewer/tv_bin/TeamViewer 2MiB |
| 0 N/A N/A 4300 G ...AAAAAAAAA= --shared-files 71MiB |
| 0 N/A N/A 32138 G ...f_6908.log --shared-files 3MiB |
| 0 N/A N/A 62525 C ...nvs/open-mmlab/bin/python 1043MiB |
| 1 N/A N/A 1269 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2031 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 62526 C ...nvs/open-mmlab/bin/python 1043MiB |
+-----------------------------------------------------------------------------+
mim train mmaction and python tools/train.py are quite different. After a quick look at the mim train code, I would recommend ignoring the docs in tools/train.py and referring to here.
I only refactored my code with mim in the last two days, and I'd like to correct my statement: I used to directly pass a number (not --gpus) after the config file to set the number of GPUs used by dist_train.sh, and it works. I think that's because the number gets picked up by GPUS=$2 ... --nproc_per_node=$GPUS in dist_train.sh, as sketched below. Anyway, it works and is still very simple; it's just that I think a clear explanation of this in the docs would be great.
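For context, a typical dist_train.sh in the OpenMMLab repos looks roughly like this (a simplified sketch, not the verbatim MMAction2 script), which is why the bare number after the config file ends up as --nproc_per_node:

# Simplified sketch of dist_train.sh (paraphrased)
CONFIG=$1
GPUS=$2                 # the bare number passed after the config file
PORT=${PORT:-29500}     # override with PORT=... to run several jobs at once

python -m torch.distributed.launch \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}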
If you use the mmaction2 scripts, you can use dist_train.sh to launch distributed training. If you use MIM, please refer to the MIM documentation on how to set the number of GPUs (via --gpus 4 --launcher pytorch).
Thanks.
In your documentation you mentioned that: "--gpus ${GPU_NUM}: Number of gpus to use, which is only applicable to non-distributed training." I am a bit confused by this description: if this argument cannot be used for distributed training, how should I control the GPU number for distributed training? I used to use dist_train.sh with the --gpus argument for distributed training and it worked properly.