open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

Why does the runtime increase when using multiple GPUs? #1087

Open ray342659093 opened 3 years ago

ray342659093 commented 3 years ago

I am trying to train an I3D model on the Kinetics-700 dataset. When I used a single GPU (1 V100) by running the following command:

./tools/dist_train.sh ./configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py 1

the runtime looks fine:

2021-08-04 21:31:12,672 - mmaction - INFO - Epoch [1][10000/62577]  lr: 1.000e-02, eta: 46 days, 19:39:55, time: 0.444, data_time: 0.099, memory: 5068, top1_acc: 0.0187, top5_acc: 0.0688, loss_cls: 5.7112, loss: 5.7112, grad_norm: 4.5613
2021-08-04 21:31:23,937 - mmaction - INFO - Epoch [1][10020/62577]  lr: 1.000e-02, eta: 46 days, 19:22:12, time: 0.563, data_time: 0.218, memory: 5068, top1_acc: 0.0187, top5_acc: 0.0875, loss_cls: 5.6884, loss: 5.6884, grad_norm: 4.8015
2021-08-04 21:31:33,720 - mmaction - INFO - Epoch [1][10040/62577]  lr: 1.000e-02, eta: 46 days, 18:49:11, time: 0.489, data_time: 0.145, memory: 5068, top1_acc: 0.0312, top5_acc: 0.0813, loss_cls: 5.7741, loss: 5.7741, grad_norm: 5.2769
2021-08-04 21:31:42,548 - mmaction - INFO - Epoch [1][10060/62577]  lr: 1.000e-02, eta: 46 days, 18:06:24, time: 0.441, data_time: 0.096, memory: 5068, top1_acc: 0.0187, top5_acc: 0.0875, loss_cls: 5.7192, loss: 5.7192, grad_norm: 4.6058
2021-08-04 21:31:51,996 - mmaction - INFO - Epoch [1][10080/62577]  lr: 1.000e-02, eta: 46 days, 17:30:12, time: 0.472, data_time: 0.123, memory: 5068, top1_acc: 0.0125, top5_acc: 0.0875, loss_cls: 5.6755, loss: 5.6755, grad_norm: 5.3815
2021-08-04 21:32:03,394 - mmaction - INFO - Epoch [1][10100/62577]  lr: 1.000e-02, eta: 46 days, 17:14:15, time: 0.570, data_time: 0.228, memory: 5068, top1_acc: 0.0500, top5_acc: 0.1437, loss_cls: 5.7211, loss: 5.7211, grad_norm: 6.5947
2021-08-04 21:32:10,802 - mmaction - INFO - Epoch [1][10120/62577]  lr: 1.000e-02, eta: 46 days, 16:17:18, time: 0.370, data_time: 0.020, memory: 5068, top1_acc: 0.0375, top5_acc: 0.0750, loss_cls: 5.8865, loss: 5.8865, grad_norm: 4.9243

However, when I used multiple GPUs (8 V100s) to train the model by running the command:

./tools/dist_train.sh ./configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py 8

the runtime increases dramatically:

INFO:mmaction:Epoch [1][3600/7823]  lr: 1.000e-02, eta: 25 days, 22:34:01, time: 2.944, data_time: 0.070, memory: 5174, top1_acc: 0.1805, top5_acc: 0.3820, loss_cls: 4.1032, loss: 4.1032, grad_norm: 6.3017
2021-08-10 13:57:56,798 - mmaction - INFO - Epoch [1][3620/7823]  lr: 1.000e-02, eta: 25 days, 22:29:36, time: 2.830, data_time: 0.006, memory: 5174, top1_acc: 0.1672, top5_acc: 0.3719, loss_cls: 4.2293, loss: 4.2293, grad_norm: 6.5334
2021-08-10 13:58:55,511 - mmaction - INFO - Epoch [1][3640/7823]  lr: 1.000e-02, eta: 25 days, 22:32:45, time: 2.936, data_time: 0.434, memory: 5174, top1_acc: 0.1781, top5_acc: 0.3859, loss_cls: 4.1610, loss: 4.1610, grad_norm: 6.3760
2021-08-10 13:59:54,847 - mmaction - INFO - Epoch [1][3660/7823]  lr: 1.000e-02, eta: 25 days, 22:38:04, time: 2.967, data_time: 0.627, memory: 5174, top1_acc: 0.1641, top5_acc: 0.3766, loss_cls: 4.1659, loss: 4.1659, grad_norm: 6.2841
2021-08-10 14:00:59,300 - mmaction - INFO - Epoch [1][3680/7823]  lr: 1.000e-02, eta: 25 days, 23:01:23, time: 3.223, data_time: 0.186, memory: 5174, top1_acc: 0.1648, top5_acc: 0.3758, loss_cls: 4.1978, loss: 4.1978, grad_norm: 6.4122
2021-08-10 14:02:03,899 - mmaction - INFO - Epoch [1][3700/7823]  lr: 1.000e-02, eta: 25 days, 23:24:55, time: 3.230, data_time: 0.005, memory: 5174, top1_acc: 0.1391, top5_acc: 0.3789, loss_cls: 4.1990, loss: 4.1990, grad_norm: 6.3570
2021-08-10 14:03:04,172 - mmaction - INFO - Epoch [1][3720/7823]  lr: 1.000e-02, eta: 25 days, 23:33:06, time: 3.014, data_time: 0.006, memory: 5174, top1_acc: 0.1656, top5_acc: 0.3555, loss_cls: 4.2312, loss: 4.2312, grad_norm: 6.3330
2021-08-10 14:04:02,840 - mmaction - INFO - Epoch [1][3740/7823]  lr: 1.000e-02, eta: 25 days, 23:35:38, time: 2.933, data_time: 0.017, memory: 5174, top1_acc: 0.1773, top5_acc: 0.3781, loss_cls: 4.1784, loss: 4.1784, grad_norm: 6.4687
2021-08-10 14:05:05,744 - mmaction - INFO - Epoch [1][3760/7823]  lr: 1.000e-02, eta: 25 days, 23:52:43, time: 3.145, data_time: 0.006, memory: 5174, top1_acc: 0.1695, top5_acc: 0.3781, loss_cls: 4.1154, loss: 4.1154, grad_norm: 6.4328
2021-08-10 14:06:08,344 - mmaction - INFO - Epoch [1][3780/7823]  lr: 1.000e-02, eta: 26 days, 0:08:36, time: 3.130, data_time: 0.008, memory: 5174, top1_acc: 0.1609, top5_acc: 0.3633, loss_cls: 4.2048, loss: 4.2048, grad_norm: 6.4456

When I checked the log file provided in the configs folder, I found that you were able to keep the time per iteration within 0.5 s even when using multiple GPUs. I wonder how to achieve that.

I then found that the line `log_vars[loss_name] = loss_value.item()` in the `_parse_losses` method of `mmaction2/mmaction/models/recognizers/base.py` is slow. Some GPUs execute it very quickly (less than 0.01 s), while others take 1~4 s. From searching, I learned that `item()` synchronizes GPU operations, which is why it takes so much time. However, `item()` is also called when using a single GPU, and the runtime is still fine there. Is there any way to solve this problem?

Another issue is that when using multiple GPUs, most GPUs show over 90% utilization, but there are always one or two GPUs with very low utilization (less than 20%).
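
For reference, here is a minimal standalone PyTorch sketch (not mmaction2 code) of why `item()` can look slow: copying a CUDA scalar to the CPU blocks until all previously queued kernels have finished, so the time measured on that line is mostly the pending forward/backward work rather than the call itself.

```python
# Sketch: item() forces a device synchronization, so it "absorbs" the time
# of whatever GPU work is still in flight when it is called.
import time
import torch

device = torch.device('cuda')
model = torch.nn.Linear(4096, 4096).to(device)
x = torch.randn(256, 4096, device=device)

loss = model(x).pow(2).mean()   # kernels are only queued here (asynchronous)
loss.backward()                 # still asynchronous

t0 = time.time()
torch.cuda.synchronize()        # explicitly wait for the queued GPU work
print(f'gpu work: {time.time() - t0:.4f} s')

t0 = time.time()
_ = loss.item()                 # nearly free now; the sync already happened
print(f'item():   {time.time() - t0:.4f} s')
```

With distributed data parallel training, the gradient all-reduce during `backward()` also waits for the slowest rank, which would explain why the same `item()` call is cheap on one GPU but expensive on eight when one rank lags behind.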

irvingzhang0512 commented 3 years ago

Maybe the bottleneck is I/O, CPU, or memory, not the GPU. Reference: #949
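
One way to test that hypothesis (a rough sketch in plain PyTorch; `benchmark_loader` and its arguments are made up for illustration and are not an mmaction2 API) is to iterate the training DataLoader by itself, with no model, and compare its per-batch time against the per-iteration GPU time:

```python
# Sketch: if the loader alone cannot deliver batches faster than one GPU
# step, the input pipeline (decode / I/O / CPU) is the bottleneck, and with
# DDP the slowest worker stalls all 8 ranks.
import itertools
import time

from torch.utils.data import DataLoader

def benchmark_loader(dataset, batch_size=8, num_workers=4, n_batches=100):
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, pin_memory=True)
    it = iter(loader)
    next(it)                                   # warm up worker processes
    t0 = time.time()
    for _ in itertools.islice(it, n_batches):  # just drain batches, no compute
        pass
    per_batch = (time.time() - t0) / n_batches
    print(f'{per_batch:.3f} s per batch from the loader alone')
    return per_batch
```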

ray342659093 commented 3 years ago

Thank you for your suggestion. It does improve the runtime slightly (down to about 1.7 s/iter, but still 3 times the runtime reported in the log file you provide) after adding the code below:

import cv2
cv2.setNumThreads(1)  # limit OpenCV to a single thread in each process/worker

The CPU usage goes down a bit, but the GPU utilization still looks strange (some GPUs are very busy, while there are always one or two with very low usage). So I think the bottleneck may be the GPU, since, as I said, calling item() takes a lot of time. I am running the code on a machine with 56 CPU cores, 250 GB of memory and 8 V100 GPUs.
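
For completeness, a hedged sketch of the broader thread-limiting workaround (generic PyTorch/OpenCV calls, not an mmaction2-specific API; the `worker_init_fn` wiring is an assumption about how the DataLoader is built). With 8 DDP processes each spawning several dataloader workers, letting cv2/OpenMP/MKL create extra threads in every worker can oversubscribe the 56 cores, make one rank lag behind, and stall the others:

```python
# Sketch: keep every process and dataloader worker single-threaded so that
# 8 ranks x N workers do not oversubscribe the 56 CPU cores.
import os
os.environ.setdefault('OMP_NUM_THREADS', '1')   # set before heavy imports
os.environ.setdefault('MKL_NUM_THREADS', '1')

import cv2
import torch

def worker_init_fn(worker_id):
    # each dataloader worker does its decode/resize on a single thread
    cv2.setNumThreads(1)
    torch.set_num_threads(1)

# Hypothetical wiring, assuming a standard PyTorch DataLoader:
# loader = torch.utils.data.DataLoader(dataset, num_workers=4,
#                                      worker_init_fn=worker_init_fn, ...)
```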