Open ray342659093 opened 3 years ago
Maybe the bottleneck is I/O, CPU, or memory, not the GPU. Reference: #949
Thank you for your suggestion. Adding the code below does improve the runtime slightly (down to about 1.7 s/iter, but that is still roughly 3x the runtime reported in the log file you provided):
import cv2
cv2.setNumThreads(1)
The CPU usage goes down a bit, but GPU usage is still strange (some GPUs are very busy, while there are always one or two with very low utilization). So I think the bottleneck may be the GPU; as I said, calling item() takes a lot of time. I am running the code on a machine with 56 CPU cores, 250 GB of memory, and 8 V100 GPUs.
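As a further check on CPU oversubscription, here is a minimal sketch (an assumption on my part, not something from this repo) that caps the common per-process thread pools before the heavy imports, since 8 training processes plus their dataloader workers can easily oversubscribe 56 cores:

```python
import os

# Cap per-process thread pools BEFORE importing numpy/cv2/torch, so that
# multiple training processes (plus dataloader workers) don't oversubscribe
# the CPU cores. These are standard env vars read at library import time.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

try:
    import cv2  # imported after the env vars so the limits take effect
    cv2.setNumThreads(1)  # same setting as suggested above
except ImportError:
    pass  # cv2 not installed; the env vars alone still limit numpy/MKL
```

Whether this helps depends on how many dataloader workers each process spawns; the values here are just the most conservative starting point.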
I am trying to train an I3D model on the Kinetics-700 dataset. When I use a single GPU (1 V100) by running the following command:
./tools/dist_train.sh ./configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py 1
the runtime looks fine.
However, when I use multiple GPUs (8 V100s) to train the model by running the command:
./tools/dist_train.sh ./configs/recognition/i3d/i3d_r50_32x2x1_100e_kinetics400_rgb.py 8
the runtime increases dramatically.
When I checked the log file provided in the configs folder, I found that you keep the runtime within 0.5 s/iter even when using multiple GPUs. I wonder how you do that.
I then found that the line ''' log_vars[loss_name] = loss_value.item() ''' in the _parse_losses method in mmaction2/mmaction/models/recognizers/base.py is slow. Some GPUs execute it very quickly (less than 0.01 s), while others are quite slow (1~4 s). I searched for this, and it is said that item() synchronizes GPU operations, which is why it takes so much time. However, item() is also called when using a single GPU, and the runtime is still fine, so is there any way to solve this problem? Another issue is that when using multiple GPUs, the utilization of most GPUs is over 90%, but there are always one or two GPUs with very low utilization (less than 20%).
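On the item() point: the call copies a scalar from GPU to CPU and makes the CPU wait for all queued kernels, so under distributed training a slow rank makes item() on other ranks look slow even though the sync itself is cheap. A common workaround is to materialize losses only every few iterations instead of every step. The sketch below illustrates just that batching idea with plain Python floats standing in for tensors; DeferredLossLogger is a hypothetical helper, not part of mmaction2:

```python
class DeferredLossLogger:
    """Accumulate per-step loss values and only 'materialize' them
    (the place where .item() / a device sync would happen in real
    PyTorch code) every `log_interval` steps."""

    def __init__(self, log_interval=20):
        self.log_interval = log_interval
        self.buffer = []   # pending values; in real code, GPU tensors
        self.logged = []   # averages actually converted for logging

    def update(self, step, loss_value):
        self.buffer.append(loss_value)
        if (step + 1) % self.log_interval == 0:
            # Only here would the GPU->CPU transfer occur, amortizing
            # one synchronization over `log_interval` iterations.
            avg = sum(self.buffer) / len(self.buffer)
            self.logged.append(avg)
            self.buffer.clear()
```

In real training code the buffered entries would be detached tensors kept on the GPU, and the low-utilization stragglers would still gate the periodic sync, but far less often per epoch.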