open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Many CPU cores are unused #11678

Open · anastasia-spb opened this issue 2 months ago

anastasia-spb commented 2 months ago

Hello, I have encountered the same problem as https://github.com/open-mmlab/mmdetection/issues/10761.

I am launching the following script:

./mmdetection/tools/dist_train.sh ./mmdetection/configs/mask_rcnn/mask-rcnn_r50_fpn_1x_coco.py 4
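
For reference, I vary the number of dataloader workers through the dataloader section of the config. The snippet below is only illustrative (key names follow the mmdet 3.x config convention; num_workers is the value I change between runs):

train_dataloader = dict(
    batch_size=20,           # my actual train batch size, see below
    num_workers=8,           # the value I vary in the experiments
    persistent_workers=True,
    # sampler / dataset settings left as in mask-rcnn_r50_fpn_1x_coco.py
)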

Conda env summary:

Train batch size: 20

Hardware setup:

Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           46 bits physical, 57 bits virtual
Byte Order:              Little Endian
CPU(s):                  48
On-line CPU(s) list:     0-47
Vendor ID:               GenuineIntel
Model name:              Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz
CPU family:              6
Model:                   106
Thread(s) per core:      2
Core(s) per socket:      12
Socket(s):               2
Stepping:                6
CPU max MHz:             3600,0000
CPU min MHz:             800,0000
BogoMIPS:                6000.00

Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   1,1 MiB (24 instances)
  L1i:                   768 KiB (24 instances)
  L2:                    30 MiB (24 instances)
  L3:                    36 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-11,24-35
  NUMA node1 CPU(s):     12-23,36-47

Vulnerabilities:
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

GPUs: 4x NVIDIA RTX 6000 Ada Generation (49140 MiB each), Driver Version: 535.104.05, CUDA Driver Version: 12.2

The fewer workers I use, the faster training runs and the more stable GPU utilization is.

With many workers: [screenshot: Screenshot from 2024-05-03 14-42-20]

With only 2 workers: [screenshot: Screenshot from 2024-05-03 15-31-08]

Using the NVIDIA Nsight Systems profiler, I can see that many CPU cores are simply not being utilized.
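
A small standalone probe like the one below (my own sketch, independent of mmdetection) can show whether the DataLoader workers are being restricted to a few cores or to a single thread:

import os
import torch
from torch.utils.data import DataLoader, Dataset

class Probe(Dataset):
    # tiny dummy dataset, just enough to spin up worker processes
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return idx

def report(worker_id):
    # runs inside every worker process right after it starts
    print(f"worker {worker_id}: pid={os.getpid()}, "
          f"affinity={sorted(os.sched_getaffinity(0))}, "   # Linux-only call
          f"OMP_NUM_THREADS={os.environ.get('OMP_NUM_THREADS')}, "
          f"torch threads={torch.get_num_threads()}")

if __name__ == "__main__":
    loader = DataLoader(Probe(), batch_size=2, num_workers=4,
                        worker_init_fn=report)
    for _ in loader:
        pass

If the printed affinity list only covers a handful of the 48 cores, or the threading env vars are forced to 1, that would be consistent with what I see in the profiler.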

I have run the same experiment on another hardware setup, and there increasing the number of workers also increases training speed, as expected.

Could you give me any advice? Should I update any drivers?