ultralytics / ultralytics

NEW - YOLOv8 πŸš€ in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

When I was training with multiple GPUs, I kept getting stuck at "AMP: checks passed βœ…" #11680

Open blue-q opened 1 week ago

blue-q commented 1 week ago

Search before asking

Question

This is my environment information:

Package             Version
------------------- --------------------
certifi             2024.2.2
charset-normalizer  3.3.2
contourpy           1.1.1
cycler              0.12.1
fonttools           4.51.0
idna                3.7
importlib_resources 6.4.0
kiwisolver          1.4.5
matplotlib          3.7.5
numpy               1.24.4
opencv-python       4.9.0.80
packaging           24.0
pandas              2.0.3
pillow              10.3.0
pip                 23.3.1
psutil              5.9.8
py-cpuinfo          9.0.0
pyparsing           3.1.2
python-dateutil     2.9.0.post0
pytz                2024.1
PyYAML              6.0.1
requests            2.31.0
scipy               1.10.1
seaborn             0.13.2
setuptools          68.2.2
six                 1.16.0
thop                0.1.1.post2209072238
torch               1.12.1+cu113
torchaudio          0.12.1+cu113
torchvision         0.13.1+cu113
tqdm                4.66.4
typing_extensions   4.11.0
tzdata              2024.1
ultralytics         8.2.9
urllib3             2.2.1
wheel               0.43.0
zipp                3.18.1

This is my code:

model.train(data='/home/ultralytics/ultralytics/cfg/datasets/20240506_flame_smoke_class2.yaml', epochs=500, imgsz=640, batch=128, device=[0,1,2,3])
blue-q commented 1 week ago

After I pip install ultralytics, when I run the training command, it keeps getting stuck at

Transferred 469/475 items from pretrained weights
DDP: debug command /home/qiuzx/miniconda3/envs/yolov8/bin/python -m torch.distributed.run --nproc_per_node 4 --master_port 48025 /home/qiuzx/.config/Ultralytics/DDP/_temp_71rgtm97139853097381840.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Ultralytics YOLOv8.2.9 πŸš€ Python-3.8.13 torch-1.12.1+cu113 CUDA:0 (NVIDIA GeForce RTX 4090, 24217MiB)
                                                           CUDA:1 (NVIDIA GeForce RTX 4090, 24217MiB)
                                                           CUDA:2 (NVIDIA GeForce RTX 4090, 24217MiB)
                                                           CUDA:3 (NVIDIA GeForce RTX 4090, 24217MiB)
WARNING ⚠️ Upgrade to torch>=2.0.0 for deterministic training.
Overriding model.yaml nc=80 with nc=2
Transferred 469/475 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed βœ…

and does not proceed any further. In addition, to make it easier for me to modify the code, I pip uninstalled ultralytics, and then multi-GPU training keeps reporting errors:


File "/home/qiuzx/.config/Ultralytics/DDP/_temp_18uxnvwv139801431181536.py", line 6, in <module>
    from ultralytics.models.yolo.detect.train import DetectionTrainer
ModuleNotFoundError: No module named 'ultralytics'

How can I solve these two problems?
glenn-jocher commented 1 week ago

@blue-q hello! It seems like you're encountering two separate issues here.

  1. Training Getting Stuck at AMP Checks: If your training consistently stops at the "AMP: checks passed βœ…" without proceeding, this could potentially be due to insufficient resources or a configuration oversight. First, ensure that there are no resource limitations or I/O bottlenecks. Also, check if updating to Torch>=2.0.0 as suggested by the warning improves the situation, as newer versions of Torch have better support and optimizations for multi-GPU setups.

  2. Errors After Uninstalling Ultralytics Library: When you uninstall the Ultralytics package, Python is unable to find the module because it no longer exists in your environment, leading to a ModuleNotFoundError. If you need to make code modifications frequently, consider working in a development environment where you clone the GitHub repository and run your modified code directly from source. This approach avoids the need to uninstall and reinstall the package. You can set up this environment by cloning the repo and using pip install -e . within the repository directory.
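
As a quick sanity check (a sketch, not an official workflow), you can confirm which copy of ultralytics Python will actually import, since the generated DDP worker script does `from ultralytics.models.yolo.detect.train import DetectionTrainer`:

```python
# Sketch: verify which ultralytics installation the interpreter resolves.
import ultralytics

print(ultralytics.__version__)  # e.g. 8.2.9
print(ultralytics.__file__)     # after `pip install -e .` this should point into your cloned repo
```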

For both issues, ensuring that all dependencies are correctly installed and updating to the latest versions where possible often helps. If the problem persists, providing more specific logs or error messages could help in diagnosing the issue further!

blue-q commented 1 week ago

Hi @glenn-jocher, my CUDA is now 12.1, and I have reinstalled torch 2.1.2. My environment information is as follows:

Package                  Version              Editable project location
------------------------ -------------------- -------------------------
certifi                  2024.2.2
charset-normalizer       3.3.2
cmake                    3.29.2
contourpy                1.1.1
cycler                   0.12.1
filelock                 3.14.0
fonttools                4.51.0
fsspec                   2024.3.1
idna                     3.7
importlib_resources      6.4.0
Jinja2                   3.1.4
kiwisolver               1.4.5
lit                      18.1.4
MarkupSafe               2.1.5
matplotlib               3.7.5
mpmath                   1.3.0
networkx                 3.1
numpy                    1.24.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.18.1
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
opencv-python            4.9.0.80
packaging                24.0
pandas                   2.0.3
pillow                   10.3.0
pip                      23.3.1
psutil                   5.9.8
py-cpuinfo               9.0.0
pyparsing                3.1.2
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
requests                 2.31.0
scipy                    1.10.1
seaborn                  0.13.2
setuptools               68.2.2
six                      1.16.0
sympy                    1.12
thop                     0.1.1.post2209072238
torch                    2.1.2
torchaudio               2.1.2
torchvision              0.16.2
tqdm                     4.66.4
triton                   2.1.0
typing_extensions        4.11.0
tzdata                   2024.1
ultralytics              8.1.44               /home/qiuzx/ultralytics
urllib3                  2.2.1
wheel                    0.43.0
zipp                     3.18.1

My training command is:

model.train(data='/home/qiuzx/ultralytics/ultralytics/cfg/datasets/20240506_flame_smoke_class2.yaml', epochs=500, imgsz=640, batch=128, device=[0,1,2,3])

and at this point it still gets stuck at

DDP: debug command /home/qiuzx/miniconda3/envs/yolov8/bin/python -m torch.distributed.run --nproc_per_node 4 --master_port 37947 /home/qiuzx/.config/Ultralytics/DDP/_temp_mak_nap2139734965595248.py
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Ultralytics YOLOv8.1.44 πŸš€ Python-3.8.13 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce RTX 4090, 24217MiB)
                                                           CUDA:1 (NVIDIA GeForce RTX 4090, 24217MiB)
                                                           CUDA:2 (NVIDIA GeForce RTX 4090, 24217MiB)
                                                           CUDA:3 (NVIDIA GeForce RTX 4090, 24217MiB)
Overriding model.yaml nc=80 with nc=2
Transferred 469/475 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed βœ…

At the same time, I checked the GPUs through nvidia-smi and found that the utilization of all four GPUs was 100%.

blue-q commented 1 week ago

Is it possible that this is caused by torch.backends.cudnn.benchmark? After setting torch.backends.cudnn.enabled=False, I can run with two GPUs, but if I use four GPUs, it still gets stuck at

AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed βœ…
glenn-jocher commented 1 week ago

Hello! It sounds like you might be encountering an issue related to the CUDA cuDNN benchmarks when using multiple GPUs.

Disabling torch.backends.cudnn.benchmark can indeed help in some cases, as it turns off certain optimizations that, although they generally improve performance, can cause stalls in specific situations, especially with a variable workload between different batches.

As you noticed, setting:

torch.backends.cudnn.enabled = False

helps when using two GPUs but doesn't solve the issue with four GPUs.

It could be beneficial to ensure all GPUs synchronize properly. You might want to try setting:

torch.backends.cudnn.benchmark = False
torch.cuda.synchronize()

before your training loop or right after the AMP check, to ensure all devices are in sync.
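
For instance, a minimal sketch of where these lines could go, reusing the dataset YAML and device list from this thread (yolov8n.pt is an assumed starting checkpoint, and this is a workaround to try rather than a confirmed fix):

```python
# Sketch: apply the cuDNN settings before launching training.
import torch
from ultralytics import YOLO

torch.backends.cudnn.benchmark = False  # disable cuDNN autotuning
torch.backends.cudnn.enabled = False    # optional: the setting that helped with two GPUs above
torch.cuda.synchronize()                # ensure all devices are in sync before training starts

model = YOLO("yolov8n.pt")
model.train(
    data="/home/qiuzx/ultralytics/ultralytics/cfg/datasets/20240506_flame_smoke_class2.yaml",
    epochs=500,
    imgsz=640,
    batch=128,
    device=[0, 1, 2, 3],
)
```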

If the issue persists, please provide more details about your specific setup or configurations that might be contributing to this behavior! Happy coding! πŸš€

TomZhongJie commented 2 days ago

DDP: debug command /home/qiuzx/miniconda3/envs/yolov8/bin/python -m torch.distributed.run --nproc_per_node 4 --master_port 37947 /home/qiuzx/.config/Ultralytics/DDP/_temp_mak_nap2139734965595248.py
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default

Does this setting actually have any impact?

glenn-jocher commented 1 day ago

@TomZhongJie hello! It looks like you're inquiring about the impact of the OMP_NUM_THREADS environment setting during your DDP (Distributed Data Parallel) training with YOLOv8. Setting OMP_NUM_THREADS=1 is generally recommended for avoiding potential issues with overly aggressive thread usage by PyTorch, which can lead to inefficient CPU usage in multi-threading environments, especially when using multiple GPUs. It can help to stabilize your training process by ensuring that parallel execution doesn't become a bottleneck.

If you're experiencing particular issues or slowdowns, you might consider adjusting this setting to better fit your hardware capabilities, balancing between CPU threads and GPU workload. Here's how you can experiment with it:

import os
os.environ['OMP_NUM_THREADS'] = '4'  # Adjust this as necessary for your machine

Add this to your script before importing any major libraries like PyTorch or starting the training process to see if it impacts performance. Happy experimenting! πŸš€
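
For reference, a fuller sketch of that placement, reusing the training call from earlier in this thread (yolov8n.pt is an assumed starting checkpoint, and the value 4 is only an example to tune):

```python
# Sketch: set the thread cap before any heavy imports, then train as usual.
import os
os.environ["OMP_NUM_THREADS"] = "4"  # example value; tune for your CPU count and GPU load

from ultralytics import YOLO  # imported only after the environment variable is set

model = YOLO("yolov8n.pt")
model.train(data="/home/qiuzx/ultralytics/ultralytics/cfg/datasets/20240506_flame_smoke_class2.yaml",
            epochs=500, imgsz=640, batch=128, device=[0, 1, 2, 3])
```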