GuoQuanhao opened this issue 1 week ago
Hello @GuoQuanhao,
Thank you for providing detailed information about the issue you're encountering with multi-GPU training. The `torch.distributed.elastic.multiprocessing.errors.ChildFailedError` together with `exitcode: -11 (pid: 1805549)` indicates that one of the worker processes crashed with a segmentation fault (signal 11), which points to a problem in the distributed training setup.
To help us diagnose and resolve this issue more effectively, could you please provide a minimal reproducible example? This will allow us to replicate the problem on our end. You can find guidelines on how to create a minimal reproducible example here: Minimum Reproducible Example.
Additionally, please ensure that you are using the latest versions of all relevant packages, including PyTorch and Ultralytics YOLO. Sometimes, issues are resolved in newer releases, and updating might fix the problem.
Here are a few steps you can try to troubleshoot the issue:

1. **Verify CUDA and NCCL Installation**: Ensure that your CUDA and NCCL installations are correctly set up and compatible with your PyTorch version.
2. **Reduce Batch Size**: Sometimes, reducing the batch size can help if the issue is related to memory constraints.
3. **Check Environment Variables**: Ensure that `CUDA_VISIBLE_DEVICES` is correctly set and that all GPUs are accessible.
4. **Simplify the Setup**: Try running the training with fewer GPUs to see if the issue persists. For example, start with 2 GPUs and then gradually increase the number.
5. **Use a Different Distributed Backend**: You can try using a different distributed backend like `gloo` instead of `nccl` to see if it resolves the issue.
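One detail worth keeping in mind for step 3: the GPUs listed in `CUDA_VISIBLE_DEVICES` are renumbered from 0 inside the process, so physical GPUs `6,5` appear to PyTorch as local devices `0,1`. A small stdlib-only sketch of that remapping (the helper name is illustrative, not an Ultralytics API):

```python
import os


def local_device_indices(visible_devices: str) -> str:
    """Map a CUDA_VISIBLE_DEVICES string to the local indices
    the process will actually see (devices are renumbered from 0)."""
    ids = [d for d in visible_devices.split(",") if d.strip()]
    return ",".join(str(i) for i in range(len(ids)))


# Physical GPUs 6 and 5 become local devices 0 and 1
os.environ["CUDA_VISIBLE_DEVICES"] = "6,5"
print(local_device_indices(os.environ["CUDA_VISIBLE_DEVICES"]))  # -> 0,1
```

This is why the `device` argument passed to training should use the local indices, not the physical GPU numbers.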
Here is an example of how you might modify your script to use fewer GPUs:

```python
import os

from ultralytics import YOLO

# Expose only physical GPUs 6 and 5 to this process;
# they are renumbered as local devices 0 and 1.
os.environ["CUDA_VISIBLE_DEVICES"] = "6,5"

# Build from YAML and transfer pretrained weights
model = YOLO("yolov8n.yaml").load("./pretrained_model/yolov8n.pt")

# Train on the two visible GPUs, addressed by their local indices
results = model.train(
    data="./ultralytics/cfg/datasets/layout.yaml",
    epochs=300,
    imgsz=672,
    device=[0, 1],  # not "6,5": after masking, only devices 0 and 1 exist
    workers=0,
    batch=96,
)
```

Note that the original script passed `os.getenv('CUDA_VISIBLE_DEVICES')` (i.e. `"6,5"`) as `device`, which refers to device indices that no longer exist once the mask is applied; use the local indices instead.
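If you want to rule out NCCL itself (step 5 above), a minimal single-process sanity check with the `gloo` backend might look like the following. This is plain PyTorch, not an Ultralytics option; if even this fails, the problem is in your `torch.distributed` setup rather than NCCL:

```python
import os

import torch.distributed as dist

# Rendezvous settings for the default "env://" init method
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Single-process process group using the CPU-based gloo backend
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print(dist.get_backend())
dist.destroy_process_group()
```

If this succeeds but `nccl` does not, the crash is likely NCCL-specific (driver, CUDA, or NCCL version mismatch).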
If the issue persists, please provide any additional logs or error messages that might help us diagnose the problem further.
Thank you for your patience and cooperation. We look forward to resolving this issue together.