HUB not working correctly with Multi-GPU custom agent setup

sinchinpark commented 1 month ago

Search before asking

[x] I have searched the HUB issues and found no similar bug report.

HUB Component

Models, Training

Bug

Description

I am experiencing issues when using the HUB portal to train on a dataset with a multi-GPU custom agent setup. Specifically, I am using 2 GPUs and have modified the default parameters as follows:

device=0,1
workers=16

However, HUB does not seem to process the training data correctly: the run appears stuck throughout the training process, and it still appears stuck after training has supposedly finished, as shown in the attached screenshots.

[Screenshot: swappy-20240523_131045]

[Screenshot: swappy-20240523_132213]

Interestingly, using device=0 on the same machine with the same model works fine!
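To help narrow down whether the difference between device=0 and device=0,1 also shows up outside the HUB portal, here is a minimal sketch of a local comparison using the `ultralytics` Python package. It is an illustration, not the way the HUB agent launches training: `parse_device` and `run_training` are hypothetical helper names, and the dataset path must be supplied by the caller because the dataset in the log below is masked.

```python
def parse_device(device: str):
    """Convert a HUB-style device string such as "0,1" into an int or a list of ints.

    "0,1" becomes [0, 1], matching the device=[0, 1] shown in the trainer log.
    A single id such as "0" becomes the int 0.
    """
    ids = [int(part) for part in device.split(",") if part.strip()]
    if not ids:
        raise ValueError(f"no GPU ids found in device string: {device!r}")
    return ids[0] if len(ids) == 1 else ids


def run_training(data_yaml: str, device: str, workers: int = 16, epochs: int = 10):
    """Train yolov8m.pt locally with the given device string (hypothetical helper)."""
    # Imported lazily so parse_device can be used without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO("yolov8m.pt")
    return model.train(
        data=data_yaml,  # assumption: path to a dataset YAML you provide
        epochs=epochs,
        device=parse_device(device),
        workers=workers,
    )
```

Calling `run_training(your_data_yaml, "0")` and then `run_training(your_data_yaml, "0,1")` on the same machine would show whether the hang is specific to the HUB integration or also occurs in plain multi-GPU training.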

Logs and Errors:

Here are some potentially useful logs and errors from my custom agents:

Ultralytics HUB: View model at https://hub.ultralytics.com/models/zCnR3gSc9n1xTow1CTpS 🚀
Ultralytics YOLOv8.2.19 🚀 Python-3.10.12 torch-2.3.0+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24253MiB)
                                                                CUDA:1 (NVIDIA GeForce RTX 3090, 24253MiB)
engine/trainer: task=detect, mode=train, model=yolov8m.pt, data=***, epochs=10, time=None, patience=100, batch=-1, imgsz=640, save=True, save_period=-1, cache=ram, device=[0, 1], workers=8, project=None, name=train2, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=True, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train2

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
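
Note that the engine/trainer line above reports workers=8 although workers=16 was set in HUB. Whether that is expected is not established in this report. A small sketch for comparing requested overrides against the logged ones follows; the helper names `parse_trainer_args` and `diff_overrides` are made up for illustration, and the excerpt is a shortened copy of the log line above.

```python
import re


def parse_trainer_args(line: str) -> dict:
    """Parse 'key=value, key=value' pairs from an engine/trainer log line.

    Values may contain ', ' themselves (e.g. device=[0, 1]), so a value only
    ends where the next ', key=' begins or at the end of the line.
    """
    pairs = re.findall(r"(\w+)=(.*?)(?=, \w+=|$)", line)
    return dict(pairs)


def diff_overrides(requested: dict, logged: dict) -> dict:
    """Return {key: (requested, logged)} for every requested value that differs."""
    return {
        key: (value, logged.get(key))
        for key, value in requested.items()
        if logged.get(key) != value
    }


# Shortened excerpt of the trainer line from this report.
excerpt = (
    "engine/trainer: task=detect, mode=train, device=[0, 1], "
    "workers=8, save_dir=runs/detect/train2"
)
logged = parse_trainer_args(excerpt)
print(diff_overrides({"device": "[0, 1]", "workers": "16"}, logged))
# → {'workers': ('16', '8')}
```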

Also, I encountered the following warnings multiple times:

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)

Expected Behavior:

Training should proceed without getting stuck, show progress and metrics on the Dashboard, and allow deploying/exporting the model once training has finished (as it does when using device=0).

Custom Agent Env

Python: 3.10.12
PyTorch: 2.3.0+cu121
GPUs: 2x NVIDIA GeForce RTX 3090
Ultralytics YOLOv8.2.19

Environment

Ultralytics HUB Version: v0.1.43
Client User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
Operating System: Linux x86_64
Server Timestamp: 1716456982

Minimal Reproducible Example

No response

Additional

No response

github-actions[bot] commented 1 month ago

👋 Hello @sinchinpark, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

Quickstart. Start training and deploying YOLO models with HUB in seconds.
Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
Projects: Creating and Managing. Group your models into projects for improved organization.
Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
- iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
- Android. Explore TFLite acceleration on mobile devices.
Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

sinchinpark commented 1 month ago

Sorry, this is a duplicate of #606.

sergiuwaxmann commented 1 month ago

@sinchinpark Did you use the Custom option from the Advanced Model Configuration accordion (read more here) to change the device from 0 to 0,1?

[Screenshot: custom_device]

sinchinpark commented 1 month ago

> @sinchinpark Did you use the Custom option from the Advanced Model Configuration accordion (read more here) [...]

Yes, I'm using the HUB portal for all operations (from importing the dataset to training the model).

sergiuwaxmann commented 1 month ago

@sinchinpark Our team will investigate this issue and I will update you as soon as possible. Thank you for your patience!

sinchinpark commented 1 month ago

@sergiuwaxmann Thanks! BTW, this is the model ID, in case it helps with the investigation: https://hub.ultralytics.com/models/zCnR3gSc9n1xTow1CTpS

sergiuwaxmann commented 1 month ago

@sinchinpark Thank you!

github-actions[bot] commented 2 weeks ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Docs: https://docs.ultralytics.com
HUB: https://hub.ultralytics.com
Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

sergiuwaxmann commented 5 days ago

@sinchinpark Hey there! I apologize for the delay in replying. Multi-GPU training now works correctly with Ultralytics HUB.
