Stuck at 100% Optimizing weights

sdy623 commented 1 month ago

Search before asking

[X] I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

I used my own agent to train the model, but I can't find the trained model on the HUB webpage. Someone else reported a similar problem, but in their case there was no results.csv file, whereas I can find mine. Here is my training log:

Ultralytics HUB: Uploading checkpoint https://hub.ultralytics.com/models/GxzAXECMi65QweQlpTjs

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     97/100      17.4G      1.365     0.5464      1.235        101        640: 100%|██████████| 264/264 [01:42<00:00,  2
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:11<
                   all        747      49603      0.948      0.942      0.976      0.667

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     98/100      17.3G      1.363     0.5444      1.235        139        640: 100%|██████████| 264/264 [01:40<00:00,  2
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:11<
                   all        747      49603      0.949      0.941      0.977      0.667

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     99/100      17.3G       1.36     0.5425      1.232        137        640: 100%|██████████| 264/264 [01:40<00:00,  2
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:11<
                   all        747      49603      0.949      0.941      0.977      0.668

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
    100/100      17.3G      1.349     0.5367       1.22        156        640: 100%|██████████| 264/264 [01:41<00:00,  2
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:11<
                   all        747      49603       0.95      0.941      0.977      0.668

100 epochs completed in 3.364 hours.
Optimizer stripped from runs/detect/train12/weights/last.pt, 311.6MB
Optimizer stripped from runs/detect/train12/weights/best.pt, 311.6MB

Validating runs/detect/train12/weights/best.pt...
Ultralytics YOLOv8.1.45 🚀 Python-3.10.12 torch-2.1.0a0+32f93b1 CUDA:0 (B1.gpu.large, 24118MiB)
YOLOv5x6u summary (fused): 463 layers, 155375236 parameters, 0 gradients, 250.3 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 29/29 [00:42<
                   all        747      49603       0.95      0.941      0.977      0.668
Speed: 0.1ms preprocess, 8.5ms inference, 0.0ms loss, 1.8ms postprocess per image
Results saved to runs/detect/train12
Ultralytics HUB: Syncing final model...
100%|██████████| 297M/297M [02:44<00:00, 1.89MB/s]
Ultralytics HUB: Done ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/GxzAXECMi65QweQlpTjs 🚀
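The log reports best.pt as 311.6MB after the optimizer is stripped, while the final sync progress bar shows 297M. Assuming the first figure is in decimal megabytes (10^6 bytes) and the progress bar counts binary mebibytes (2^20 bytes), the two numbers describe the same file size. A quick check of that assumption:

```python
# Assumption: "311.6MB" in the log is decimal megabytes (1 MB = 10**6 bytes)
# and the "297M" upload total is binary mebibytes (1 MiB = 2**20 bytes).
def mb_to_mib(megabytes: float) -> float:
    """Convert decimal megabytes to binary mebibytes."""
    return megabytes * 10**6 / 2**20


best_pt_mib = mb_to_mib(311.6)
print(f"{best_pt_mib:.1f} MiB")  # 297.2 MiB
```

This only reconciles the two figures in the log; it says nothing about whether HUB processed the uploaded file successfully.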

After the line "View model at https://hub.ultralytics.com/models/GxzAXECMi65QweQlpTjs 🚀", the training process exits, but the status still shows "Optimizing weights". After a while it becomes disconnected, and I can't find the trained model in HUB.
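Even when the HUB sync fails, the log shows the weights were written locally ("Results saved to runs/detect/train12"), and the reporter says results.csv is present. A minimal sketch for confirming what is on disk, assuming the layout shown in the log (`weights/best.pt` and `weights/last.pt` inside the run directory, with `results.csv` next to them). `inspect_run` and `RUN_DIR` are hypothetical names for illustration, not part of Ultralytics:

```python
import csv
from pathlib import Path

# Hypothetical: the run directory is taken from the log line
# "Results saved to runs/detect/train12"; adjust it for your own run.
RUN_DIR = Path("runs/detect/train12")


def inspect_run(run_dir: Path) -> dict:
    """Report which weight files exist locally and the last row of results.csv."""
    report = {"weights": {}, "last_epoch": None}
    for name in ("best.pt", "last.pt"):
        path = run_dir / "weights" / name
        if path.is_file():
            # File size in decimal megabytes (10**6 bytes), rounded to one decimal.
            report["weights"][name] = round(path.stat().st_size / 1e6, 1)
    results_csv = run_dir / "results.csv"
    if results_csv.is_file():
        with results_csv.open(newline="") as f:
            rows = list(csv.DictReader(f))
        if rows:
            # Assumption: column names may be padded with spaces, so strip
            # keys and values; skip any overflow fields DictReader keys as None.
            report["last_epoch"] = {
                k.strip(): (v or "").strip()
                for k, v in rows[-1].items()
                if k is not None
            }
    return report


if __name__ == "__main__":
    print(inspect_run(RUN_DIR))
```

If `best.pt` is listed and the last row shows epoch 100, the training itself finished and only the HUB side is missing the model.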

Could someone help me? I would really appreciate it.

Environment

Training agent: Docker container
Kernel version: 5.15.146
Memory: 24GB
GPU Memory: 24GB
Python: 3.10.12
CUDA: 12.2.140
torch: 2.1.0a0+32f93b1
ultralytics: 8.1.45
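The version details in the Environment list above can be gathered with the standard library. This is a minimal sketch; `collect_versions` is a hypothetical helper, not part of Ultralytics, and the `yolo checks` command that ships with the ultralytics package prints a more complete report.

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "ultralytics")) -> dict:
    """Gather Python, kernel, and package versions for a bug report."""
    info = {"Python": platform.python_version(), "Kernel": platform.release()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```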

Minimal Reproducible Example

Log in to HUB
Choose the 'Bring your own agent' option to train the model
Run the model on my training agent
Wait for training to finish
The problem occurs.

Additional

No response

github-actions[bot] commented 1 month ago

👋 Hello @sdy623, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

Quickstart. Start training and deploying YOLO models with HUB in seconds.
Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
Projects: Creating and Managing. Group your models into projects for improved organization.
Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
- iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
- Android. Explore TFLite acceleration on mobile devices.
Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

sdy623 commented 1 month ago

sergiuwaxmann commented 1 month ago

@sdy623 Thank you for bringing this to our attention. It appears that the upload of final weights encountered a failure. Our team is currently investigating the issue to identify and resolve the underlying cause. I will keep you updated on our progress. Your patience and understanding are greatly appreciated.

sergiuwaxmann commented 3 weeks ago

Hello @sdy623! Great news! Our team has released a fix for the issue you reported. You should no longer experience this problem in new Cloud Training sessions. Thanks for your patience!

sdy623 commented 3 weeks ago

Thank you very much. How can I deploy the fixed version?

sergiuwaxmann commented 3 weeks ago

@sdy623 When using Ultralytics HUB, the system automatically utilizes the latest version. For local training, please ensure you are using the most recent ultralytics version (8.2.2). Unfortunately, the recent fix does not apply to models trained on earlier versions, so you will need to retrain your model with the latest version. We sincerely apologize for the inconvenience this causes.
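For local training, a quick sketch of checking whether the installed ultralytics package is at least 8.2.2, the version named in the reply above. It assumes plain MAJOR.MINOR.PATCH version strings and uses a simplified parser instead of `packaging.version`, so a pre-release such as `8.2.2rc1` is treated as `8.2.2`. `parse_release` and `has_fix` are hypothetical names.

```python
from importlib import metadata

# Minimum version named in the maintainer's reply.
MINIMUM = (8, 2, 2)


def parse_release(version: str) -> tuple:
    """Turn '8.1.45' into (8, 1, 45); non-numeric suffixes such as 'rc1' are dropped."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad "8.2" to (8, 2, 0)
    return tuple(parts)


def has_fix(version: str) -> bool:
    """True if the version is at or above MINIMUM (tuple comparison)."""
    return parse_release(version) >= MINIMUM


if __name__ == "__main__":
    try:
        installed = metadata.version("ultralytics")
    except metadata.PackageNotFoundError:
        installed = None
    if installed is None:
        print("ultralytics is not installed")
    elif has_fix(installed):
        print(f"ultralytics {installed} is at or above 8.2.2")
    else:
        print(f"ultralytics {installed} is older than 8.2.2; upgrade with: pip install -U ultralytics")
```

Per the reply, upgrading alone does not recover the earlier run; the model still has to be retrained on the newer version.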
