ultralytics / hub

Ultralytics HUB tutorials and support
https://hub.ultralytics.com
GNU Affero General Public License v3.0
107 stars, 11 forks

Epochs remaining gone negative and stuck on optimising weights #622

Closed: tlong123 closed this issue 3 weeks ago

tlong123 commented 1 month ago

HUB Component

Models, Training

Bug

I have set up a model to train for 100 epochs. I left it running, and when I came back the next day it said disconnected, checkpointed at the 99th epoch, so I clicked resume training. However, I'm now stuck on 100% optimising weights, with minus 6 epochs remaining, and the time estimate is stuck saying estimating...

I'm still being billed for this time, and the number of negative epochs is only increasing with time. Is this expected, or has something gone wrong with the model? I'm training it on your cloud. (It's just changed to -7 epochs remaining now.)

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

github-actions[bot] commented 1 month ago

👋 Hello @tlong123, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

tlong123 commented 1 month ago

Additionally, other than deleting the model, I can't see any way to stop it from actively costing me money in its current state, and I don't want to delete it as I've paid money for it to be trained!

UltralyticsAssistant commented 1 month ago

@tlong123 hello! First off, thanks for reaching out and detailing the issue you're facing with the training process on Ultralytics HUB. 🌟 It sounds like you've stumbled upon a rare glitch, particularly with the training session not properly concluding and going into negative epochs – this indeed is not expected behavior.

Rest assured, we prioritize both the performance and the billing concerns of our users. Here are a couple of steps you can take:

  1. If you haven't already, please attempt to manually stop the training session. While you've mentioned an issue with stopping the model without deleting it, there should be a stop or pause option available in the UI for your training session.
  2. Regarding the unexpected billing, we absolutely understand the importance of fair billing practices. We recommend reaching out to our support through the official communication channels mentioned in our documentation. They will be able to look into your billing details and make necessary adjustments based on the glitch.

We're here to ensure a smooth and efficient training experience. Your detailed report is incredibly valuable for us to improve and rectify such issues. If anything else comes up or if you have further questions, don't hesitate to update the issue or reach out to our support team.

Thanks for your patience and understanding. 👍

tlong123 commented 1 month ago

[Screenshot: ultralytics_glitch]

I've attached a screenshot of where it's at now. It eventually disconnected by itself.

sergiuwaxmann commented 1 month ago

Hello @tlong123! First of all, I am sorry for the inconvenience. This must be an issue on our end. Can you share your model ID (it is in the model page URL) so we can investigate this and prevent this from happening in the future? Also, I will refund the credits used for training this model.

tlong123 commented 1 month ago

Yeah, sure, it's: y9dPahdYO4ShpAfpD6pG

sergiuwaxmann commented 1 month ago

@tlong123 I have refunded the credits used for training this model. Our team is investigating the cause of the issue with your training and exploring ways to prevent such incidents in the future. I will keep you updated on our progress.

tlong123 commented 1 month ago

Thanks Sergiu! I think I should mention here that I've just had the same issue with another model I've tried to train. The model ID is PmUMb8RaZufqpap3APXi.

[Screenshot 2024-03-28 134453]

Burhan-Q commented 1 month ago

@tlong123 I'm trying to help with troubleshooting the issue and wanted to ask for more information about your configuration. If you could share:

I can better help out with testing something more closely resembling your setup.

Burhan-Q commented 1 month ago

Looks like most of the info is actually in the screenshots you shared, so I'll test with those. Anything else you can recall that's not shown in the screenshots would be helpful to know.

tlong123 commented 1 month ago

Hi Burhan! The only other things I can think of are that they were trained on the Ultralytics cloud and both disconnected at some point before 100 epochs, so I had to click resume training.

sergiuwaxmann commented 1 month ago

@tlong123 I have refunded the credits used for training the second model. Our team is investigating the cause of the issue with your training sessions and exploring ways to prevent such incidents in the future. I will keep you updated on our progress.

Please accept our apologies for the inconvenience caused.

sergiuwaxmann commented 3 weeks ago

Hello @tlong123! Great news! Our team has released a fix for the issue you reported. You should no longer experience this problem in new Cloud Training sessions. Thanks for your patience!