ultralytics / hub

Ultralytics HUB tutorials and support
https://hub.ultralytics.com
GNU Affero General Public License v3.0

Custom model using timed training doesn't allow its use #627

Closed JhonFrederick closed 3 weeks ago

JhonFrederick commented 1 month ago

Question

Our team is currently using YOLOv8 models, and we decided to train our own model using cloud training with Timed training to test this option.

According to the documentation, we thought that at the end of the given time, the model would be trained to that point and allow it to be used, but this does not seem to be the case. I don't have much experience with training models and I was exploring the platform, for this reason, I want to know if there is something I am doing wrong or have misunderstood?

Timed Training: The timed training feature allows you to fix the time duration of the entire training process and also determines the estimated amount before the start of training.

This is the final status of the model training. We thought that buying more credits would enable the "Resume training" button, but it did not.

WhatsApp Image 2024-04-01 at 11 04 28 AM

I appreciate your help in advance

Additional

No response

github-actions[bot] commented 1 month ago

👋 Hello @JhonFrederick, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

sergiuwaxmann commented 1 month ago

Hello @JhonFrederick!

First of all, please accept our apologies for the inconvenience caused.

Based on the screenshot you shared, you used Epochs training (not Timed training) but I would like to investigate this further. Can you please share your model ID (you can find it in the URL) here?

Also, looking at the right side of your screenshot, I can see negative epochs which makes me think that you might face an issue we are currently trying to solve (#622).
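For readers following along, the model ID mentioned above can be pulled out of a model page URL programmatically. This is only a minimal sketch: it assumes the URL has the shape `https://hub.ultralytics.com/models/<MODEL_ID>`, which is not stated in this thread, and `model_id_from_url` is a hypothetical helper name, not part of any Ultralytics package.

```python
from urllib.parse import urlparse


def model_id_from_url(url: str) -> str:
    """Return the path segment that follows 'models' in a HUB model URL.

    Assumption: model pages live under /models/<MODEL_ID>.
    """
    # Split the URL path into its non-empty segments.
    segments = [s for s in urlparse(url).path.split("/") if s]
    try:
        # The ID is taken to be the segment right after 'models'.
        return segments[segments.index("models") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"No model ID found in URL: {url!r}") from None


if __name__ == "__main__":
    print(model_id_from_url("https://hub.ultralytics.com/models/FuVEbxOoAcWJCFA7fa9m"))
```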

JhonFrederick commented 1 month ago

I remember setting the timed training to a value of 1 day, but now I'm not sure, mainly because I am currently running another test with Epoch training and the information is displayed differently from the attempt shown in the screenshot. But since you mention the issue, it could be due to that. Model ID: FuVEbxOoAcWJCFA7fa9m

Edit: When I ran my model (FuVEbxOoAcWJCFA7fa9m), I reviewed the billing data a few minutes later and the charge corresponded to the time entered (1 day); the total value was already calculated. With epoch training, by contrast, the cost is calculated over time. I don't know if it's relevant, but I noticed this now that I'm running another model (with Epoch training).

UltralyticsAssistant commented 1 month ago

@JhonFrederick hello again, and thank you for providing the model ID and additional details. It clarifies your situation significantly.

Given the information and your experience with both timed and epoch training, it indeed sounds like the unusual behavior you encountered with the model FuVEbxOoAcWJCFA7fa9m might be related to the issue we're currently addressing.

I appreciate your patience and understanding as we work towards resolving this. In the meantime, it seems you've correctly identified different billing behaviors between timed and epoch training: timed training estimates your total cost upfront based on the duration, whereas epoch training's cost accumulates over time.

Your observations are indeed relevant and help us ensure the platform works as expected for everyone. We'll keep you updated on our progress with the mentioned issue. Please, stay tuned! 😊
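The billing difference described above can be illustrated with a small sketch. Everything numeric here is an assumption made for illustration: `HOURLY_RATE` is a hypothetical price, not Ultralytics HUB pricing, and the model assumes cost is simply proportional to training time in both modes.

```python
from itertools import accumulate

HOURLY_RATE = 2.0  # hypothetical price per training hour, not real HUB pricing


def timed_training_cost(duration_hours: float, rate: float = HOURLY_RATE) -> float:
    """Timed training: the total is known before training starts."""
    return duration_hours * rate


def epoch_training_running_cost(epoch_hours: list[float], rate: float = HOURLY_RATE) -> list[float]:
    """Epoch training: the bill grows as each epoch finishes.

    Returns the cumulative cost after each epoch.
    """
    return list(accumulate(hours * rate for hours in epoch_hours))


if __name__ == "__main__":
    # One day of timed training is priced in full upfront.
    print(timed_training_cost(24))
    # Three epochs of different lengths: the cost accumulates step by step.
    print(epoch_training_running_cost([0.5, 1.0, 1.5]))
```

Under these assumptions, the timed run shows its full charge within minutes of starting, while the epoch run only reaches its final charge once the last epoch completes, which matches the behavior reported in the billing view.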

JhonFrederick commented 1 month ago

image

According to the above screen, my second model (ID: tj2HLEVdErYxgunZzH9Z) finished training with epoch training, but when I go to the preview or deployment tab, I get the following message: "Model not trained". image

I have attached a screenshot of the billing summary, which shows the different attempts to complete the training. image

Please tell me what I could be doing wrong in this case that prevents me from using the trained model.

I appreciate your help again in advance

UltralyticsAssistant commented 1 month ago

@JhonFrederick hello again!

Thanks for reaching out with these details. It looks like an issue on our end where the model's training status hasn't correctly updated in the UI, despite the training completion. This misalignment is likely causing the "Model not trained" message you're seeing.

For now, could you try refreshing the page or logging out and back into the platform to see if that helps sync the status? Sometimes, a simple refresh can resolve such discrepancies.

If the issue persists, rest assured, we're here to help! We'll investigate further using the model ID tj2HLEVdErYxgunZzH9Z you provided and ensure your model becomes accessible for preview and deployment.

Again, we truly appreciate your patience and feedback as we work to improve the platform. Stay tuned! 🌟

sergiuwaxmann commented 1 month ago

@JhonFrederick Something went wrong with the first model (FuVEbxOoAcWJCFA7fa9m) and we are not yet sure what. Our team is investigating this issue. Regarding the second model (tj2HLEVdErYxgunZzH9Z), it appears that although the model finished training, the final upload of weights failed, which is why the model is unusable. We have refunded the account balance you used and kindly ask you to start the training process again from scratch. Once again, our apologies for the inconvenience caused.

JhonFrederick commented 1 month ago

Hi,

I tried again with another test using epoch training, but again I had problems. I have attached proof of this. Model ID: xrGz5bRPDQvMniPK8eIR image

Billing information image

In this case the training was going well up to a certain point. After 75%, I had to retry the training a couple of times until it completed, but I still could not use the model, and it finally ended in the state shown.

sergiuwaxmann commented 1 month ago

Hello @JhonFrederick!

I apologize once again for the inconvenience.

Based on our internal tests, we've observed that, in approximately 10% of cases, the final weights upload fails. This results in the model being stuck at 100%. If the training is resumed, the session fails since the training has already completed. Our team is currently working on updating the logic for uploading weights to the Ultralytics HUB to prevent this issue.

Meanwhile, we have refunded the account balance you used.

CC @hassaanfarooq01

sergiuwaxmann commented 3 weeks ago

Hello @JhonFrederick! Great news! Our team has released a fix for the issue you reported. You should no longer experience this problem in new Cloud Training sessions. Thanks for your patience!

JhonFrederick commented 3 weeks ago

Hi,

Was it released today? I ask because I was running tests with epoch training and they all gave me negative epochs. Model IDs: Jw8BPBb2kmX0i7lErCiP, j8NsAEFksZH65pEmt0e1, xvvlSW8VL1YTsCGM5jxr, yEAC9FrDTMgJBxvk42wp. Each model finished at 100% but could not be used, and after a retry ended with -1 epochs.

Once the epoch training I'm currently running finishes, I will try Timed training again and report my results.

sergiuwaxmann commented 3 weeks ago

Hello @JhonFrederick! We released the fix today (when I sent you the message above). Unfortunately, the recent fix does not apply to models trained on earlier versions, so you will need to retrain your models. We sincerely apologize for the inconvenience this causes.

JhonFrederick commented 3 weeks ago

Hi,

My test with epoch training was finally successful. I ran it with only 5 epochs (I didn't want to lose credits as I did with the other models I mentioned above). But when I ran a Timed training, I again got negative epochs and I cannot click the "Resume training" button.

Model ID: M16CdXdrDQNg0h7hGWJw

image image image

I paid for the "Hub Pro" plan expecting to take advantage of the cloud training, but for the entire month I basically never took advantage of it. It's a shame because my plan ends today and I was hoping to test with a better trained model but the issue still persists.

It should be noted that my problem was initially with Timed training, which I chose because I did not know how many epochs my training could take.

I hope that at least all the failed models, including this one, help the team correct the problem.

Edit: Training started yesterday at 4 pm (Colombian time), so it didn't even last 24 hours.

sergiuwaxmann commented 3 weeks ago

I am sorry you had a negative experience with Cloud Training due to issues on our end and we hope these won't happen again. Our team will look at the failed models and refund the account balance spent on them.