ultralytics / hub

Ultralytics HUB tutorials and support
https://hub.ultralytics.com
GNU Affero General Public License v3.0

How to continue my interrupted work #641

Closed Zero-start0 closed 3 weeks ago

Zero-start0 commented 1 month ago

Search before asking

Question

My training in Ultralytics HUB gets interrupted before all epochs have been trained. How can I continue from the point where it was interrupted?

Additional

No response

github-actions[bot] commented 1 month ago

👋 Hello @Zero-start0, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

sergiuwaxmann commented 1 month ago

Hi @Zero-start0! When training a model using Ultralytics HUB, we try to save a checkpoint every 15 minutes. If a checkpoint is saved, you can resume training - the resume option is shown on the model page automatically.

Zero-start0 commented 4 weeks ago

However, although I can see that the checkpoint has been uploaded to HUB, I can't find the resume option.

Zero-start0 commented 4 weeks ago

Could you give detailed instructions? This problem is driving me mad.

sergiuwaxmann commented 4 weeks ago

@Zero-start0 If your model is disconnected and a checkpoint is saved, the message on the model page should be "Resume training from epoch X". What do you see on the model page?

Zero-start0 commented 4 weeks ago

[image] [image]

Zero-start0 commented 4 weeks ago

What should I do now?

Zero-start0 commented 4 weeks ago

[image]

I can see the checkpoint, but I don't have the resume option.

Zero-start0 commented 4 weeks ago

If I reconnect, the model will train from epoch 1.

sergiuwaxmann commented 4 weeks ago

@Zero-start0 Based on the Ultralytics HUB screenshots you shared, there is no checkpoint saved in Ultralytics HUB. I have attached an image of a model that has a checkpoint saved in Ultralytics HUB. [image: resume]

Looking at the ultralytics logs you shared, I can see that a checkpoint began uploading - perhaps the process did not succeed or something went wrong. Our team will investigate if there is an issue on our end related to the upload, and I will keep you updated.

Zero-start0 commented 4 weeks ago

So how can I continue my work? Is there any method I can use to continue?

Zero-start0 commented 4 weeks ago

For now, this is the only method I can use to train my model:

    from ultralytics import YOLO

    # Load a model
    model = YOLO('../ultralytics/runs/detect/train/weights/last.pt')  # load a partially trained model

    # Resume training
    results = model.train(resume=True)

Zero-start0 commented 4 weeks ago

I have run into this situation many times. If there is a solution, please let me know.

Zero-start0 commented 4 weeks ago

What's more, how can I upload my trained model to the hub?

sergiuwaxmann commented 4 weeks ago

@Zero-start0 Yes, resuming locally from last.pt with model.train(resume=True), as in your snippet, is a valid temporary solution. As mentioned above, I will keep you updated. Thank you for understanding!
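
A minimal sketch of the local workaround, assuming the trainer wrote last.pt under a runs directory laid out like the path in the snippet (…/train/weights/last.pt). The helper name find_latest_checkpoint is hypothetical and not part of ultralytics; it only locates the most recently modified last.pt so the path does not have to be typed by hand.

```python
from pathlib import Path


def find_latest_checkpoint(runs_dir):
    """Return the most recently modified last.pt under runs_dir, or None.

    Hypothetical helper: assumes checkpoints live at <run>/weights/last.pt.
    """
    candidates = list(Path(runs_dir).glob("**/weights/last.pt"))
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Usage (requires the ultralytics package; left commented out here):
# ckpt = find_latest_checkpoint("../ultralytics/runs")
# if ckpt is not None:
#     from ultralytics import YOLO
#     YOLO(str(ckpt)).train(resume=True)
```

If the helper returns None, no local checkpoint was found under that directory, and resuming this way is not possible from that machine or session.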

LightDex9 commented 4 weeks ago

Hello, I have the same problem: the training stops after 33 epochs and I can't resume it (I'm using Colab).

pderrenger commented 4 weeks ago

Hello! 😊 If your training in Colab stops and you're unable to resume it directly, make sure you're saving checkpoints at regular intervals during training. After a stoppage, you can resume training from the last saved checkpoint by specifying its path when initializing your training command. Please make sure your code for resuming training on Colab includes the path to the checkpoint. Remember, consistent checkpoints are key for smoothly resuming training, especially in environments like Colab that have time limits on sessions.

LightDex9 commented 4 weeks ago

Thanks for the reply. How can I see the path where the last checkpoint is saved on Colab? During training it says "Uploading Checkpoints https://hub.ultralytics.com/models/..." every 3 epochs, but I can't see any saved checkpoints on Ultralytics HUB.

Edit: I have now noticed that Colab prints "WARNING ⚠️ using HUB training arguments, ignoring local training arguments." and that the "save_period" argument is equal to -1 in the HUB training arguments.

[image: Screenshot 2024-04-16 071439]

Zero-start0 commented 4 weeks ago

Yes, I have the same question. Although "Uploading Checkpoint" was shown in the notebook, I can't find any checkpoint in Ultralytics HUB.

sergiuwaxmann commented 4 weeks ago

Hello @LightDex9! Indeed, the local training arguments are ignored when training a model from Ultralytics HUB.

The log message you see (Ultralytics HUB: Uploading checkpoint...) is printed when the upload starts. It does not check whether the upload was successful, and the upload is not retried if the job is interrupted.

Our team will investigate if there is an issue on our end related to the upload, and I will keep you updated.
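
To illustrate the gap described here (a log line at upload start, no success check, no retry), the following is a generic sketch of an upload wrapper that confirms success and retries with backoff. It is not the HUB client's implementation: upload_with_retry and the upload_fn callable (assumed to return True on success) are hypothetical names introduced only for this example.

```python
import time


def upload_with_retry(upload_fn, retries=3, delay=1.0, sleep=time.sleep):
    """Call upload_fn until it reports success or attempts run out.

    upload_fn is a hypothetical callable that returns True on success;
    any exception it raises is treated as a failed attempt.
    """
    for attempt in range(1, retries + 1):
        try:
            if upload_fn():
                print(f"checkpoint upload confirmed on attempt {attempt}")
                return True
        except Exception as exc:
            print(f"attempt {attempt} raised {exc!r}")
        if attempt < retries:
            # Exponential backoff: delay, 2*delay, 4*delay, ...
            sleep(delay * 2 ** (attempt - 1))
    print("checkpoint upload failed after all attempts")
    return False
```

With a wrapper like this, a success message is only printed after the upload has actually been confirmed, so a log line would not be mistaken for a saved checkpoint.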

sergiuwaxmann commented 4 weeks ago

@Zero-start0 @LightDex9

I wanted to update you on the recent release from ultralytics, version 8.2.0, which addresses the issue you encountered. The checkpoints are now being uploaded correctly.

For verification, I conducted a test in my local virtual environment. I modified the "ckpt" value from 900.0 to 1.0 in the ultralytics/hub/session.py file and initiated training using my local agent. The results confirmed that the fix is effective.

Please feel free to reach out if you encounter any further issues.
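
The 15-minute checkpoint interval mentioned earlier in the thread and the "ckpt" value of 900.0 (presumably seconds) can be pictured as a simple time-based throttle. The sketch below is an illustration only, not the code in ultralytics/hub/session.py; the class name CheckpointThrottle and its ready method are hypothetical.

```python
import time


class CheckpointThrottle:
    """Allow an action at most once per `interval` seconds (illustrative only)."""

    def __init__(self, interval=900.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last = clock()  # the timer starts when the throttle is created

    def ready(self):
        """Return True, and reset the timer, if the interval has elapsed."""
        now = self.clock()
        if now - self.last > self.interval:
            self.last = now
            return True
        return False


# Example: check once per "epoch" and upload only when the throttle allows it.
# throttle = CheckpointThrottle(interval=900.0)
# if throttle.ready():
#     ...  # upload the checkpoint here
```

Under this model, a run that stops or disconnects within the first interval would never trigger an upload, which is why lowering the interval to 1.0 makes the upload path easy to exercise in a short test.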