ultralytics / hub

Ultralytics HUB tutorials and support
https://hub.ultralytics.com
GNU Affero General Public License v3.0

Hub training disconnected at 100% complete with 0/100 epochs remaining but prompts to resume training from epoch 98 #797

Open liamw9534 opened 1 month ago

liamw9534 commented 1 month ago

HUB Component

Training

Bug

I have been training a YOLOv8m model on an Ultralytics GPU instance. The training dialog shows "disconnected" at 100% complete with 0 epochs remaining, but no trained model output is available. The dialog prompts me to resume training from epoch 98, even though training has already completed all 100 epochs.

Environment

No response

liamw9534 commented 1 month ago

I am also trying to resume the training using an agent, but I get the following error when running the provided Python code fragment:

2024-08-08 11:32:25,366 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-08-08 11:32:25,366 - hub_sdk.helpers.logger - ERROR - Failed to start heartbeats: 'NoneType' object has no attribute 'json'
Exception in thread Thread-1 (_start_heartbeats):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/hub_sdk/base/server_clients.py", line 151, in _start_heartbeats
    raise e
  File "/opt/conda/lib/python3.10/site-packages/hub_sdk/base/server_clients.py", line 139, in _start_heartbeats
    res = self.post(endpoint, json=payload).json()
AttributeError: 'NoneType' object has no attribute 'json'

The hub login step worked fine, but the step that attaches to the remote model produced the error above. I am able to resume training on the model, but without sync back to Ultralytics HUB, since HUB doesn't appear to show that my agent has connected.
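The last frame of the traceback suggests that `self.post(...)` returned `None` and `.json()` was then called on it. The snippet below is only a minimal sketch of how such a call can be guarded. It is not the hub_sdk implementation: `send_heartbeat`, `FakeResponse`, the two `post` stand-ins, and the `/heartbeat` endpoint are all hypothetical names made up for illustration, and it assumes the request helper returns `None` when the request fails.

```python
from typing import Any, Callable, Optional


class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self, data: dict):
        self._data = data

    def json(self) -> dict:
        return self._data


def send_heartbeat(
    post: Callable[..., Optional[Any]], endpoint: str, payload: dict
) -> Optional[dict]:
    """Post one heartbeat; return the decoded body, or None if the request failed.

    Assumption: `post` returns None on failure instead of raising.
    """
    res = post(endpoint, json=payload)
    if res is None:
        # Calling res.json() here would raise the AttributeError seen in the traceback.
        return None
    return res.json()


def failing_post(endpoint: str, json: Optional[dict] = None) -> None:
    """Hypothetical request helper that simulates a failed request."""
    return None


def working_post(endpoint: str, json: Optional[dict] = None) -> FakeResponse:
    """Hypothetical request helper that simulates a successful request."""
    return FakeResponse({"ok": True})


print(send_heartbeat(failing_post, "/heartbeat", {"status": "alive"}))  # None
print(send_heartbeat(working_post, "/heartbeat", {"status": "alive"}))  # {'ok': True}
```

With a guard like this, a failed request can be reported or retried by the caller instead of crashing the heartbeat thread.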

liamw9534 commented 1 month ago

Looking at the output of running model.train() in an agent, I can see that there is a problem with dataset.yaml: its path attribute causes errors in the validation step. I suspect this is why the Ultralytics training server disconnected after 100 epochs.

So I now have a fully trained model on the agent I am using, but I am not able to sync it back to Ultralytics because of the earlier Python runtime error, AttributeError: 'NoneType' object has no attribute 'json'.
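A path problem in dataset.yaml like the one described above can be caught before training starts. The following is a rough sketch, not part of Ultralytics: `find_missing_dataset_paths` is a hypothetical helper. It assumes the YAML has already been parsed into a dict, that `train` and `val` are single strings relative to `path`, and that a relative `path` is resolved against the directory holding the YAML file. Ultralytics itself may resolve relative paths differently.

```python
import tempfile
from pathlib import Path


def find_missing_dataset_paths(cfg: dict, yaml_dir: Path) -> list:
    """Return a description of each dataset location that does not exist on disk.

    Assumption: a relative `path` is resolved against `yaml_dir`.
    """
    root = Path(cfg.get("path", "."))
    if not root.is_absolute():
        root = yaml_dir / root

    missing = []
    if not root.is_dir():
        missing.append(f"path: {root}")

    for key in ("train", "val"):
        value = cfg.get(key)
        if value is None:
            missing.append(f"{key}: <not set>")
            continue
        target = root / value
        if not target.exists():
            missing.append(f"{key}: {target}")
    return missing


# Demo on a throwaway directory tree in which only the training images exist.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "data" / "images" / "train").mkdir(parents=True)
    cfg = {"path": "data", "train": "images/train", "val": "images/val"}
    for problem in find_missing_dataset_paths(cfg, base):
        print("missing", problem)
```

In the demo only the `val` entry is reported, which mirrors a failure that shows up in the validation step rather than during training.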

liamw9534 commented 1 month ago

Ok, I ran the final training epoch again. I got the same AttributeError: 'NoneType' object has no attribute 'json' error, but this time the training synced back to Ultralytics cloud. I'm not sure what causes this Python error, but it does not seem to be fatal.

liamw9534 commented 1 month ago

Ok, I think the summary of this is: