ultralytics / hub

Ultralytics HUB tutorials and support
https://hub.ultralytics.com
GNU Affero General Public License v3.0

Hub training disconnected at 100% complete with 0/100 epochs remaining but prompts to resume training from epoch 98 #797

Open liamw9534 opened 3 months ago

liamw9534 commented 3 months ago

HUB Component

Training

Bug

I have been training a YOLOv8m model on an Ultralytics GPU instance. The training dialog shows "disconnected" at 100% complete with 0 epochs remaining, but no trained model output is available. The dialog prompts me to resume training from epoch 98, even though training has already run for all 100 epochs.

Environment

No response

liamw9534 commented 3 months ago

I am also trying to resume the training using an agent, but I get the following error when running the provided Python code fragment:

2024-08-08 11:32:25,366 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-08-08 11:32:25,366 - hub_sdk.helpers.logger - ERROR - Failed to start heartbeats: 'NoneType' object has no attribute 'json'
Exception in thread Thread-1 (_start_heartbeats):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/hub_sdk/base/server_clients.py", line 151, in _start_heartbeats
    raise e
  File "/opt/conda/lib/python3.10/site-packages/hub_sdk/base/server_clients.py", line 139, in _start_heartbeats
    res = self.post(endpoint, json=payload).json()
AttributeError: 'NoneType' object has no attribute 'json'

The HUB login step worked fine, but the step that attaches to the remote model produced the error above. I am able to resume training on the model, but without syncing back to Ultralytics HUB, since HUB does not appear to show that my agent has connected.
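The last frame of the traceback calls `.json()` directly on the return value of `self.post(...)`, so the error means that `post` returned `None` instead of a response object. The following is a minimal sketch of that failure mode and of a guard against it. It is not hub_sdk code: `FakeResponse`, `failing_post`, `working_post` and `safe_json` are hypothetical stand-ins used only for illustration.

```python
from typing import Any, Callable, Optional


class FakeResponse:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self, payload: dict):
        self._payload = payload

    def json(self) -> dict:
        return self._payload


def failing_post(endpoint: str, json: dict) -> None:
    """Simulates a request helper that returns None when the request fails."""
    return None


def working_post(endpoint: str, json: dict) -> FakeResponse:
    """Simulates a request helper that returns a response object."""
    return FakeResponse({"ok": True})


def safe_json(post: Callable[..., Optional[Any]], endpoint: str, payload: dict) -> Optional[dict]:
    """Call post() and decode the body, or return None if no response came back."""
    res = post(endpoint, json=payload)
    if res is None:  # nothing to decode; let the caller decide whether to retry
        return None
    return res.json()


# Unguarded call: reproduces the AttributeError from the traceback.
try:
    failing_post("/heartbeat", json={}).json()
except AttributeError as err:
    print(f"unguarded call failed: {err}")

# Guarded call: a missing response is reported as None instead of raising.
print(safe_json(failing_post, "/heartbeat", {}))
print(safe_json(working_post, "/heartbeat", {}))
```

Whether a missing heartbeat response should be retried, logged or ignored is a design decision for the SDK; the sketch only shows where the `None` comes from.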

liamw9534 commented 3 months ago

Looking at the output of running model.train() in an agent, I can see that there is a problem with dataset.yaml: its path attribute causes errors in the validation step. I suspect this is why the Ultralytics training server disconnected after 100 epochs.

So I now have a fully trained model on the agent I am using, but I cannot sync it back to Ultralytics because of the earlier Python runtime error, AttributeError: 'NoneType' object has no attribute 'json'.

liamw9534 commented 3 months ago

Ok, I ran the final training epoch again. I got the same AttributeError: 'NoneType' object has no attribute 'json' error, but this time the training synced back into the Ultralytics cloud. I'm not sure what causes this Python error, but it does not seem to be fatal.

liamw9534 commented 3 months ago

Ok, I think the summary of this is:

pderrenger commented 3 months ago

@liamw9534 thank you for the detailed summary and for sharing your experience. It sounds like you've encountered a few issues that could benefit from further investigation. Here are some steps and suggestions to address the points you've raised:

  1. Incorrect path Attribute in dataset.yaml:

    • You're correct that an incorrect path attribute can cause validation steps to fail. To prevent this, please ensure that the dataset.yaml file is correctly configured before uploading it to the Ultralytics HUB. This includes verifying that all paths are accurate and accessible.
  2. Detection of Issues During Dataset Upload:

    • We appreciate your feedback on this. We will look into improving our dataset validation checks during the upload process to catch such issues earlier.
  3. Training Disconnect and Odd State at 100%:

    • This seems to be a bug. Please ensure you are using the latest version of the Ultralytics packages. If the issue persists, we encourage you to open a new issue with detailed logs and steps to reproduce the problem.
  4. Recovery with External Agent and Python Runtime Error:

    • The AttributeError: 'NoneType' object has no attribute 'json' error you encountered is concerning. It appears to be related to the heartbeat mechanism in the SDK. While it didn't prevent the sync in your case, it should be addressed. Please ensure your SDK is up-to-date. If the error continues, providing detailed logs in a new issue would be helpful for us to diagnose and fix the problem.
  5. Manual Correction of dataset.yaml:

    • It's good to hear that you were able to recover by manually correcting the dataset.yaml file. For future reference, always double-check the dataset configuration before starting the training process.
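As a rough illustration of the dataset.yaml check suggested in points 1 and 5, the sketch below verifies that the `path`, `train` and `val` entries of an already parsed dataset configuration point at locations that exist. It is not part of Ultralytics or the HUB: `check_dataset_paths` is a hypothetical helper, it assumes `train` and `val` are single strings relative to `path`, and it assumes a relative `path` is resolved against the directory containing the YAML file, which may differ from how Ultralytics resolves it.

```python
from pathlib import Path


def check_dataset_paths(cfg: dict, yaml_dir: Path) -> list[str]:
    """Return a list of human-readable problems found in a parsed dataset config.

    Hypothetical helper: assumes 'train' and 'val' are strings relative to 'path'.
    """
    problems = []

    # Assumption: a relative 'path' is taken relative to the YAML file's directory.
    root = Path(cfg.get("path", "."))
    if not root.is_absolute():
        root = yaml_dir / root
    if not root.is_dir():
        problems.append(f"path: {root} is not a directory")

    for split in ("train", "val"):
        rel = cfg.get(split)
        if rel is None:
            problems.append(f"{split}: key is missing")
        elif not (root / rel).exists():
            problems.append(f"{split}: {root / rel} does not exist")

    return problems


if __name__ == "__main__":
    # Example config as it might look after loading dataset.yaml into a dict.
    cfg = {"path": "my_dataset", "train": "images/train", "val": "images/val"}
    for problem in check_dataset_paths(cfg, Path.cwd()):
        print(problem)
```

Running a check like this before uploading would surface a wrong `path` entry immediately instead of at the validation step.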

We appreciate your patience and understanding as we work to improve the Ultralytics HUB experience. If you have any further questions or need additional assistance, please don't hesitate to reach out here.

Thank you for being a part of the YOLO community! 😊