Open liamw9534 opened 3 months ago
I am also trying to resume the training using an agent but getting the following error when running the provided python code fragment:
2024-08-08 11:32:25,366 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-08-08 11:32:25,366 - hub_sdk.helpers.logger - ERROR - Failed to start heartbeats: 'NoneType' object has no attribute 'json'
Exception in thread Thread-1 (_start_heartbeats):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/hub_sdk/base/server_clients.py", line 151, in _start_heartbeats
    raise e
  File "/opt/conda/lib/python3.10/site-packages/hub_sdk/base/server_clients.py", line 139, in _start_heartbeats
    res = self.post(endpoint, json=payload).json()
AttributeError: 'NoneType' object has no attribute 'json'
The hub login step worked fine, but attaching to the remote model resulted in the above error. I am able to resume training on the model, but without syncing back to Ultralytics HUB, since it doesn't appear to show that my agent has connected.
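The last frame of the traceback suggests that `self.post(...)` returned `None` and `.json()` was then called on it. Below is a minimal sketch of a guard against that pattern. It is not the actual hub_sdk code; `safe_json` and `FakeResponse` are hypothetical names used only for illustration.

```python
from typing import Any, Optional


def safe_json(response: Optional[Any]) -> Optional[dict]:
    """Return response.json(), or None when the request produced no response.

    Hypothetical helper: illustrates guarding against a None response,
    not how hub_sdk actually handles it.
    """
    if response is None:
        # This is the case that raises AttributeError when .json() is
        # called directly on the result of a failed request.
        return None
    return response.json()


class FakeResponse:
    """Stand-in for an HTTP response object, used only for illustration."""

    def __init__(self, payload: dict):
        self._payload = payload

    def json(self) -> dict:
        return self._payload
```

With a guard like this, the caller can check for `None` and log or retry instead of letting the heartbeat thread die with an unhandled exception.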
Looking at the output of running model.train() in an agent, I can see that there is a problem with dataset.yaml and its path attribute, which causes errors in the validation step. I suspect this is why the ultralytics training server disconnect occurred after 100 epochs.
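For reference, here is a minimal sketch of the kind of pre-flight check that might catch a bad path before training starts. It assumes the config has already been parsed into a dict with the usual `path`, `train` and `val` keys, and that `train` and `val` are single string paths. `check_dataset_paths` and `base_dir` are hypothetical names, and how Ultralytics itself resolves a relative `path` may differ from this.

```python
from pathlib import Path
from typing import List


def check_dataset_paths(cfg: dict, base_dir: Path) -> List[str]:
    """Return a list of problems found in the path/train/val entries.

    cfg is an already-parsed dataset.yaml; base_dir is the directory that a
    relative 'path' is resolved against (an assumption of this sketch).
    """
    problems = []
    root = Path(cfg.get("path", "."))
    if not root.is_absolute():
        root = base_dir / root
    if not root.is_dir():
        problems.append(f"path does not exist: {root}")
    for split in ("train", "val"):
        rel = cfg.get(split)
        if rel is None:
            problems.append(f"missing '{split}' entry")
            continue
        # Split entries are taken relative to the dataset root unless absolute.
        target = Path(rel) if Path(rel).is_absolute() else root / rel
        if not target.exists():
            problems.append(f"{split} not found: {target}")
    return problems
```

An empty list means all three entries resolved to something on disk; any message in the list points at the entry to fix.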
So I do now have a fully trained model on the agent I am using, but I am not able to sync this back to ultralytics because of the previous Python run-time error, AttributeError: 'NoneType' object has no attribute 'json'.
Ok, I ran the final training epoch again. I got the same AttributeError: 'NoneType' object has no attribute 'json' error, but this time the training has synced back into ultralytics cloud. Not sure what the cause of this python error is, but it does not seem to be fatal.
Ok, I think the summary of this is:
An incorrect path attribute in dataset.yaml can cause the final training validation step to fail.

@liamw9534 thank you for the detailed summary and for sharing your experience. It sounds like you've encountered a few issues that could benefit from further investigation. Here are some steps and suggestions to address the points you've raised:
Incorrect path Attribute in dataset.yaml:
An incorrect path attribute can cause validation steps to fail. To prevent this, please ensure that the dataset.yaml file is correctly configured before uploading it to the Ultralytics HUB. This includes verifying that all paths are accurate and accessible.
Detection of Issues During Dataset Upload:
Training Disconnect and Odd State at 100%:
Recovery with External Agent and Python Runtime Error:
The AttributeError: 'NoneType' object has no attribute 'json' error you encountered is concerning. It appears to be related to the heartbeat mechanism in the SDK. While it didn't prevent the sync in your case, it should be addressed. Please ensure your SDK is up-to-date. If the error continues, providing detailed logs in a new issue would be helpful for us to diagnose and fix the problem.
Manual Correction of dataset.yaml:
dataset.yaml file. For future reference, always double-check the dataset configuration before starting the training process.

We appreciate your patience and understanding as we work to improve the Ultralytics HUB experience. If you have any further questions or need additional assistance, please don't hesitate to reach out here.
Thank you for being a part of the YOLO community! 😊
Search before asking
HUB Component
Training
Bug
I have been training a yolov8m model using an ultralytics GPU instance. The training dialog indicates "disconnected" at 100% complete with 0 epochs remaining but there is no trained model output available. The dialog is prompting me to resume training from epoch 98 but training has already completed for 100 epochs.
Environment
No response
Minimal Reproducible Example
path: attribute.
path: attribute not finding validation images.
Additional