Open liamw9534 opened 1 month ago
I am also trying to resume training using an agent, but I get the following error when running the provided Python code fragment:
2024-08-08 11:32:25,366 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
2024-08-08 11:32:25,366 - hub_sdk.helpers.logger - ERROR - Failed to start heartbeats: 'NoneType' object has no attribute 'json'
Exception in thread Thread-1 (_start_heartbeats):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/hub_sdk/base/server_clients.py", line 151, in _start_heartbeats
    raise e
  File "/opt/conda/lib/python3.10/site-packages/hub_sdk/base/server_clients.py", line 139, in _start_heartbeats
    res = self.post(endpoint, json=payload).json()
AttributeError: 'NoneType' object has no attribute 'json'
The hub login step worked fine, but the step that attaches to the remote model produced the above error. I am able to resume training on the model, but without syncing back to Ultralytics HUB, since it does not appear to show that my agent has connected.
Looking at the output of running model.train() in an agent, I can see that there is a problem with dataset.yaml and its path attribute, which causes errors in the validation step. I suspect this is why the Ultralytics training server disconnected after 100 epochs.
So I now have a fully trained model on the agent I am using, but I am not able to sync it back to Ultralytics because of the earlier Python run-time error, AttributeError: 'NoneType' object has no attribute 'json'.
OK, I ran the final training epoch again. I got the same AttributeError: 'NoneType' object has no attribute 'json' error, but this time the training synced back to the Ultralytics cloud. I am not sure what causes this Python error, but it does not seem to be fatal.
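For anyone reading the traceback above: the failing line calls .json() directly on the return value of self.post(...), so the error means that call returned None instead of a response object. The following is only a minimal sketch of a guarded version of that pattern, not the hub_sdk implementation. The names safe_json and FakeResponse are hypothetical and exist only for this illustration.

```python
from typing import Any, Optional


def safe_json(response: Optional[Any]) -> Optional[dict]:
    """Return the decoded JSON body, or None when there is nothing to decode.

    Hypothetical helper: it only illustrates guarding against a request
    wrapper that returns None instead of a response object.
    """
    if response is None:
        # No response object was returned, so there is no body to parse.
        return None
    try:
        return response.json()
    except ValueError:
        # The body was present but was not valid JSON.
        return None


class FakeResponse:
    """Stand-in for an HTTP response object, used only in this example."""

    def __init__(self, payload: dict):
        self._payload = payload

    def json(self) -> dict:
        return self._payload


if __name__ == "__main__":
    print(safe_json(None))
    print(safe_json(FakeResponse({"ok": True})))
```

With a guard like this, a missing response becomes a value the caller can check and log, rather than an exception raised inside the heartbeat thread.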
Ok, I think the summary of this is:
dataset.yaml can cause the final training validation step to fail.
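A quick local check can catch this kind of dataset.yaml problem before a long run reaches validation. The sketch below assumes the common layout in which train and val are given relative to path, and that a relative path is resolved against a base directory you supply. The function name missing_dataset_dirs is hypothetical, and the config is passed in as an already-parsed dict so the example needs only the standard library.

```python
from pathlib import Path
from typing import Any, Dict, List


def missing_dataset_dirs(cfg: Dict[str, Any], base_dir: Path) -> List[str]:
    """List the dataset.yaml entries whose directories cannot be found.

    Assumption: 'train' and 'val' are relative to 'path', and a relative
    'path' is resolved against base_dir.
    """
    root = Path(cfg.get("path", ""))
    if not root.is_absolute():
        root = base_dir / root

    problems: List[str] = []
    if not root.is_dir():
        problems.append(f"path: {root} does not exist")

    for key in ("train", "val"):
        rel = cfg.get(key)
        if rel is None:
            problems.append(f"{key}: entry is missing")
            continue
        # Absolute entries are used as-is; relative ones hang off the root.
        target = Path(rel) if Path(rel).is_absolute() else root / rel
        if not target.exists():
            problems.append(f"{key}: {target} does not exist")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration; substitute your own dataset.yaml contents.
    example = {"path": "my_dataset", "train": "images/train", "val": "images/val"}
    for line in missing_dataset_dirs(example, Path.cwd()):
        print(line)
```

An empty result means every checked directory exists under the assumed layout. It does not confirm that the training server resolves the paths the same way.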
HUB Component
Training
Bug
I have been training a YOLOv8m model using an Ultralytics GPU instance. The training dialog indicates "disconnected" at 100% complete with 0 epochs remaining, but no trained model output is available. The dialog prompts me to resume training from epoch 98, even though training has already completed 100 epochs.
Environment
No response
Minimal Reproducible Example
path: attribute.
path: attribute not finding validation images.
Additional