Can't finish training, can't load model after it finished training

rolurq commented 3 months ago

Search before asking

[X] I have searched the HUB issues and found no similar bug report.

HUB Component

Models, Training

Bug

I trained my model using Colab. After training finished, the model page in the HUB shows 100%, but the training is not marked as finished. When I run training again on Colab to try to trigger completion, it raises an exception and cannot run.

Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/q38rJZFi6qwbaiJpRL6K 🚀
Found https://storage.googleapis.com/ultralytics-hub.appspot.com/users/n0Mwq1AC3KVneaklMrauVkIsozJ3/models/q38rJZFi6qwbaiJpRL6K/epoch-291.pt locally at weights/epoch-291.pt
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
[<ipython-input-3-252a0c1dfed1>](https://localhost:8080/#) in <cell line: 3>()
      1 hub.login('...')
      2 
----> 3 model = YOLO('https://hub.ultralytics.com/models/q38rJZFi6qwbaiJpRL6K')
      4 results = model.train()

7 frames
[/usr/local/lib/python3.10/dist-packages/ultralytics/models/yolo/model.py](https://localhost:8080/#) in __init__(self, model, task, verbose)
     21         else:
     22             # Continue with default YOLO initialization
---> 23             super().__init__(model=model, task=task, verbose=verbose)
     24 
     25     @property

[/usr/local/lib/python3.10/dist-packages/ultralytics/engine/model.py](https://localhost:8080/#) in __init__(self, model, task, verbose)
    140             self._new(model, task=task, verbose=verbose)
    141         else:
--> 142             self._load(model, task=task)
    143 
    144     def __call__(

[/usr/local/lib/python3.10/dist-packages/ultralytics/engine/model.py](https://localhost:8080/#) in _load(self, weights, task)
    292 
    293         if Path(weights).suffix == ".pt":
--> 294             self.model, self.ckpt = attempt_load_one_weight(weights)
    295             self.task = self.model.args["task"]
    296             self.overrides = self.model.args = self._reset_ckpt_args(self.model.args)

[/usr/local/lib/python3.10/dist-packages/ultralytics/nn/tasks.py](https://localhost:8080/#) in attempt_load_one_weight(weight, device, inplace, fuse)
    853 def attempt_load_one_weight(weight, device=None, inplace=True, fuse=False):
    854     """Loads a single model weights."""
--> 855     ckpt, weight = torch_safe_load(weight)  # load ckpt
    856     args = {**DEFAULT_CFG_DICT, **(ckpt.get("train_args", {}))}  # combine model and default args, preferring model args
    857     model = (ckpt.get("ema") or ckpt["model"]).to(device).float()  # FP32 model

[/usr/local/lib/python3.10/dist-packages/ultralytics/nn/tasks.py](https://localhost:8080/#) in torch_safe_load(weight)
    779             },
    780         ):
--> 781             ckpt = torch.load(file, map_location="cpu")
    782 
    783     except ModuleNotFoundError as e:  # e.name is missing module name

[/usr/local/lib/python3.10/dist-packages/ultralytics/utils/patches.py](https://localhost:8080/#) in torch_load(*args, **kwargs)
     84         kwargs["weights_only"] = False
     85 
---> 86     return _torch_load(*args, **kwargs)
     87 
     88 

[/usr/local/lib/python3.10/dist-packages/torch/serialization.py](https://localhost:8080/#) in load(f, map_location, pickle_module, weights_only, mmap, **pickle_load_args)
   1038             except RuntimeError as e:
   1039                 raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
-> 1040         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
   1041 
   1042 

[/usr/local/lib/python3.10/dist-packages/torch/serialization.py](https://localhost:8080/#) in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
   1260             "functionality.")
   1261 
-> 1262     magic_number = pickle_module.load(f, **pickle_load_args)
   1263     if magic_number != MAGIC_NUMBER:
   1264         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '<'.

Environment

Ultralytics HUB Version: v0.1.46
Client User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
Operating System: Win32
Browser Window Size: 2352 x 1352
Server Timestamp: 1722690165

Minimal Reproducible Example

No response

Additional

pderrenger commented 3 months ago

Hi there,

Thank you for reaching out and providing detailed information about the issue you're facing.

It looks like you're encountering an UnpicklingError when trying to load your model after training. This error typically indicates that the file you're trying to load is corrupted or not in the expected format.
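The load key in the message is a useful clue: pickle reports the first byte it could not interpret, so `'<'` suggests that `weights/epoch-291.pt` begins with markup (for example an HTML or XML error page saved in place of the weights) rather than a PyTorch checkpoint. Below is a minimal sketch for checking this by peeking at the first bytes of the file. `describe_header` is a hypothetical helper written for illustration, not part of Ultralytics or PyTorch, and the labels it returns are only a rough guess.

```python
from pathlib import Path


def describe_header(path):
    """Roughly classify a file by its first bytes (illustrative helper)."""
    with open(path, "rb") as f:
        head = f.read(16)
    if not head:
        return "empty"
    if head.startswith(b"PK\x03\x04"):
        # Zip archive: the format torch.save has written by default since PyTorch 1.6
        return "zip"
    if head.startswith(b"\x80"):
        # Pickle protocol marker: consistent with the legacy torch.save format
        return "pickle"
    if head.lstrip().startswith(b"<"):
        # Markup such as HTML or XML, e.g. a server error page
        return "markup"
    return "unknown"


if __name__ == "__main__":
    target = Path("weights/epoch-291.pt")  # path taken from the log above
    if target.exists():
        print(describe_header(target))
```

If this reports `markup`, opening the file in a text editor should show the actual server response, which may explain why the download failed.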

Here are a few steps you can take to troubleshoot and resolve this issue:

1. Verify Model File Integrity: Ensure that the model file (epoch-291.pt) is not corrupted. You can try downloading the file again from the Ultralytics HUB to see if the issue persists.
2. Update Packages: Make sure you are using the latest versions of the Ultralytics and PyTorch packages. You can update them using the following commands:
   ```
   pip install --upgrade ultralytics
   pip install --upgrade torch
   ```
3. Re-run Training: Sometimes, re-running the training process can help resolve issues with corrupted files. Ensure that you have a stable internet connection during the training process to avoid any interruptions.
4. Check File Path: Ensure that the file path provided is correct and that the file exists at the specified location.
5. Use Local File: If the file is available locally, you can try loading it directly from your local system instead of using the URL:
   ```
   model = YOLO('weights/epoch-291.pt')
   ```
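
The log line "Found ... locally at weights/epoch-291.pt" suggests that an existing local copy is reused instead of being downloaded again, so a bad copy may need to be moved aside before retrying. A minimal sketch follows, assuming the checkpoint was saved in the zip-based format that `torch.save` has used by default since PyTorch 1.6 (a legacy-format checkpoint would be flagged incorrectly, which is why the file is renamed rather than deleted). `quarantine_if_not_zip` is a hypothetical helper, not an Ultralytics function.

```python
import zipfile
from pathlib import Path


def quarantine_if_not_zip(path):
    """Rename `path` to `<name>.bad` if it exists and is not a zip archive.

    Returns the new path when the file was renamed, otherwise None.
    """
    path = Path(path)
    if not path.exists() or zipfile.is_zipfile(path):
        return None
    bad = path.with_name(path.name + ".bad")
    path.replace(bad)  # keep the file for inspection instead of deleting it
    return bad


if __name__ == "__main__":
    moved = quarantine_if_not_zip("weights/epoch-291.pt")  # path from the log above
    if moved is not None:
        print(f"Moved invalid checkpoint to {moved}")
```

After the file has been moved aside, loading the model from the HUB URL again should trigger a fresh download, assuming the remote file itself is intact.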

If the issue persists after trying these steps, please provide additional details such as any error messages or logs you encounter. This will help us further diagnose the problem.

For more detailed guidance, you can refer to our Ultralytics HUB Quickstart Guide.

Feel free to reach out if you have any more questions or need further assistance. We're here to help! 😊

sergiuwaxmann commented 3 months ago

@rolurq It looks like you have a checkpoint for epoch 291. Can you try resuming training?

rolurq commented 3 months ago

@sergiuwaxmann As I mentioned in the post, when I try to resume training it throws an exception. The full traceback is also in the post.
