Closed: gazzadi closed this issue 1 year ago
Hello @gazzadi, thank you for your interest in YOLOv5! Please visit our Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a Bug Report, please provide a minimum reproducible example to help us debug it.
If this is a custom training Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.
Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.
We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8!
Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.
Check out our YOLOv8 Docs for details and get started with:
pip install ultralytics
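As a quick illustration (the model name and image path below are placeholders), a minimal Python usage sketch with the ultralytics package might look like this:
from ultralytics import YOLO

# Load a pretrained YOLOv8 nano model; the weights are downloaded on first use
model = YOLO('yolov8n.pt')

# Run inference on an image of your choice (the path is illustrative)
results = model('path/to/image.jpg')
print(results[0].boxes)  # detected boxes, classes and confidences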
@gazzadi thank you for reaching out with your question.
The error you encountered occurs because the merge.pt file you saved does not contain the necessary information to load the model. When you save the state dictionary using torch.save(), it only saves the weights and not the entire model architecture.
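As a minimal illustration of that difference, with a toy module standing in for the YOLOv5 model and placeholder file names:
import torch
import torch.nn as nn

model = nn.Linear(2, 2)  # toy module used only to illustrate the two save modes

# Saves only the parameter tensors; loading them later requires rebuilding the same architecture first
torch.save(model.state_dict(), 'weights_only.pt')

# Pickles the entire nn.Module, architecture included, which is closer to what YOLOv5 checkpoints contain
torch.save(model, 'full_model.pt')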
To correctly merge the weights from two models, you need to create a new model and load the state dictionaries into it. Here's an updated code snippet:
import torch
from models.common import Detect
modelA = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/user.pt')
sdA = modelA.state_dict()
modelB = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/central.pt')
sdB = modelB.state_dict()
sdC = {}
for key in sdA:
    sdC[key] = (sdA[key] + sdB[key]) / 2
modelC = Detect()
modelC.load_state_dict(sdC)
torch.save(modelC.state_dict(), "./models/merge.pt")
merged_model = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/merge.pt')
In the updated code, we create a new model modelC with the same architecture as the original models but without loading any weights. We then load the merged state dictionary sdC into modelC. Finally, we load the merged model using torch.hub.load() by providing the path to the merge.pt file.
This should resolve the error and allow you to load the merged model successfully.
Let me know if you have any further questions!
Thanks for your answer, but I seem to have a problem with this solution as well.
The import doesn't find Detect in the common file.
The code I've run is the same you wrote me, but it returns this error:
Traceback (most recent call last):
File "E:\TirocinioVero\text_and_drive\Terza_Fase\Progetto\UserA\yolov5\prova.py", line 2, in <module>
from models.common import Detect
ImportError: cannot import name 'Detect' from 'models.common' (E:\TirocinioVero\text_and_drive\Terza_Fase\Progetto\UserA\yolov5\models\common.py)
I haven't worked on or modified the YOLO files, so I don't know what the problem is. I read the common.py file and haven't found a definition of Detect in it.
The only similar classes I've found are in the yolo.py file, but they don't work either.
@gazzadi the Detect class is not defined in models/common.py; as you found, the detection head lives in models/yolo.py, so the import in the previous snippet was incorrect. Apologies for the confusion in the earlier solution.
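For reference, if you ever do need that class directly, the import (run from inside a cloned yolov5 repository) would be:
from models.yolo import Detect  # Detect lives in models/yolo.py, not models/common.py
The snippet below avoids that import entirely.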
To correctly merge the weights from two YOLOv5 models, you can follow the steps below:
import torch
from torch import nn
# Load the model state dictionaries
modelA = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/user.pt')
sdA = modelA.state_dict()
modelB = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/central.pt')
sdB = modelB.state_dict()
# Merge the state dictionaries
sdC = {}
for key in sdA:
    sdC[key] = (sdA[key] + sdB[key]) / 2
# Create a new model with the merged weights
modelC = torch.hub.load('ultralytics/yolov5', 'custom')
modelC.load_state_dict(sdC)
# Save the merged model
torch.save(modelC.state_dict(), "./models/merge.pt")
# Load the merged model
merged_model = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/merge.pt')
In this updated code, we create a new model (modelC) using torch.hub.load('ultralytics/yolov5', 'custom') without loading any weights. We then merge the state dictionaries sdA and sdB by averaging their values. Finally, we load the merged model using torch.hub.load() and the path to the saved merge.pt file.
This should resolve the issue you're facing and allow you to merge the weights of the two YOLOv5 models successfully.
Let me know if you have any further questions!
I've tried the code you suggested multiple times, but it's not working as we hoped.
This command doesn't work when written in this format:
modelC = torch.hub.load('ultralytics/yolov5', 'custom')
To retrieve the default yolov5s model, I found that I can write
modelC = torch.hub.load('ultralytics/yolov5', 'yolov5s')
But there's another problem I'm facing: the model just retrieved and the ones I have trained differ in the number of layers. I specify that modelA and modelB were trained starting from the base weights yolov5s.pt.
I don't know if there is a way to train that doesn't alter the layers of the base model.
Next I show the code I have used for the merge:
import torch
from torch import nn
# Load the model state dictionaries
modelA = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/user.pt')
sdA = modelA.state_dict()
modelB = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/central.pt')
sdB = modelB.state_dict()
# Merge the state dictionaries
sdC = {}
for key in sdA:
    sdC[key] = (sdA[key] + sdB[key]) / 2
# Create a new model with the merged weights
modelC = torch.hub.load('ultralytics/yolov5', 'yolov5s')
modelC.load_state_dict(sdC)
# Save the merged model
torch.save(modelC.state_dict(), "./models/merge.pt")
# Load the merged model
merged_model = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/merge.pt')
And here is the traceback that came out:
Using cache found in C:\Users\Davide/.cache\torch\hub\ultralytics_yolov5_master
YOLOv5 2023-8-16 Python-3.10.0 torch-2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1050 Ti, 4096MiB)
Fusing layers...
Model summary: 157 layers, 7020913 parameters, 0 gradients, 15.8 GFLOPs
Adding AutoShape...
Using cache found in C:\Users\Davide/.cache\torch\hub\ultralytics_yolov5_master
YOLOv5 2023-8-16 Python-3.10.0 torch-2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1050 Ti, 4096MiB)
Fusing layers...
Model summary: 157 layers, 7020913 parameters, 0 gradients, 15.8 GFLOPs
Adding AutoShape...
Using cache found in C:\Users\Davide/.cache\torch\hub\ultralytics_yolov5_master
YOLOv5 2023-8-16 Python-3.10.0 torch-2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1050 Ti, 4096MiB)
Fusing layers...
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients
Adding AutoShape...
Traceback (most recent call last):
File "E:\TirocinioVero\text_and_drive\Terza_Fase\Progetto\UserA\prova.py", line 18, in <module>
modelC.load_state_dict(sdC)
File "C:\Users\Davide\.virtualenvs\Progetto-ca1l6rVE\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for AutoShape:
size mismatch for model.model.model.24.m.0.weight: copying a param with shape torch.Size([27, 128, 1, 1]) from checkpoint, the shape in current model is torch.Size([255, 128, 1, 1]).
size mismatch for model.model.model.24.m.0.bias: copying a param with shape torch.Size([27]) from checkpoint, the shape in current model is torch.Size([255]).
size mismatch for model.model.model.24.m.1.weight: copying a param with shape torch.Size([27, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([255, 256, 1, 1]).
size mismatch for model.model.model.24.m.1.bias: copying a param with shape torch.Size([27]) from checkpoint, the shape in current model is torch.Size([255]).
size mismatch for model.model.model.24.m.2.weight: copying a param with shape torch.Size([27, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([255, 512, 1, 1]).
size mismatch for model.model.model.24.m.2.bias: copying a param with shape torch.Size([27]) from checkpoint, the shape in current model is torch.Size([255]).
For reference, I also include here the command I used for training modelA and modelB:
python .\yolov5\train.py --epochs 150 --batch-size -1 --img 384 --data completo384.yaml --weights yolov5s.pt
@gazzadi I apologize for the confusion. It seems that the torch.hub.load('ultralytics/yolov5', 'custom') method might not work for loading a YOLOv5 model without weights. Thank you for finding an alternative solution using torch.hub.load('ultralytics/yolov5', 'yolov5s') to load the default yolov5s model.
Regarding the mismatched number of layers between the models, this can happen if the structure of the models (e.g., the number of layers) has been modified during training. The merged model should have the same structure as the base model you used for training (yolov5s.pt). Make sure that the models you are merging were trained with the same base model and haven't undergone any modifications.
Regarding the error message you received (size mismatch for ...), it suggests that the size of the layers in the merged state dictionary sdC does not match the size of the equivalent layers in the yolov5s model. This could happen if the models being merged have different architecture configurations.
To address this issue, you may need to modify the merging process to handle the differences in layer sizes between the models. This could involve resizing or reshaping the weights appropriately.
Please note that modifying the YOLOv5 codebase, such as altering layer sizes, can lead to unexpected behavior or loss of accuracy. It's recommended to use the same base model architecture and weights for training and merging to ensure compatibility.
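For context on the specific numbers in your traceback: each YOLOv5 detection-head convolution outputs 3 * (num_classes + 5) channels per scale, so 255 corresponds to the stock 80-class yolov5s, while 27 corresponds to a model trained with 4 classes. Before averaging, it can help to confirm that the two trained checkpoints themselves agree in shape; a minimal sketch, reusing the paths from your earlier snippet:
import torch

# Load both trained checkpoints through the YOLOv5 hub loader
modelA = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/user.pt')
modelB = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/central.pt')
sdA, sdB = modelA.state_dict(), modelB.state_dict()

# Report any parameter whose shape differs; averaging only makes sense if nothing is printed
for key in sdA:
    if key in sdB and sdA[key].shape != sdB[key].shape:
        print(f'shape mismatch at {key}: {tuple(sdA[key].shape)} vs {tuple(sdB[key].shape)}')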
If the issue persists, please provide more information about the exact steps and configurations you used for training modelA and modelB, including any modifications or differences from the base yolov5s model.
Thank you for the information.
I haven't made any changes to the base model when I created modelA and modelB. The two are, in fact, the same model created in two different directories, but with the same version of yolov5s as base weights. I haven't modified or altered the files when I downloaded YOLO, and I've only used the weights to train my model. The only change I can think of is the number of classes, but I thought that was not correlated with the layers of the model.
I've tried to replicate the setup in a new environment. Here are all the steps I followed:
Downloaded, from the latest version "v7", the yolov5s.pt model and the source code for the requirements
Created a virtual environment with pipenv in a new directory
Installed the requirements needed for the model to work correctly in the virtual environment, with the command pipenv install --requirements ./yolov5-7.0
Installed the correct version of PyTorch for working with CUDA, with the command pipenv install torch torchvision torchaudio --index https://download.pytorch.org/whl/cu117
Trained ModelA: python .\yolov5-7.0\train.py --epochs 150 --batch-size -1 --img 384 --data modelA.yaml --name modelA --weights yolov5s.pt --project train
Trained ModelB: python .\yolov5-7.0\train.py --epochs 150 --batch-size -1 --img 384 --data modelB.yaml --name modelB --weights yolov5s.pt --project train
At the start of the training, I received this warning:
C:\Users\Davide\.virtualenvs\EnvTest-85QvB3Ib\lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if pd.api.types.is_categorical_dtype(vector):
C:\Users\Davide\.virtualenvs\EnvTest-85QvB3Ib\lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
And this traceback at the end:
Traceback (most recent call last):
File "C:\Users\Davide\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
Exception in thread Thread-18 (plot_images):
Traceback (most recent call last):
File "C:\Users\Davide\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "C:\Users\Davide\AppData\Local\Programs\Python\Python310\lib\threading.py", line 946, in run
self.run()
File "C:\Users\Davide\AppData\Local\Programs\Python\Python310\lib\threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "E:\TirocinioVero\text_and_drive\Terza_Fase\EnvTest\yolov5-7.0\utils\plots.py", line 305, in plot_images
File "E:\TirocinioVero\text_and_drive\Terza_Fase\EnvTest\yolov5-7.0\utils\plots.py", line 305, in plot_images
annotator.box_label(box, label, color=color)
File "E:\TirocinioVero\text_and_drive\Terza_Fase\EnvTest\yolov5-7.0\utils\plots.py", line 91, in box_label
annotator.box_label(box, label, color=color)
File "E:\TirocinioVero\text_and_drive\Terza_Fase\EnvTest\yolov5-7.0\utils\plots.py", line 91, in box_label
w, h = self.font.getsize(label) # text width, height
AttributeError: 'FreeTypeFont' object has no attribute 'getsize'
w, h = self.font.getsize(label) # text width, height
AttributeError: 'FreeTypeFont' object has no attribute 'getsize'
But it seems the training had gone well anyway.
I then started the merging program:
merge.py
import torch
from torch import nn
modelA = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/modelA.pt')
sdA = modelA.state_dict()

modelB = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/modelA.pt')
sdB = modelB.state_dict()

sdC = {}
for key in sdA:
    sdC[key] = (sdA[key] + sdB[key]) / 2

modelC = torch.hub.load('ultralytics/yolov5', 'custom', "./yolov5s.pt")
modelC.load_state_dict(sdC)
torch.save(modelC.state_dict(), "./models/merge.pt")
merged_model = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/merge.pt')
Traceback
Using cache found in C:\Users\Davide/.cache\torch\hub\ultralytics_yolov5_master
WARNING invalid check_version(5.9.5, ) requested, please check values.
YOLOv5 2023-8-16 Python-3.10.0 torch-2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1050 Ti, 4096MiB)
Fusing layers...
Model summary: 157 layers, 7020913 parameters, 0 gradients, 15.8 GFLOPs
Adding AutoShape...
Using cache found in C:\Users\Davide/.cache\torch\hub\ultralytics_yolov5_master
WARNING invalid check_version(5.9.5, ) requested, please check values.
YOLOv5 2023-8-16 Python-3.10.0 torch-2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1050 Ti, 4096MiB)
Fusing layers...
Model summary: 157 layers, 7020913 parameters, 0 gradients, 15.8 GFLOPs
Adding AutoShape...
Using cache found in C:\Users\Davide/.cache\torch\hub\ultralytics_yolov5_master
WARNING invalid check_version(5.9.5, ) requested, please check values.
YOLOv5 2023-8-16 Python-3.10.0 torch-2.0.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1050 Ti, 4096MiB)
Fusing layers...
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients
Adding AutoShape...
Traceback (most recent call last):
File "E:\TirocinioVero\text_and_drive\Terza_Fase\EnvTest\merge.py", line 18, in
The error is the same, but I don't know why the layers of my trained models have changed.
Sorry for the long comment, but I wanted to explain the whole situation better, and thanks for the time @glenn-jocher has already put into this thread.
@gazzadi thank you for providing detailed information about your issue.
Based on the information you provided, it seems that you have trained two models, ModelA and ModelB, in two different directories, both starting from the same yolov5s base weights. You mentioned that the only change you made was the number of classes.
To replicate your environment, you followed these steps:
python .\yolov5-7.0\train.py --epochs 150 --batch-size -1 --img 384 --data modelA.yaml --name modelA --weights yolov5s.pt --project train
python .\yolov5-7.0\train.py --epochs 150 --batch-size -1 --img 384 --data modelB.yaml --name modelB --weights yolov5s.pt --project train
Based on the error message you received during the merging process, it seems that there are size mismatches between the layers of ModelA and ModelB and the layers of the base model. This can happen if the number of classes or the structure of the model has changed.
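One way to sidestep both the class-count mismatch and the state_dict-only save issue is to average into a deep copy of one of your trained models (so the target architecture matches by construction) and to save the result in the same dictionary layout that YOLOv5 training checkpoints use, which is what the hub 'custom' loader expects. This is only a sketch under those assumptions, not an officially supported workflow, and the paths are illustrative, following the naming of your training runs:
import copy
import torch

# Load both trained models through the YOLOv5 hub loader
modelA = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/modelA.pt')
modelB = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/modelB.pt')
sdA, sdB = modelA.state_dict(), modelB.state_dict()

# Average into a deep copy of modelA so every layer shape matches exactly
modelC = copy.deepcopy(modelA)
sdC = {key: (sdA[key] + sdB[key]) / 2 for key in sdA}
modelC.load_state_dict(sdC)

# Save in the checkpoint layout YOLOv5 expects: a dict whose 'model' entry holds the network itself,
# here the underlying DetectionModel unwrapped from the AutoShape/DetectMultiBackend wrappers
torch.save({'model': modelC.model.model}, './models/merge.pt')

# The merged file can then be reloaded like any custom checkpoint
merged_model = torch.hub.load('ultralytics/yolov5', 'custom', path='./models/merge.pt')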
To further investigate the issue and provide a solution, it would be helpful to have access to the specific files and code you used for training and merging the models. Additionally, it would be useful to know which version of YOLOv5 you are using.
Please provide these details, and I'll be happy to assist you further in resolving the issue.
Thank you, I hope I'm providing the correct information.
Version: I'm using yolov5s, from version 7 of YOLOv5.
I've done the training directly from the console, and I haven't modified the files in the source code.
The files I've worked with to create this example are those and no others.
@gazzadi thank you for providing the additional information.
Based on your explanation, it seems that you trained two models, ModelA and ModelB, using the yolov5s base model. The training was done directly from the console without modifying any files in the source code. The files you used for this training example are the ones you mentioned and no others.
To better understand the issue you are experiencing, it would be helpful to have access to the specific files and code you used for training and merging the models. Furthermore, please confirm that you are using version 7 of YOLOv5 and yolov5s as the base model.
With this additional information, I will be able to assist you further in troubleshooting and resolving the issue.
Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO and Vision AI!
Question
Hi, I'm working on a project that trains two different models on two different clients (A and B). The models have the same classes and structure. Every n steps, A shares the weights from his model with B. When B receives A's weights, he wants to merge them into one new model with a simple average of the two.
This is the code that creates the merge, and it seems to work fine: it creates a new state_dict with the average of the previous two.
The code works fine until the last torch.hub.load(), when I try to load the model just created from the merge. The traceback returned is the following:
I've tried to use torch.load, but it doesn't work when loading A and B. I've analysed the weights files and noticed that the beginning of A and B differs from the one I saved with torch.save(). I think the problem is in what object I save, but online I've seen others using this method to create a new average weights file.
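One detail worth checking here (an observation based on how YOLOv5 stores weights, not a confirmed diagnosis of your setup): files produced by train.py are dictionaries whose 'model' entry holds the full pickled network, so torch.load only succeeds on them when the yolov5 source is importable, while torch.save(model.state_dict(), ...) writes a flat mapping of tensors that the hub 'custom' loader cannot rebuild a model from. A quick way to compare the two formats (run from inside the cloned yolov5 directory; paths are illustrative):
import torch

# A training checkpoint: a dict with entries such as 'model', 'ema', 'epoch', 'optimizer', ...
ckpt = torch.load('./models/user.pt', map_location='cpu')
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else None)

# A file written with torch.save(model.state_dict(), ...): just an ordered mapping of parameter tensors
merged = torch.load('./models/merge.pt', map_location='cpu')
print(type(merged), len(merged))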