ultralytics / yolov3

YOLOv3 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Transfer Learning with differing number of classes #152

Closed hxy1051653358 closed 5 years ago

hxy1051653358 commented 5 years ago

I trained on the VOC dataset myself and now want to train on a new dataset starting from my own weights. The number of classes is different, and the following errors occur when I resume training:

Traceback (most recent call last):
  File "/home/hxy/yolov3-pytorch-annotation/train1.py", line 204, in <module>
    var=opt.var,
  File "/home/hxy/yolov3-pytorch-annotation/train1.py", line 49, in train
    model.load_state_dict(checkpoint['model'])
  File "/home/hxy/anaconda3/envs/py1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Darknet:
    size mismatch for module_list.104.conv_104.weight: copying a param with shape torch.Size([75, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([30, 1024, 1, 1]).
    size mismatch for module_list.104.conv_104.bias: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([30]).
    size mismatch for module_list.116.conv_116.weight: copying a param with shape torch.Size([75, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([10, 512, 1, 1]).
    size mismatch for module_list.116.conv_116.bias: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([10]).
    size mismatch for module_list.128.conv_128.weight: copying a param with shape torch.Size([75, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([30, 256, 1, 1]).
    size mismatch for module_list.128.conv_128.bias: copying a param with shape torch.Size([75]) from checkpoint, the shape in current model is torch.Size([30]).

How can I solve it?

gabrielloye commented 5 years ago

Hi @hxy1051653358, I'm not sure this is best practice, but what I did was remove the weights of the YOLO output layers from my saved weights. The output layers in your new model and in the old weights have different sizes (as the error message describes), so they cannot be loaded. Adding the lines below where the training script loads your weights drops the mismatched entries and loads the rest:

if resume:
    checkpoint = torch.load(latest, map_location='cpu')
    mod_weights = removekey(checkpoint['model'], [
        'module_list.104.conv_104.weight', 'module_list.104.conv_104.bias',
        'module_list.116.conv_116.weight', 'module_list.116.conv_116.bias',
        'module_list.128.conv_128.weight', 'module_list.128.conv_128.bias'])
    model.load_state_dict(mod_weights, strict=False)

hxy1051653358 commented 5 years ago

@gabrielloye Thanks for your guidance, I will try your approach

hxy1051653358 commented 5 years ago

@gabrielloye How can I define removekey?

gabrielloye commented 5 years ago

@hxy1051653358 Oh I forgot to include that part as well, my bad. Here:

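def removekey(d, listofkeys):
    """Return a copy of dict d without the given keys."""
    r = dict(d)  # shallow copy, so the loaded checkpoint itself is not modified
    for key in listofkeys:
        print('key: {} is removed'.format(key))
        r.pop(key)  # raises KeyError if the key is not in the checkpoint
    return r

perry0418 commented 5 years ago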

Make sure your cfg file name ends with 'yolov3.cfg'; train.py will then load the darknet53 weights for transfer learning. See train.py line 58: if cfg.endswith('yolov3.cfg'): cutoff = load_darknet_weights(model, weights + 'darknet53.conv.74')

100330706 commented 5 years ago

@gabrielloye So with your method you use the old weights (randomly initialized or whatever) for layers 104, 116 and 128 whereas the rest of the network uses the new transferred weights, isn't it? Or does this remove any layer?

gabrielloye commented 5 years ago

@100330706 Nope, it doesn't remove any layer. I first load the transferred weights and remove the entries for the YOLO layers (104, 116 and 128 in this case). I then load the old weights that fit the model (which ones depends on the cfg you're using) and call .update() on the model's state dict with the transferred weights. This copies every transferred layer over the old weights except the ones removed earlier. Finally, the model is loaded with this "new" set of weights. The code snippet below should work when you add it to train.py:

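# Keys of the conv layers that feed the YOLO layers (104, 116, 128 here; depends on your cfg)
yolo_keys = ['module_list.%d.conv_%d.%s' % (i, i, p) for i in (104, 116, 128) for p in ('weight', 'bias')]
checkpoint = torch.load(latest, map_location=device)  # load checkpoint
mod_weights = removekey(checkpoint['model'], yolo_keys)  # drop the mismatched layers
load_darknet_weights(model, weights + 'darknet53.conv.74')  # weights that fit the model
model_dict = model.state_dict()
model_dict.update(mod_weights)  # overwrite with the transferred weights that were kept
model.load_state_dict(model_dict)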

Since you're doing transfer learning, remember to change the number in the line below to match your config as well:

# Transfer learning (train only YOLO layers)
for i, (name, p) in enumerate(model.named_parameters()):
    p.requires_grad = p.shape[0] == 30

Note: I'm using the latest version, and I had to comment out the scheduler to make this work.
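
The value 30 in the snippet above is the output-channel count of the convolution feeding each YOLO layer. In YOLOv3 that count is anchors per scale × (5 + number of classes), so it can be derived rather than hard-coded. A minimal sketch, assuming the standard three anchors per scale; `yolo_filters` and `classes_from_filters` are hypothetical helpers, not part of the repo:

```python
def yolo_filters(num_classes, anchors_per_scale=3):
    """Output channels of the conv layer feeding a YOLOv3 detection layer.

    Each anchor predicts 4 box coordinates, 1 objectness score and one
    score per class.
    """
    return anchors_per_scale * (5 + num_classes)


def classes_from_filters(filters, anchors_per_scale=3):
    """Inverse of yolo_filters; raises if the count is not a valid YOLOv3 size."""
    per_anchor, remainder = divmod(filters, anchors_per_scale)
    if remainder or per_anchor <= 5:
        raise ValueError("%d is not a valid YOLOv3 filter count" % filters)
    return per_anchor - 5


print(yolo_filters(20))          # 75, the VOC checkpoint in the first traceback
print(classes_from_filters(30))  # 5
```

By the same formula, 255 corresponds to the 80 COCO classes and 18 to a single class. The size 10 reported for layer 116 in the first traceback does not fit the formula with three anchors, which may point to an inconsistent filters value in that cfg.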

Ownmarc commented 5 years ago

@gabrielloye I'm getting this error after making the changes you suggested:

File "C:\Users\marcp\Anaconda3\envs\yolov3\lib\site-packages\torch\optim\sgd.py", line 101, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The size of tensor a (255) must match the size of tensor b (78) at non-singleton dimension 0

Did this happen to you ?
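
The shapes in this traceback (255 versus 78) suggest that the momentum buffers restored from the checkpoint's optimizer state still match the old 80-class layers, while the gradients come from the new 21-class layers. A minimal sketch of one way to avoid that, assuming the checkpoint is a plain dict with `model`, `optimizer` and `best_loss` entries as in this thread; `strip_training_state` is a hypothetical helper:

```python
def strip_training_state(checkpoint):
    """Return a shallow copy of checkpoint that keeps the weights but
    drops state tied to the old run (optimizer buffers, best loss)."""
    cleaned = dict(checkpoint)
    cleaned["optimizer"] = None     # momentum buffers have the old layer shapes
    cleaned.pop("best_loss", None)  # the old loss is not comparable to the new task
    return cleaned


old = {"model": {"conv.weight": [0.1, 0.2]}, "optimizer": {"state": {}}, "best_loss": 1.23}
new = strip_training_state(old)

print(new["optimizer"])    # None
print("best_loss" in new)  # False
print(old["optimizer"])    # {'state': {}}  (the original dict is untouched)
```

A training script that guards the restore with a check such as `if checkpoint['optimizer'] is not None:` would then skip it and start with a fresh optimizer.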

Edit: I commented out the following and now it's working:

if checkpoint['optimizer'] is not None:
    optimizer.load_state_dict(checkpoint['optimizer'])
    # best_loss = checkpoint['best_loss']

glenn-jocher commented 5 years ago

@hxy1051653358 @gabrielloye @perry0418 @100330706 @Ownmarc the latest commit should handle transfer learning for various class sizes automatically using a new --transfer flag in train.py.

Transfer learning is performed only on the YOLO layers of yolov3.pt, and these YOLO layers may now be any size specified in your *.cfg file. Note that you first need to download yolov3.pt from our Google Drive folder (https://github.com/ultralytics/yolov3#pretrained-weights) to your yolov3/weights/ directory.
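
One way to get this kind of behaviour without listing layer names by hand is to keep only the checkpoint tensors whose shapes match the freshly built model, so that the resized YOLO layers keep their new initialisation. The sketch below illustrates the idea and is not necessarily how `--transfer` is implemented; `filter_by_shape` is a hypothetical helper, and `FakeTensor` only stands in for `torch.Tensor` so the example runs without PyTorch:

```python
from collections import namedtuple

# Stand-in for torch.Tensor: only the .shape attribute matters here.
FakeTensor = namedtuple("FakeTensor", ["shape"])


def filter_by_shape(checkpoint_state, model_state):
    """Return (kept, skipped): entries of checkpoint_state that fit model_state.

    An entry is kept only if the model has a parameter with the same name
    and the same shape; every other name is reported in `skipped`.
    """
    kept, skipped = {}, []
    for name, tensor in checkpoint_state.items():
        target = model_state.get(name)
        if target is not None and tuple(target.shape) == tuple(tensor.shape):
            kept[name] = tensor
        else:
            skipped.append(name)
    return kept, skipped


checkpoint_state = {
    "module_list.0.conv_0.weight": FakeTensor((32, 3, 3, 3)),
    "module_list.104.conv_104.weight": FakeTensor((75, 1024, 1, 1)),
    "module_list.104.conv_104.bias": FakeTensor((75,)),
}
model_state = {
    "module_list.0.conv_0.weight": FakeTensor((32, 3, 3, 3)),
    "module_list.104.conv_104.weight": FakeTensor((30, 1024, 1, 1)),
    "module_list.104.conv_104.bias": FakeTensor((30,)),
}

kept, skipped = filter_by_shape(checkpoint_state, model_state)
print(sorted(kept))     # ['module_list.0.conv_0.weight']
print(sorted(skipped))  # the two conv_104 entries
```

With real tensors, `kept` could then be passed to `model.load_state_dict(kept, strict=False)`, as in the snippets earlier in this thread.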

Here is a transfer learning example with a single class (18-length YOLO layers) using yolov3-1cls.cfg and coco_1cls.data, which are now added to the repo. coco_1cls.data points to coco_1cls.txt for training and testing; that file is available in the Google Drive folder and can be placed in your coco folder to follow our 1-class tutorial: https://github.com/ultralytics/yolov3/wiki/Example:-Train-Single-Class.

Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-1cls.cfg', data_cfg='data/coco_1cls.data', dist_url='tcp://127.0.0.1:9999', epochs=270, img_size=416, multi_scale=False, num_workers=4, rank=0, resume=False, transfer=True, world_size=1)

Using cpu 

layer                                     name  gradient   parameters                shape         mu      sigma
    0                          0.conv_0.weight     False          864        [32, 3, 3, 3]  -8.67e-05      0.112
    1                    0.batch_norm_0.weight     False           32                 [32]      0.538        0.3
    2                      0.batch_norm_0.bias     False           32                 [32]          0          0
    3                          1.conv_1.weight     False        18432       [64, 32, 3, 3]   0.000231      0.034
...
  218                104.batch_norm_104.weight     False          256                [256]      0.519      0.286
  219                  104.batch_norm_104.bias     False          256                [256]          0          0
  220                      105.conv_105.weight      True         4608      [18, 256, 1, 1]   0.000118     0.0359
  221                        105.conv_105.bias      True           18                 [18]    0.00203     0.0387
Model Summary: 222 layers, 6.15237e+07 parameters, 32310 gradients

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   0/269         0/0      2.11      6.65       140         0       148        17      6.06
      Image      Total          P          R        mAP
Calculating mAP: 100%|██████████| 1/1 [00:05<00:00,  5.50s/it]
          5          5          0          0          0

vivian-wong commented 5 years ago

When I tried transfer learning with yolov3-1cls.cfg and coco_1cls.data, I got the following error.

python train.py --cfg cfg/yolov3-1cls.cfg --data-cfg data/coco_1cls.data --transfer
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-1cls.cfg', data_cfg='data/coco_1cls.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=True, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11171MB)
Traceback (most recent call last):
  File "train.py", line 250, in <module>
    num_workers=opt.num_workers
  File "train.py", line 58, in train
    strict=False)
  File "/home/vivian/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Darknet:
    size mismatch for module_list.84.conv_84.weight: copying a param with shape torch.Size([512, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 512, 1, 1]).
    size mismatch for module_list.84.batch_norm_84.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
    size mismatch for module_list.84.batch_norm_84.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
    size mismatch for module_list.84.batch_norm_84.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
    size mismatch for module_list.84.batch_norm_84.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
    size mismatch for module_list.87.conv_87.weight: copying a param with shape torch.Size([1024, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 768, 1, 1]).
    size mismatch for module_list.87.batch_norm_87.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
    size mismatch for module_list.87.batch_norm_87.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
    size mismatch for module_list.87.batch_norm_87.running_mean: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
    size mismatch for module_list.87.batch_norm_87.running_var: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
    size mismatch for module_list.96.conv_96.weight: copying a param with shape torch.Size([256, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 256, 1, 1]).
    size mismatch for module_list.96.batch_norm_96.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
    size mismatch for module_list.96.batch_norm_96.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
    size mismatch for module_list.96.batch_norm_96.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
    size mismatch for module_list.96.batch_norm_96.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
    size mismatch for module_list.99.conv_99.weight: copying a param with shape torch.Size([512, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 384, 1, 1]).
    size mismatch for module_list.99.batch_norm_99.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
    size mismatch for module_list.99.batch_norm_99.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
    size mismatch for module_list.99.batch_norm_99.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
    size mismatch for module_list.99.batch_norm_99.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).

How should I go about fixing this? Thank you!

glenn-jocher commented 5 years ago

@vivian-wong yolov3-1cls.cfg is a 1-class derivative of yolov3.cfg, but yolov3-spp.pt, a newer and better-performing variant of yolov3, was being loaded by default. I've changed the default back to yolov3.pt, so it should work if you git pull and retry.

Personally, I would not use transfer learning, though: it doesn't save you much time, and you will get better results training normally from darknet53.

vivian-wong commented 5 years ago

Thank you for modifying the default. I am using transfer learning because I would like to train on a smaller dataset which has just one class. I have configured my *.data file as indicated in the tutorial. Now I get:

python train.py --cfg cfg/yolov3-1cls.cfg --data-cfg data/mydata.data --transfer
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-1cls.cfg', data_cfg='data/mydata.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=True, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11171MB)
Traceback (most recent call last):
  File "train.py", line 250, in <module>
    num_workers=opt.num_workers
  File "train.py", line 56, in train
    chkpt = torch.load(weights + 'yolov3.pt', map_location=device)
  File "/home/vivian/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/home/vivian/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 549, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 4417085 more bytes. The file might be corrupted.
*** Error in `python': corrupted double-linked list: 0x000055bd3f4cc7e0 ***
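
An `unexpected EOF` from `torch.load` typically means the weights file is truncated, for example after an interrupted download. A small pre-flight check can catch that before training starts. This is only a sketch: `check_weights_file` is a hypothetical helper, and the expected size or SHA-256 digest would have to come from wherever the file was published:

```python
import hashlib
import os


def check_weights_file(path, expected_size=None, expected_sha256=None):
    """Return a list of problems found with the file at `path` (empty if none)."""
    if not os.path.isfile(path):
        return ["file not found: %s" % path]
    problems = []
    size = os.path.getsize(path)
    if size == 0:
        problems.append("file is empty")
    if expected_size is not None and size != expected_size:
        problems.append("size is %d bytes, expected %d" % (size, expected_size))
    if expected_sha256 is not None:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Hash in 1 MiB chunks so large weight files are not read at once.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected_sha256:
            problems.append("SHA-256 mismatch")
    return problems
```

If the returned list is not empty, downloading the file again is usually the quickest fix.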

I also read your comment that

@hac135 most people don't realize this, and it's not the recommended method to go about things, but you can technically use the existing YOLOv3 architecture (and hence the pretrained yolov3.pt) to train any model with n<=80 classes with no changes. The unused conf outputs will learn to simply default to zero, and the rest of the unused outputs (the box and class conf associated with those unused classes) will no longer matter.

For example, our single class tutorial operates just as well with no modifications to the cfg file: https://github.com/ultralytics/yolov3/wiki/Example:-Train-Single-Class

It's not clean and it's not optimal, but it works.

So I tried running python train.py --data-cfg data/mydata.data --transfer, which uses the default yolov3-spp.cfg. It worked (though with pretty bad results...). But this should do the job, right?

MuhammadAsadJaved commented 5 years ago

Hi, I have the same error while using MobileNet-YOLO-V3 with Caffe. I am using this repo: https://github.com/eric612/MobileNet-YOLO

Here are my error details. Model: yolov3 darknet_yolov3

Question: I have a model pre-trained on 80 classes, and I am now using it to retrain on 2 classes. I have made the necessary changes (classes and output) in yolov3_train.prototxt, yolov3_test.prototxt and solver.prototxt, but when I run train_yolov3.sh it throws the following error. Maybe the error occurs because the previous weights are for 80 classes. Can I use these weights to retrain the model on 2 classes? Here is the error output:

F0916 19:53:43.311493 4202 net.cpp:760] Cannot copy param 0 weights from layer 'layer82-conv'; shape mismatch. Source param shape is 255 1024 1 1 (261120); target param shape is 21 1024 1 1 (21504). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.
Check failure stack trace:
    @ 0x7fdf3929b0cd google::LogMessage::Fail()
    @ 0x7fdf3929cf33 google::LogMessage::SendToLog()
    @ 0x7fdf3929ac28 google::LogMessage::Flush()
    @ 0x7fdf3929d999 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fdf39a64594 caffe::Net<>::CopyTrainedLayersFrom()
    @ 0x7fdf39a67645 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
    @ 0x7fdf39a772be caffe::LoadNetWeights<>()
    @ 0x7fdf39a798b0 caffe::Solver<>::InitTrainNet()
    @ 0x7fdf39a79e34 caffe::Solver<>::Init()
    @ 0x7fdf39a7a11f caffe::Solver<>::Solver()
    @ 0x7fdf39a9cd31 caffe::Creator_SGDSolver<>()
    @ 0x564ec97ce4d2 train()
    @ 0x564ec97cacc5 main
    @ 0x7fdf37fdfb97 __libc_start_main
    @ 0x564ec97cb63a _start
Aborted (core dumped)

Any suggestions to resolve this issue?

glenn-jocher commented 5 years ago

@MuhammadAsadJaved your issue should be posted on the relevant repo, not this one.

MuhammadAsadJaved commented 5 years ago

@glenn-jocher Thank you for your advice. I also posted there but there was no response. So I post here as well to find some help because the issue is similar.

glenn-jocher commented 5 years ago

@MuhammadAsadJaved all transfer learning works correctly in this repo. See https://github.com/ultralytics/yolov3/wiki/Example:-Transfer-Learning

jayant3297 commented 3 years ago

I want to do face detection using YOLOv3 as the model and the Darknet object detection pre-trained weights taken from https://pjreddie.com/darknet/yolo/, but I am unable to train it for the single class "face". Can anyone help me?

MuhammadAsadJaved commented 3 years ago

What is the problem? Post your error

MuhammadAsadJaved commented 3 years ago

Post your procedure and error

jayant3297 commented 3 years ago

The bounding boxes I am getting are around the whole image and not around the face

pankaja0285 commented 2 years ago

Is there any working example/code for adding additional classes (in my case 4) to a pretrained yolov4 model? I trained it for 20 classes with the darknet framework; the weights are saved every 10000 steps, so in all I have 4 weight files saved. I have seen bits and pieces of code about passing the number of layers to freeze (in my case 20), but what are the next steps after that? I assume I add the new classes, train for maybe 100 iterations, stop and save the weights, and then unfreeze all 20 layers and retrain on all 24 classes. If there is a working example, or if someone can help me with code and some pseudocode, that would be helpful.

TIA

glenn-jocher commented 2 years ago

@pankaja0285 for darknet training you probably want to head over to https://github.com/AlexeyAB/darknet

pankaja0285 commented 2 years ago

I already checked there, not much help @glenn-jocher. Hence asking here if someone can shed some light.

It's not that I need help with darknet; I need help with the transfer learning for additional classes. I can probably convert the darknet-trained yolov4 weights to PyTorch format and then proceed. It's how to proceed after that point that I need help with.

NOTE: FYI, I trained the 20 classes of a VOC dataset.

glenn-jocher commented 2 years ago

@pankaja0285 VOC training is very simple with YOLOv5. All models and datasets download automatically:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

python train.py --data VOC.yaml --weights yolov5s.pt

pankaja0285 commented 2 years ago

@glenn-jocher I think there's a miscommunication. I already trained for 20 classes and have the yolov4 weights for that; now I want to know how to add additional classes (4 in my case). From what I have read so far, I need to pass the layers to freeze in the --freeze parameter. What I am asking is what happens after that: is everything done behind the scenes?

Or for e.g. in your ultralytics repo - train.py how does it go about doing the transfer learning? What do I need to do?

Again this is for my own additional dataset that contains the 4 classes.

glenn-jocher commented 2 years ago

@pankaja0285 YOLOv5 automatically handles class differences. Starting training from any other pretrained weights is the default workflow, and no action is required on your part. For example, the command below trains a 20-class model starting from 80-class COCO weights:

python train.py --data VOC.yaml --weights yolov5s.pt

pankaja0285 commented 2 years ago

Ok,

glenn-jocher commented 2 years ago

@pankaja0285 see https://docs.ultralytics.com/yolov5/tutorials/train_custom_data to get started

pankaja0285 commented 2 years ago

Also, I have an NVIDIA GPU with CUDA and cuDNN set up and installed. How do I run the training on the GPU? I guess there is a specific flag that I have to pass to python train.py?

pankaja0285 commented 2 years ago

Regarding "see https://docs.ultralytics.com/yolov5/tutorials/train_custom_data to get started": agreed.

But you are still not answering my question about the additional classes: how to add them and how to train further with the existing weights. I am doing the first training run and I can see that your repo saves the best model in .pt format. My question is how to extend that model with additional classes.

Also, FYI, even though CUDA is available on my laptop, the device is not being recognized during training. Do I have to modify any configuration settings, and if so, where? Please let me know. I just started training and I get a message that says "... CUDA is not available". I am running from the PyCharm terminal.

glenn-jocher commented 7 months ago

@pankaja0285 apologies for the confusion earlier. To clarify:

  1. Training on GPU: If you have CUDA installed, PyTorch should automatically use your GPU for training. Make sure your PyTorch installation is compatible with your CUDA version. You don't need to set any specific flags; the code will default to GPU if it's available and configured correctly.

  2. Adding Additional Classes: To add more classes to an existing model, you need to modify your dataset to include the new classes and update your .yaml file accordingly. Then you can start training with the new dataset and the pre-trained weights. The model will adjust its final layer to accommodate the new number of classes.

Here's a simplified example command for continuing training with additional classes:

python train.py --data VOC_addnl_4.yaml --weights path/to/your/previous/best_model.pt

The weights will be saved in .pt format by default.

  3. CUDA Not Available: If CUDA is not being recognized, it could be due to several reasons:

    • Your PyTorch installation might not be compatible with your CUDA version.
    • Your environment variables for CUDA might not be set correctly.
    • PyCharm's terminal might not be recognizing your system's environment variables.

    To troubleshoot, try running the training script from a regular terminal or command prompt outside of PyCharm. If it works there, the issue might be with PyCharm's configuration.
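
For point 2, one practical detail is keeping class indices consistent: if the new dataset was labelled on its own, its class ids start at 0 and would collide with the first of the existing 20 classes. A minimal sketch of building the combined class list and shifting label ids, assuming YOLO-format label lines (`class x_center y_center width height`); the helper names and the four new class names are hypothetical:

```python
def merge_class_names(old_names, new_names):
    """Append new classes after the existing ones, keeping old indices stable."""
    duplicates = set(old_names) & set(new_names)
    if duplicates:
        raise ValueError("classes already present: %s" % sorted(duplicates))
    return list(old_names) + list(new_names)


def shift_label_line(line, offset):
    """Shift the class id of one YOLO-format label line by `offset`."""
    fields = line.split()
    fields[0] = str(int(fields[0]) + offset)
    return " ".join(fields)


old_names = ["class_%d" % i for i in range(20)]           # placeholder for the 20 VOC names
new_names = ["extra_a", "extra_b", "extra_c", "extra_d"]  # hypothetical new classes

names = merge_class_names(old_names, new_names)
print(len(names))                                              # 24
print(shift_label_line("2 0.5 0.5 0.25 0.1", len(old_names)))  # 22 0.5 0.5 0.25 0.1
```

The merged list would go under `names` in the dataset .yaml. The training set should still contain examples of the original 20 classes, otherwise the model may lose accuracy on them.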

If you continue to have issues, please provide more details about your setup, including the versions of CUDA, cuDNN, and PyTorch you're using, and I'll do my best to assist you further.
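
The details requested above can be collected with a short script. This is a sketch rather than an official diagnostic; `cuda_report` is a hypothetical helper, and it still produces a report when PyTorch is not installed in the active interpreter, which can happen when an IDE terminal uses a different environment:

```python
import importlib.util
import os
import sys


def cuda_report():
    """Collect basic facts about the Python environment and PyTorch's CUDA support."""
    report = {
        "python": sys.executable,
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["torch_built_for_cuda"] = torch.version.cuda  # None for CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
    return report


for key, value in cuda_report().items():
    print("%s: %s" % (key, value))
```

Running it once from the PyCharm terminal and once from a regular shell shows whether both use the same interpreter. In YOLOv5 a device can also be requested explicitly with the `--device` argument, for example `python train.py --device 0`.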