Closed hxy1051653358 closed 5 years ago
Hi @hxy1051653358 , Not sure if this is the best practice but what I did was to remove the weights for the yolo layers from your own weights. This is because the sizes of the yolo layers in your new model and old weights do not match (as described in the error message). Adding this line after loading your own weights in the training script will remove the mismatched weights:
if resume:
    checkpoint = torch.load(latest, map_location='cpu')
    mod_weights = removekey(checkpoint['model'], ['module_list.104.conv_104.weight',
                                                  'module_list.104.conv_104.bias',
                                                  'module_list.116.conv_116.weight',
                                                  'module_list.116.conv_116.bias',
                                                  'module_list.128.conv_128.weight',
                                                  'module_list.128.conv_128.bias'])
    model.load_state_dict(mod_weights, strict=False)
@gabrielloye Thanks for your guidance, I will try your approach
@gabrielloye How can I define removekey?
@hxy1051653358 Oh I forgot to include that part as well, my bad. Here:
def removekey(d, listofkeys):
    r = dict(d)
    for key in listofkeys:
        print('key: {} is removed'.format(key))
        r.pop(key)
    return r
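A slightly more defensive variant of the helper above, as a sketch only: `pop(key)` raises `KeyError` when a key is absent (for example, if the layer indices differ for another cfg), so this version skips missing keys and reports them instead. The name `remove_keys` and the toy dictionary are illustrative assumptions, not code from the repo.

```python
def remove_keys(state_dict, keys_to_drop):
    """Return a shallow copy of state_dict without the given keys.

    Keys that are absent are collected and returned instead of raising KeyError.
    """
    filtered = dict(state_dict)  # shallow copy; the original is left untouched
    missing = []
    for key in keys_to_drop:
        if key in filtered:
            del filtered[key]
        else:
            missing.append(key)
    return filtered, missing


# Toy stand-in for checkpoint['model']; the values would normally be tensors.
toy = {
    'module_list.103.conv_103.weight': 'w103',
    'module_list.104.conv_104.weight': 'w104',
    'module_list.104.conv_104.bias': 'b104',
}
kept, missing = remove_keys(toy, ['module_list.104.conv_104.weight',
                                  'module_list.104.conv_104.bias',
                                  'module_list.116.conv_116.weight'])
print('kept:', sorted(kept))
print('not found:', missing)
```

The filtered dictionary can then be passed to `model.load_state_dict(..., strict=False)` exactly as in the snippet above.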
Make sure your cfg file name ends with 'yolov3.cfg'; train.py will then load the darknet53 weights for transfer learning.
See train.py, line 58: if cfg.endswith('yolov3.cfg'): cutoff = load_darknet_weights(model, weights + 'darknet53.conv.74')
@gabrielloye So with your method you use the old weights (randomly initialized or whatever) for layers 104, 116 and 128 whereas the rest of the network uses the new transferred weights, isn't it? Or does this remove any layer?
@100330706 Nope, it doesn't remove any layer, what I did was to load the transferred weights and remove the yolo layers (104, 116, 128 in this case) first. I then load the old weights that fit the model (depends on the cfg you're using) and call the .update() method on it with the transferred weights. This will transfer all the layers of the transferred weights to the old weights except the ones we removed earlier. Finally, you can load the model with this "new" set of weights. The code snippet below should be able to work when you add it in train.py
checkpoint = torch.load(latest, map_location=device) # load checkpoint
mod_weights = removekey(checkpoint['model'],[--list of layers to remove ( i.e. 104, 116, 128)--])
load_darknet_weights(model, weights + 'darknet53.conv.74')
model_dict = model.state_dict()
model_dict.update(mod_weights)
model.load_state_dict(model_dict)
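As an alternative to listing the layer names by hand, the same merge can be sketched by comparing tensor shapes, so that only checkpoint entries that fit the current model are copied. This is a minimal illustration under assumptions: `load_matching_weights` and `tiny_net` are hypothetical names, and the two-layer network is a stand-in for Darknet, not the repo's model.

```python
import torch
import torch.nn as nn


def load_matching_weights(model, checkpoint_state):
    """Copy only the checkpoint tensors whose name and shape match the model."""
    model_state = model.state_dict()
    matched = {
        name: tensor
        for name, tensor in checkpoint_state.items()
        if name in model_state and tensor.shape == model_state[name].shape
    }
    skipped = [name for name in checkpoint_state if name not in matched]
    model_state.update(matched)         # unmatched entries keep the model's own values
    model.load_state_dict(model_state)  # every key is present, so a strict load works
    return skipped


def tiny_net(out_channels):
    # Hypothetical stand-in: a shared "backbone" conv plus a head whose
    # output size depends on the number of classes.
    return nn.Sequential(nn.Conv2d(3, 8, 3), nn.Conv2d(8, out_channels, 1))


old = tiny_net(255)  # e.g. an 80-class head
new = tiny_net(18)   # e.g. a 1-class head
skipped = load_matching_weights(new, old.state_dict())
print('skipped:', skipped)  # the head's weight and bias, whose shapes differ
```

The skipped parameters keep whatever initialization the new model already had, which matches the behaviour described above for the removed YOLO layers.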
Remember to change the number in the below line according to your config as well since you're doing transfer learning:
# Transfer learning (train only YOLO layers)
for i, (name, p) in enumerate(model.named_parameters()):
    p.requires_grad = True if (p.shape[0] == 30) else False
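For reference, the number in that check is the output channel count of the convolution feeding each YOLO layer. In standard YOLOv3 cfg files this is the number of anchors per scale (3) times the number of classes plus 5 (4 box coordinates and 1 objectness score). The helper name `yolo_filters` below is only illustrative.

```python
def yolo_filters(num_classes, anchors_per_scale=3):
    """Output channels of the conv layer that feeds a YOLOv3 detection layer.

    Each anchor predicts 4 box coordinates, 1 objectness score and one
    score per class.
    """
    return anchors_per_scale * (num_classes + 5)


for n in (1, 2, 5, 80):
    print(n, 'classes ->', yolo_filters(n), 'filters')
```

So 30 corresponds to 5 classes, 255 to the 80 COCO classes, and 18 to a single class. Note that matching on `p.shape[0]` alone would also select any other parameter whose first dimension happens to equal that number.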
Note: I'm using the latest version, and I had to comment out the scheduler to make this work.
@gabrielloye getting this error following the changes you suggested :
File "C:\Users\marcp\Anaconda3\envs\yolov3\lib\site-packages\torch\optim\sgd.py", line 101, in step
buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: The size of tensor a (255) must match the size of tensor b (78) at non-singleton dimension 0
Did this happen to you ?
Edit: I commented out this and now it's working:
`if checkpoint['optimizer'] is not None:
# best_loss = checkpoint['best_loss']`
@hxy1051653358 @gabrielloye @perry0418 @100330706 @Ownmarc the latest commit should handle transfer learning for various class sizes automatically using a new --transfer flag in train.py.
Transfer learning is performed only on YOLO layers of yolov3.pt, and these YOLO layers may now be any size specified in your *.cfg file. Note that you need to download yolov3.pt first from our Google Drive folder (https://github.com/ultralytics/yolov3#pretrained-weights) to your yolov3/weights/ directory.
Here is a transfer learning example with a single class (18-length YOLO layers) using yolov3-1cls.cfg and coco_1cls.data, which are now added to the repo. coco_1cls.data points to coco_1cls.txt for training and testing, which is available in the Google Drive folder and can be placed in your coco folder to follow our 1-class tutorial: https://github.com/ultralytics/yolov3/wiki/Example:-Train-Single-Class.
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-1cls.cfg', data_cfg='data/coco_1cls.data', dist_url='tcp://127.0.0.1:9999', epochs=270, img_size=416, multi_scale=False, num_workers=4, rank=0, resume=False, transfer=True, world_size=1)
Using cpu
layer name gradient parameters shape mu sigma
0 0.conv_0.weight False 864 [32, 3, 3, 3] -8.67e-05 0.112
1 0.batch_norm_0.weight False 32 [32] 0.538 0.3
2 0.batch_norm_0.bias False 32 [32] 0 0
3 1.conv_1.weight False 18432 [64, 32, 3, 3] 0.000231 0.034
...
218 104.batch_norm_104.weight False 256 [256] 0.519 0.286
219 104.batch_norm_104.bias False 256 [256] 0 0
220 105.conv_105.weight True 4608 [18, 256, 1, 1] 0.000118 0.0359
221 105.conv_105.bias True 18 [18] 0.00203 0.0387
Model Summary: 222 layers, 6.15237e+07 parameters, 32310 gradients
Epoch Batch xy wh conf cls total nTargets time
0/269 0/0 2.11 6.65 140 0 148 17 6.06
Image Total P R mAP
Calculating mAP: 100%|██████████| 1/1 [00:05<00:00, 5.50s/it]
5 5 0 0 0
When I tried transfer learning with yolov3-1cls.cfg and coco_1cls.data, I get the following error.
python train.py --cfg cfg/yolov3-1cls.cfg --data-cfg data/coco_1cls.data --transfer
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-1cls.cfg', data_cfg='data/coco_1cls.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=True, world_size=1)
Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11171MB)
Traceback (most recent call last):
File "train.py", line 250, in <module>
num_workers=opt.num_workers
File "train.py", line 58, in train
strict=False)
File "/home/vivian/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Darknet:
size mismatch for module_list.84.conv_84.weight: copying a param with shape torch.Size([512, 2048, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 512, 1, 1]).
size mismatch for module_list.84.batch_norm_84.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module_list.84.batch_norm_84.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module_list.84.batch_norm_84.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module_list.84.batch_norm_84.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module_list.87.conv_87.weight: copying a param with shape torch.Size([1024, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 768, 1, 1]).
size mismatch for module_list.87.batch_norm_87.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module_list.87.batch_norm_87.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module_list.87.batch_norm_87.running_mean: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module_list.87.batch_norm_87.running_var: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module_list.96.conv_96.weight: copying a param with shape torch.Size([256, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([128, 256, 1, 1]).
size mismatch for module_list.96.batch_norm_96.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for module_list.96.batch_norm_96.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for module_list.96.batch_norm_96.running_mean: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for module_list.96.batch_norm_96.running_var: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for module_list.99.conv_99.weight: copying a param with shape torch.Size([512, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 384, 1, 1]).
size mismatch for module_list.99.batch_norm_99.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for module_list.99.batch_norm_99.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for module_list.99.batch_norm_99.running_mean: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
size mismatch for module_list.99.batch_norm_99.running_var: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([128]).
How should I go about fixing this? Thank you!
@vivian-wong yolov3-1cls.cfg is a 1cls derivative of yolov3.cfg, but yolov3-spp.pt was loaded as the default, which is a better performing, newer variant of yolov3. I've changed the default back to yolov3.pt, so if you git pull and retry it will work.
Personally, I would not use transfer learning though, it doesn't save you much time, and you will get better results training normally from darknet53.
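To tell an expected head-size mismatch apart from an architecture mismatch like the one above, it can help to list which parameters differ in shape before loading. The following is a sketch only: `shape_mismatches` is a hypothetical helper, and the shapes are copied from the error message in this thread.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """List (name, checkpoint_shape, model_shape) for parameters that differ."""
    return [
        (name, shape, model_shapes[name])
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and shape != model_shapes[name]
    ]


# Shapes taken from the traceback above. With real tensors they could be built
# as {k: tuple(v.shape) for k, v in state_dict.items()}.
checkpoint_shapes = {
    'module_list.84.conv_84.weight': (512, 2048, 1, 1),
    'module_list.87.conv_87.weight': (1024, 512, 3, 3),
}
model_shapes = {
    'module_list.84.conv_84.weight': (256, 512, 1, 1),
    'module_list.87.conv_87.weight': (256, 768, 1, 1),
}
mismatches = shape_mismatches(checkpoint_shapes, model_shapes)
for name, ckpt, cur in mismatches:
    print(f'{name}: checkpoint {ckpt} vs model {cur}')
```

If only the convolutions directly before the YOLO layers differ, the class count has changed; differences in other layers, as here, suggest that the weights file and the cfg describe different architectures.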
Thank you for modifying the default. I am using transfer learning because I would like to train on a smaller dataset which has just one class. I have configured my *.data file as indicated in the tutorial. Now I get:
python train.py --cfg cfg/yolov3-1cls.cfg --data-cfg data/mydata.data --transfer
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-1cls.cfg', data_cfg='data/mydata.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=True, world_size=1)
Using CUDA device0 _CudaDeviceProperties(name='GeForce GTX 1080 Ti', total_memory=11171MB)
Traceback (most recent call last):
File "train.py", line 250, in <module>
num_workers=opt.num_workers
File "train.py", line 56, in train
chkpt = torch.load(weights + 'yolov3.pt', map_location=device)
File "/home/vivian/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 368, in load
return _load(f, map_location, pickle_module)
File "/home/vivian/anaconda3/lib/python3.7/site-packages/torch/serialization.py", line 549, in _load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 4417085 more bytes. The file might be corrupted.
*** Error in `python': corrupted double-linked list: 0x000055bd3f4cc7e0 ***
I also read your comment that
@hac135 most people don't realize this, and it's not the recommended method to go about things, but you can technically use the existing YOLOv3 architecture (and hence the pretrained yolov3.pt) to train any model with n<=80 classes with no changes. The unused conf outputs will learn to simply default to zero, and the rest of the unused outputs (the box and class conf associated with those unused classes) will no longer matter. For example, our single class tutorial operates just as well with no modifications to the cfg file: https://github.com/ultralytics/yolov3/wiki/Example:-Train-Single-Class
It's not clean and it's not optimal, but it works.
So I tried doing
python train.py --data-cfg data/mydata.data --transfer
which uses the default yolov3-spp.cfg. It worked (though with pretty bad results...). But this should do the job right?
Hi, I have the same error while using MobileNet-YOLO-V3 with caffe, I am using this repo. https://github.com/eric612/MobileNet-YOLO
Here are my error details. Model : yolov3 darknet_yolov3
Question: I have a model pre-trained on 80 classes, and now I am using this model to retrain on 2 classes. I have made the necessary changes (classes and output) in yolov3_train.prototxt, yolov3_test.prototxt and solver.prototxt. But when I run the train_yolov3.sh file it throws the following error. Maybe the error occurs because the previous weights are for 80 classes; can I use these weights to retrain the model on 2 classes? Here is the error output.
F0916 19:53:43.311493 4202 net.cpp:760] Cannot copy param 0 weights from layer 'layer82-conv'; shape mismatch. Source param shape is 255 1024 1 1 (261120); target param shape is 21 1024 1 1 (21504). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.
Check failure stack trace:
@ 0x7fdf3929b0cd google::LogMessage::Fail()
@ 0x7fdf3929cf33 google::LogMessage::SendToLog()
@ 0x7fdf3929ac28 google::LogMessage::Flush()
@ 0x7fdf3929d999 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fdf39a64594 caffe::Net<>::CopyTrainedLayersFrom()
@ 0x7fdf39a67645 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
@ 0x7fdf39a772be caffe::LoadNetWeights<>()
@ 0x7fdf39a798b0 caffe::Solver<>::InitTrainNet()
@ 0x7fdf39a79e34 caffe::Solver<>::Init()
@ 0x7fdf39a7a11f caffe::Solver<>::Solver()
@ 0x7fdf39a9cd31 caffe::Creator_SGDSolver<>()
@ 0x564ec97ce4d2 train()
@ 0x564ec97cacc5 main
@ 0x7fdf37fdfb97 __libc_start_main
@ 0x564ec97cb63a _start
Aborted (core dumped)
any suggestions to resolve this issue?
@MuhammadAsadJaved your issue should be posted on the relevant repo, not this one.
@glenn-jocher Thank you for your advice. I also posted there but there was no response. So I post here as well to find some help because the issue is similar.
@MuhammadAsadJaved all transfer learning works correctly in this repo. See https://github.com/ultralytics/yolov3/wiki/Example:-Transfer-Learning
I want to do face detection using YOLOv3 as the model and the Darknet object detection pre-trained weights, taken from https://pjreddie.com/darknet/yolo/, but I am unable to train it for the single class "face". Can anyone help me?
What is the problem? Post your error
Post your procedure and error
The bounding boxes I am getting are around the whole image and not around the face
Is there any working example/code for adding additional classes (in my case, 4 classes) to a pretrained yolov4 model (which I trained for 20 classes) with the darknet framework? The weights are saved every 10000 steps, so in all I have 4 weights files saved. I have seen bits and pieces of code for passing the number of layers to freeze (in my case 20). But what are the next steps after that: add the new classes, train for maybe 100 iterations, then stop and save the weights? Once that is done, I guess I have to unfreeze all 20 layers and retrain on all the classes (24). If there is a working example, or if someone can help me with mostly code and some pseudo code, that would be helpful.
TIA
@pankaja0285 for darknet training you probably want to head over to https://github.com/AlexeyAB/darknet
I already checked there, not much help @glenn-jocher. Hence asking here if someone can shed some light.
It's not that I need help with darknet. I need help with the transfer learning for additional classes. I can probably convert darknet trained yolov4 weights to say py format (pytorch) and then proceed. It's the proceed after that point, that I need help with.
NOTE: FYI, I trained the 20 classes of a VOC dataset.
@pankaja0285 VOC training is very simple with YOLOv5. All models and datasets download automatically:
git clone https://github.com/ultralytics/yolov5 # clone
cd yolov5
pip install -r requirements.txt # install
python train.py --data VOC.yaml --weights yolov5s.pt
@glenn-jocher I think there's a miscommunication. I already trained for 20 classes and have the yolov4 weights for it; now I want to know how to add additional classes (in my case, 4 additional classes). From what I have read so far, I need to provide the layers to freeze in the --freeze parameter. What I am asking is: is everything else done behind the scenes?
Or for e.g. in your ultralytics repo - train.py how does it go about doing the transfer learning? What do I need to do?
Again this is for my own additional dataset that contains the 4 classes.
@pankaja0285 YOLOv5 automatically handles class differences. Starting a training from any other pretrained weights is the default workflow, no action is required on your part. i.e. the command below trains a 20-class model starting from 80-class COCO weights:
python train.py --data VOC.yaml --weights yolov5s.pt
Ok,
Also, are the weights from the training you quoted above (python train.py --data VOC.yaml --weights yolov5s.pt) saved in .pt format? And how do I run it on a GPU?
Then, if I want to add 4 new classes to the yolov5 weights trained above, do I just give the new yaml file and the new weights? Something like this: python train.py --data VOC_addnl_4.yaml --weights new_weights
@pankaja0285 see https://docs.ultralytics.com/yolov5/tutorials/train_custom_data to get started
Also, I have an NVIDIA GPU with CUDA and cuDNN set up and installed. How do I run the training on the GPU? I guess there is a specific flag that I have to pass in the python train.py.... command.
@pankaja0285 see https://docs.ultralytics.com/yolov5/tutorials/train_custom_data to get started - Agreed,
But you are still not answering my question about the additional classes - how to add and how to further train with the existing weights. Doing the first training and I do see your repo that the best model is getting saved in .pt format. But how do I enhance the model for additional classes is my question.
Also, an FYI: even though CUDA is available on my laptop, the device is not getting recognized while training. Do I have to modify any configuration settings, or where do I need to make changes? Please let me know. I just started to train and I am getting the message that says "... CUDA is not available".... I am running from the PyCharm terminal.
@pankaja0285 apologies for the confusion earlier. To clarify:
Training on GPU: If you have CUDA installed, PyTorch should automatically use your GPU for training. Make sure your PyTorch installation is compatible with your CUDA version. You don't need to set any specific flags; the code will default to GPU if it's available and configured correctly.
Adding Additional Classes: To add more classes to an existing model, you need to modify your dataset to include the new classes and update your .yaml file accordingly. Then you can start training with the new dataset and the pre-trained weights. The model will adjust its final layer to accommodate the new number of classes.
Here's a simplified example command for continuing training with additional classes:
python train.py --data VOC_addnl_4.yaml --weights path/to/your/previous/best_model.pt
The weights will be saved in .pt format by default.
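When updating the class list for the extended dataset, one thing to watch is that the existing classes keep their indices, since the old label files refer to classes by index. A minimal sketch of that bookkeeping follows; `extend_class_names` is a hypothetical helper and the four new class names are made up for illustration.

```python
def extend_class_names(existing, additional):
    """Append new class names while keeping every existing index unchanged."""
    duplicates = [name for name in additional if name in existing]
    if duplicates:
        raise ValueError(f'classes already present: {duplicates}')
    return list(existing) + list(additional)


# The 20 Pascal VOC class names, in their usual order.
voc = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
       'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person',
       'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
# Hypothetical new classes, for illustration only.
new = ['helmet', 'forklift', 'pallet', 'cone']

names = extend_class_names(voc, new)
print(len(names), 'classes')
print('first new index:', names.index(new[0]))
```

The resulting list is what the class names in the dataset .yaml would contain. It is probably also worth keeping labelled examples of the original 20 classes in the combined training set, since a model trained only on the new classes may lose accuracy on the old ones.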
CUDA Not Available: If CUDA is not being recognized, it could be due to several reasons:
To troubleshoot, try running the training script from a regular terminal or command prompt outside of PyCharm. If it works there, the issue might be with PyCharm's configuration.
If you continue to have issues, please provide more details about your setup, including the versions of CUDA, cuDNN, and PyTorch you're using, and I'll do my best to assist you further.
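A quick way to gather those details is to ask PyTorch directly, using the same interpreter that runs train.py. This is a small diagnostic sketch; the function name `cuda_report` is an assumption, while the PyTorch calls it uses (`torch.cuda.is_available`, `torch.version.cuda`, `torch.cuda.get_device_name`) are standard.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup of this interpreter."""
    if importlib.util.find_spec('torch') is None:
        return {'torch_installed': False}
    import torch
    report = {
        'torch_installed': True,
        'torch_version': torch.__version__,
        'built_with_cuda': torch.version.cuda,  # None for CPU-only builds
        'cuda_available': torch.cuda.is_available(),
    }
    if report['cuda_available']:
        report['device_name'] = torch.cuda.get_device_name(0)
    return report


report = cuda_report()
for key, value in report.items():
    print(f'{key}: {value}')
```

If `built_with_cuda` prints `None`, the installed PyTorch is a CPU-only build and will not use the GPU regardless of the system CUDA installation. Running the script both from the PyCharm terminal and from a regular terminal shows whether the two are using different Python environments.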
I trained the voc dataset by myself and wanted to train new dataset with my own weight. The categories are different and the following errors occur during recovery training:
How can I solve it?