Closed glenn-jocher closed 3 years ago
Hi @glenn-jocher, I have a question about this. I want to change the configuration of the YOLO layers (remove some layers, change the number of filters, etc.) and apply transfer learning. In this case, is it possible to use transfer learning with the official weights? If it is possible, could you tell me how, or just give me a keyword to search for?
@jw-pyo you can do anything you want, but you have to do it, we can't "give you a way". Recommend you visit our tutorials to get started, and the PyTorch tutorials for more general customization questions.
https://docs.ultralytics.com/yolov5/tutorials/train_custom_data https://github.com/ultralytics/yolov3/wiki/Example:-Transfer-Learning https://pytorch.org/tutorials/
I have a problem. I want to train some new classes and pictures using transfer learning, but my number of classes is 7, so if I use darknet53.conv.74 as the pretrained model, it doesn't work! What should I do?
@hac135 If you want to use a pretrained model for transfer learning but your own model has a different shape, what I know is to copy only the weights that have the same shape as in the pretrained model, and manually initialize the layers whose shapes differ.
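One way to picture the suggestion above is as a filter over parameter names and shapes. The sketch below is an illustration under stated assumptions, not code from this repo: plain dicts mapping a name to a shape tuple stand in for PyTorch state dicts (in real code the shapes would come from tuple(v.shape) for each tensor), and the function and layer names are hypothetical.

```python
def split_by_shape(pretrained_shapes, model_shapes):
    """Return (copyable, reinit) lists of parameter names for the new model.

    copyable: names present in both dicts with identical shapes.
    reinit:   names in the new model that are missing from the checkpoint
              or whose shape differs, so they need fresh initialization.
    """
    copyable, reinit = [], []
    for name, shape in model_shapes.items():
        if pretrained_shapes.get(name) == shape:
            copyable.append(name)
        else:
            reinit.append(name)
    return copyable, reinit


# Hypothetical example: the backbone layer matches, but the output layer
# shrinks from 255 filters (80 classes) to 36 filters (7 classes).
pretrained = {
    "backbone.conv.weight": (32, 3, 3, 3),
    "head.conv.weight": (255, 1024, 1, 1),
}
model = {
    "backbone.conv.weight": (32, 3, 3, 3),
    "head.conv.weight": (36, 1024, 1, 1),
}

copyable, reinit = split_by_shape(pretrained, model)
print(copyable)  # ['backbone.conv.weight']
print(reinit)    # ['head.conv.weight']
```

Looking names up with .get() means a parameter that is absent from the checkpoint is simply marked for re-initialization instead of raising a KeyError.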
@hac135 most people don't realize this, and it's not the recommended method to go about things, but you can technically use the existing YOLOv3 architecture (and hence the pretrained yolov3.pt) to train any model with n<=80 classes with no changes. The unused conf outputs will learn to simply default to zero, and the rest of the unused outputs (the box and class conf associated with those unused classes) will no longer matter.
For example, our single class tutorial operates just as well with no modifications to the cfg file: https://github.com/ultralytics/yolov3/wiki/Example:-Train-Single-Class
It's not clean and it's not optimal, but it works.
Thank you! It did work!
that's a good suggestion, thanks
@shahidammer try training from scratch, and observe your training results in results.txt.
@shahidammer please note that most technical problems are due to:
git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov3 # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # git clone latest
python3 detect.py # verify detection
python3 train.py # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
I want to retain the existing classes and add a new class, i.e. a total of 80+1=81 classes on the COCO dataset. Please tell me how to do it using transfer learning.
@parul19 you create a new 81 class cfg. Follow the directions in the example above.
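As a rough worked example for an 81 class cfg: the transfer learning wiki for this repo gives the output size of the layer preceding each YOLO layer as filters=[4 + 1 + n] * 3, where n is the class count. The helper name below is hypothetical; only the formula comes from the wiki.

```python
def yolo_filters(num_classes, anchors_per_layer=3):
    """Filters for the conv layer preceding a YOLO layer.

    Per anchor: 4 box coordinates + 1 object confidence + one confidence
    per class.
    """
    return (4 + 1 + num_classes) * anchors_per_layer


print(yolo_filters(80))  # 255, the stock COCO configuration
print(yolo_filters(81))  # 258, COCO plus one new class
```

With 81 classes, each of the 3 YOLO layers would also need classes=81 in the cfg.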
Do we still need COCO dataset if we only do transfer-learning?
@sooonism you need whatever dataset you want to train on.
@glenn-jocher
I am interested in extracting the vehicles on the road, so the classes I am interested in are Motorbike, Bicycle, Bus, Car and truck.
I have a vehicle that is not a truck but is being detected as truck. I have collected new data for this vehicle in COCO format. I want to add this as a new class to the existing pretrained network.
Planning to truck to this new class. My question is: how do I do it?
@Santhosh1509 well I would start by reviewing the examples in the wiki, such as the custom training tutorial: https://github.com/ultralytics/yolov3/wiki
@glenn-jocher Need your opinion on this. I just saw a post called transfer learning tutorial for SSD using keras.
It's mentioned in the second paragraph of "Option 1: Just ignore the fact that we need only 8 classes":
"This would work, and it wouldn't even be a terrible option. Since only 8 out of the 80 classes would get trained, the model might get gradually worse at predicting the other 72 classes."
So I feel that even if I could somehow train as I mentioned above for a particular new class, the prediction for the other classes might get affected.
Is my approach right? Is there an alternative way where I could preserve the prediction of the other classes while introducing this new class in the same neural network? I feel it needs to be trained from scratch then. What do you think?
@Santhosh1509 training normally will produce the best results. Transfer learning produces mediocre results quickly.
@glenn-jocher How do I get to know the training loss, training accuracy, validation loss and validation accuracy?
All I get is this during training.
Please guide me on how to tune my hyperparameters with the data that is being displayed here.
I could have increased the batch size, as I have more memory on the GPU.
I do not understand the comment on these lines.
PS: latest training image
The obj and cls values are decreasing; is that good for this training?
@Santhosh1509 all of the information you mention is recorded in results.txt. You can plot this with from utils.utils import *; plot_results(). You should use batch_size 64 accumulate 1 if possible; if not, compensate with smaller batch sizes and larger accumulation counts, i.e. batch_size 32 accumulate 2.
obj and cls are training losses, they are supposed to decrease during training. See https://github.com/ultralytics/yolov3/issues/392 for hyperparameter evolution, and explore the open issues for answers to your questions.
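The trade-off between batch size and accumulation count can be sketched as simple arithmetic. This assumes that accumulate is the number of batches processed before each optimizer step, which is how the advice above reads; the helper name is hypothetical.

```python
def effective_batch_size(batch_size, accumulate):
    """Number of images contributing to each optimizer step."""
    return batch_size * accumulate


# Each of these settings yields the same effective batch size of 64.
for batch_size, accumulate in [(64, 1), (32, 2), (16, 4)]:
    print(batch_size, accumulate, effective_batch_size(batch_size, accumulate))
```

Under that assumption, halving the batch size to fit GPU memory and doubling the accumulation count keeps the number of images per weight update unchanged.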
@glenn-jocher This is what is stored in results.txt. obj, cls, total, targets: I am confused as to how these relate to training loss, training accuracy, validation loss and validation accuracy.
Don't we have a graph which is easy to visualize, rather than just numbers?
Something like this.
Now we can even use the TensorBoard support inside PyTorch to visualize the values.
As the name suggests, HYPERPARAMETER EVOLUTION is about plotting those, not how these (training loss, training accuracy, validation loss and validation accuracy) changed per epoch.
@Santhosh1509 Tensorboard logs automatically in this repo if you have it installed. See https://github.com/ultralytics/yolov3/pull/435
@glenn-jocher Please explain how the obj, cls, total and targets values being displayed here relate to training loss, training accuracy, validation loss and validation accuracy.
I can only relate these terms: P -> Precision, R -> Recall, mAP -> mean Average Precision, F1 -> F1 score.
@glenn-jocher accuracy is a classification metric, it is not used here. The metrics displayed during training are training losses and the number of targets per batch.
@glenn-jocher obj or cls, which one of these is the training loss, and what do the other terms mean? Both of them decrease during training.
They are the object loss and the class loss. The training loss is the total of all training losses.
Hi @glenn-jocher, I followed the instructions above and tried transfer learning with the original COCO dataset. However, I found that sometimes some elements of the loss from the bbox_iou function are infinite. Apparently the variable 'pbox' has an extremely high value (3.438e+35), which causes an overflow to infinity when calculating c_area.
From what I checked, the variable 'ps' has values in the range [-1895, 80.24], and after
pbox = torch.cat((pxy, torch.exp(ps[:, 2:4]) * anchor_vec[i]), 1)
'pbox' has values ranging from 2.54e-21 to 3.44e+35,
so I guess this is where the problem comes from, but I don't know how to fix it. Any ideas? Thanks.
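The numbers reported above are consistent with a float32 overflow, which can be checked with plain arithmetic. The sketch below assumes that c_area multiplies two box dimensions of this magnitude; the clamp at the end is a hypothetical mitigation with an arbitrary bound, not something taken from this repo.

```python
import math

FLOAT32_MAX = 3.4028235e38  # largest finite float32 value

raw = 80.24            # upper end of the 'ps' range reported above
width = math.exp(raw)  # roughly 7e34, still representable in float32
area = width * width   # roughly 5e69, far beyond the float32 range

print(width < FLOAT32_MAX)  # True
print(area > FLOAT32_MAX)   # True, so float32 would store inf here

# Hypothetical mitigation: clamp the raw output before exponentiating.
MAX_RAW = 10.0  # arbitrary illustrative bound
safe_width = math.exp(min(raw, MAX_RAW))
print(safe_width * safe_width < FLOAT32_MAX)  # True
```

Python floats are 64-bit, so the script itself does not overflow; the comparison against FLOAT32_MAX shows what a float32 tensor would do with the same values.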
@jobpasin if transfer learning does not converge, simply train normally (which will produce better results anyway).
@jobpasin Hope this helps. A few points to note:
I have to collect more data since my obj loss doesn't go below 0.86 even after 273 epochs.
These videos below might be of some use, though they are about improving neural networks in general.
@glenn-jocher Unfortunately, I am going to train with a much smaller dataset afterward so I need to use transfer learning. On the other hand, with smaller batch size, the model sometimes converges.
@Santhosh1509 Thanks for the tips. My case is feature detection, like a circle, a star or a letter in a photo, so I think it is kind of similar? I am currently adjusting the learning rate as you said, hoping I can get some good results.
@jobpasin you could try Adam as well with an lr0 of about 1.5E-4.
You should use batch_size 64 accumulate 1 if possible, if not compensate with smaller batch sizes and larger accumulation counts, i.e. batch_size 32 accumulate 2.
I tried with batch_size 64 and accumulate 1 but I am getting a warning:
WARNING: non-finite loss, ending training tensor([ nan, 1.89257e+00, 3.75394e+04, nan], device='cuda:0')
and it crashed.
I have two 1080 Tis and I want to increase the batch size from 32 to 128 or more, but it crashes for all values except bs=16 and accumulate=2. Any suggestions?
learning rate is too high
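The warning quoted above reports a loss tensor containing nan entries. A check of that kind can be sketched with the standard library alone; this is an illustration using the values from the warning, not the repo's actual implementation, and the function name is hypothetical.

```python
import math


def has_non_finite(loss_items):
    """True if any loss component is nan or +/-inf."""
    return any(not math.isfinite(x) for x in loss_items)


# Values copied from the warning message quoted above.
loss_items = [float("nan"), 1.89257e00, 3.75394e04, float("nan")]

print(has_non_finite(loss_items))       # True
print(has_non_finite([0.5, 1.2, 3.0]))  # False
```

Once any component becomes nan, further weight updates are meaningless, which is presumably why training stops at that point.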
Hope this image helps you understand.
@Santhosh1509 yes that's a good example. High LRs may be an advantage at the beginning of training, but later on they will bounce around local minima without descending into them properly, just as in the charts you show, though ironically they may also prevent overtraining as a positive side effect. In general, though, best practice is to start with an LR of 1E-3 for SGD or 1E-4 for Adam, and reduce it by a gain of around 0.1 to 0.01 after 80% of epochs have been completed.
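The rule of thumb above (start at 1E-3 for SGD, then drop by a gain of about 0.1 once 80% of the epochs are done) can be written as a small function. This is a minimal sketch of that single-drop schedule; the function and parameter names are hypothetical and it is not the scheduler used in train.py.

```python
def lr_at_epoch(epoch, total_epochs, lr0=1e-3, gain=0.1, drop_fraction=0.8):
    """Learning rate for a 0-based epoch under a single late drop."""
    drop_epoch = round(total_epochs * drop_fraction)
    return lr0 * gain if epoch >= drop_epoch else lr0


for epoch in (0, 79, 80, 99):
    print(epoch, f"{lr_at_epoch(epoch, total_epochs=100):.1e}")
# 0 1.0e-03
# 79 1.0e-03
# 80 1.0e-04
# 99 1.0e-04
```

For Adam the same sketch would start from lr0=1e-4, following the values suggested above.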
Thank you for the prompt response.
I am using 'lr0': 0.00025 for --batch-size 192 --accumulate 2 --transfer --weights weights/yolov3.pt
Are there any other settings which I need to alter?
@shahidammer train with default settings, and then look at your results.png for guidance on tuning your training settings.
@glenn-jocher the default settings do not work, as they give me:
tried with batch_size 64 and accumulate 1 but I am getting a warning: WARNING: non-finite loss, ending training tensor([ nan, 1.89257e+00, 3.75394e+04, nan], device='cuda:0')
Thanks to @Santhosh1509's response, I decreased the LR from 0.001 to 0.00025, but after 20 epochs the mAP is still zero.
Use torch.optim.lr_scheduler.StepLR.
This is one way to decay the learning rate after a specific number of epochs; you can try it out.
torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)
>>> # Assuming optimizer uses lr = 0.05 for all groups
>>> # lr = 0.05 if epoch < 30
>>> # lr = 0.005 if 30 <= epoch < 60
>>> # lr = 0.0005 if 60 <= epoch < 90
>>> # ...
>>> scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
>>> for epoch in range(100):
>>>     train(...)
>>>     validate(...)
>>>     scheduler.step()
Source: torch.optim.lr_scheduler.StepLR
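The comments in the example above imply a closed form for the schedule: the base learning rate multiplied by gamma once for every completed block of step_size epochs. A minimal stdlib-only sketch of that arithmetic follows; the helper name is hypothetical and PyTorch is not needed to run it.

```python
def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Learning rate at a 0-based epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 60):
    print(epoch, f"{step_lr(0.05, epoch):.4f}")
# 0 0.0500
# 29 0.0500
# 30 0.0050
# 60 0.0005
```

This reproduces the values listed in the comments of the StepLR example for an initial lr of 0.05.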
@Santhosh1509 ah this could be caused by the aggressive LR gain we have on transfer learning. Sorry, we haven't been making transfer learning a priority, yes this makes sense then that you ended up with such a tiny lr0.
@Santhosh1509 training normally will produce the best results. Transfer learning produces mediocre results quickly.
Are you sure? @glenn-jocher
@aquiire this shows the coco_16img.data tutorial starting from a few different options, including transfer learning. Transfer learning as shown below typically freezes the main pretrained weights, which constrains its performance. You can replicate these results by running this code and looking at the resulting results.png file.
python3 train.py --data data/coco_64img.data --batch-size 16 --accumulate 1 --nosave --weights weights/ultralytics49.pt --name ultralytics49_start
python3 train.py --data data/coco_64img.data --batch-size 16 --accumulate 1 --nosave --weights weights/darknet53.conv.74 --name darknet53.conv.74_start
python3 train.py --data data/coco_64img.data --batch-size 16 --accumulate 1 --nosave --weights weights/yolov3-spp.weights --name yolov3-spp_start
python3 train.py --data data/coco_64img.data --batch-size 16 --accumulate 1 --nosave --weights weights/yolov3-spp.weights --transfer --name yolov3-spp_transfer
python3 train.py --data data/coco_64img.data --batch-size 16 --accumulate 1 --nosave --weights '' --name from_scratch
@glenn-jocher Thanks for the explanation. By training normally, did you mean training from scratch? If yes, then we have to compare orange with red neh?
@aquiire orange is with randomly initialized weights. Blue starts from the darknet53.conv.74 backbone; red and green both start from yolov3-spp.weights (red freezes all layers except outputs, which is typically called "transfer learning").
I am doing person detection from CCTV footage; however, there are instances where people are not detected, probably due to lighting, camera angle or warping (and other reasons).
My question is thus: how would I increase my network's accuracy? Is it okay to take these missed detections and just train on the new images via transfer learning?
Or must I download the COCO dataset for people, merge in my new images and retrain completely?
How many images are recommended for transfer learning?
Regards Andrew
@SurionAndrew of course there are instances of FPs and missed detections. If not, your mAP would be 100%. Transfer learning is a waste of time. Train from scratch with no backbone until validation losses begin to increase, then set your --epochs to that epoch and retrain to lock in LR drops. See https://github.com/ultralytics/yolov3/issues/310
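Finding the epoch at which validation loss stops improving can be sketched as follows. The loss values are made up for illustration, the helper name is hypothetical, and a real curve would be noisier and might need smoothing before applying this.

```python
def best_epoch(val_losses):
    """0-based index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


# Made-up validation losses: they fall, then begin to rise after epoch 3.
val_losses = [3.2, 2.5, 2.1, 1.9, 2.0, 2.3]

print(best_epoch(val_losses))  # 3
```

The value passed to --epochs would then be chosen around this point, keeping in mind that the index here counts from 0.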
I tried to start transfer learning with yolov3.pt downloaded from Google Drive, but immediately got this error:
File "D:/AI/yolov3/train.py", line 113, in <dictcomp>
    chkpt['model'] = {k: v for k, v in chkpt['model'].items() if model.state_dict()[k].numel() == v.numel()}
KeyError: module_list.78.Conv2d.weight
If I try the yolov3.weights file then I get another error:
File "D:\ai\yolov3\models.py", line 342, in load_darknet_weights
    conv_w = torch.from_numpy(weights[ptr:ptr + num_w]).view_as(conv_layer.weight)
RuntimeError: shape '[256, 128, 3, 3]' is invalid for input of size 282007
When training from scratch with my cfg and data files, no errors occurred.
Can anybody help me resolve it?
What is the full command you used to initialize the training?
@coolmarat your repo is out of date, git pull and try again.
Here is my result after 300 epochs, and this is a video of the result: https://www.youtube.com/watch?v=8r4BNEMv_2Y
You can see it has detected a car door. How can I solve this problem?
@SHikumo I don't understand your question. Your GIoU loss looks strange, you should ensure your boxes are labelled correctly.
Thanks for reviewing my transfer learning result. I'm sure that we have labelled the "one_class" object correctly, and I have cleaned out the bad data too. I have used about 1718 images of a person (different sizes, different angles), but the result is still acceptable. If you don't mind, please watch my result video: https://youtu.be/8r4BNEMv_2Y
This guide explains how to train your data with YOLOv3 using Transfer Learning. Transfer learning can be a useful way to quickly retrain YOLOv3 on new data without needing to retrain the entire network. We accomplish this by starting from the official YOLOv3 weights, and setting the .requires_grad field to false on each layer that we do not want to calculate gradients for and optimize.
Before You Start
git clone https://github.com/ultralytics/yolov3
bash yolov3/data/get_coco2017.sh
Transfer Learning
1. Download pretrained weights from our Google Drive folder that you want to use to transfer learn, and place them in yolov3/weights/.
2. Update *.cfg file (optional). Each YOLO layer has 255 outputs: 85 outputs per anchor [4 box coordinates + 1 object confidence + 80 class confidences], times 3 anchors. If you use fewer classes, reduce filters to filters=[4 + 1 + n] * 3, where n is your class count. This modification should be made to the layer preceding each of the 3 YOLO layers. Also modify classes=80 to classes=n in each YOLO layer, where n is your class count.
3. Train.
Run the above code to transfer learn on COCO, or specify your own data as --data data/custom.data (See https://github.com/ultralytics/yolov3/wiki/Train-Custom-Data).
If you created a custom *.cfg file, specify it as --cfg custom.cfg.
You can observe in the Model Summary (using model_info(model, report='full') in train.py) that only the 3 YOLO layers have their gradients activated now (all other layers are frozen for duration of training):
Reproduce Our Environment
To access an up-to-date working environment (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled), consider a: