How to train Faster R-CNN on my own dataset ?

JohnnyY8 commented 8 years ago

Hi everyone: I want to train Faster R-CNN on my own dataset. Because Faster R-CNN does not use the selective search method, I commented out the code related to selective search. However, there are still some errors about roidb and so on. Can anybody help me? I am not quite sure what I should do to train Faster R-CNN. It is a little complicated for me. Thanks so much!

ednarb29 commented 8 years ago

@JohnnyY8

Hi, I did the same thing. First, you should work through the code to see which functions are called where, and you should try demo.py. After that, the readme has a section called "Beyond the demo" which explains the basic procedure.

Additionally, you should search the issues in this repo. There are actually quite a lot of similar issues that ask the same question.

Furthermore, here is a really good documentation of the "how to train on own dataset". This helped me a lot.

Finally, I'll sum up the main steps for you:

1. Copy the structure of the Pascal VOC dataset into FRCN_ROOT/data/, create a symbolic link, and place your data in the same layout as the Pascal VOC dataset. That is the best way to avoid large code changes in the following steps.
2. Create a FRCN_ROOT/lib/datasets/.py and a _eval.py corresponding to pascal_voc.py and voc_eval.py.
3. Update FRCN_ROOT/lib/datasets/factory.py by adding a new entry for your own dataset.
4. Adapt the models under FRCN_ROOT/models/ by copying and changing an existing one like pascal_voc. Note that you have to take care of the paths within the solver and the number of classes in the train and test prototxt. I recommend starting with the ZF model and the end2end algorithm. The alt_opt variant is more complex and a better fit once you have more experience.
5. Create a config file under FRCN_ROOT/experiments/cfgs, also by copying and updating an existing one.
6. Create or update an experiment script under FRCN_ROOT/experiments/scripts by adapting it to your dataset.
7. Start training and testing by running the experiment script created in the previous step.

These are just the main steps I figured out while working with the framework. It will take some time to get into it, and several problems will come up when using the framework with your own dataset. Most of these problems are already addressed in other issues in this repo.

It might also be very helpful to use a python IDE that supports debugging.

Hope that helps. =)

JohnnyY8 commented 8 years ago

Hi @ednarb29 , thank you sincerely for your answer, I will try it now. I hope I can do it. In addition, the VID dataset has a lot of frames, more than one million. I am not quite sure whether the code will create a cache file for the VID dataset. Will it take a long time to load the frames every time? Thank you again!

ednarb29 commented 8 years ago

You can easily check that out, the file should be under FRCN_ROOT/data/cache/

Of course, if this file is huge, even loading the cache file will take some time, I guess. Maybe you should debug that. As a naive check, you can delete the cache file and start training again, so you can compare the time needed to create the dataset with the time needed to load the cache file.
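
A small sketch of that comparison, assuming the cache is a pickle file somewhere under FRCN_ROOT/data/cache/. The function names are made up for illustration, and the exact cache file name depends on your dataset class.

```python
import os
import pickle
import time


def time_cache_load(cache_file):
    """Load a pickled roidb cache and return (roidb, seconds taken)."""
    start = time.time()
    with open(cache_file, 'rb') as f:
        roidb = pickle.load(f)
    return roidb, time.time() - start


def remove_cache(cache_file):
    """Delete the cache so the next run rebuilds it; return True if removed."""
    if os.path.isfile(cache_file):
        os.remove(cache_file)
        return True
    return False
```

Calling `time_cache_load` on the existing file gives the load time; after `remove_cache`, the next training run has to rebuild the dataset, which gives the creation time to compare against.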

JohnnyY8 commented 8 years ago

Hi @ednarb29 , I have tried the method you described. There are some errors about selective_search that I can't handle, like the following. In my opinion, Faster R-CNN doesn't use selective search, so I would prefer to comment out the code related to selective search, such as "self.selective_search_roidb". But maybe that is not the right way to solve it. Could you please give me some suggestions?

tiepnh commented 8 years ago

@JohnnyY8 : Can you paste your configuration information here, as printed on the terminal? I guess that your configuration file still sets the proposal method to selective search.

JohnnyY8 commented 8 years ago

@tiepnh Hi! You are right. Following the tutorial "https://github.com/deboc/py-faster-rcnn/tree/master/help", I used the command ($ echo 'MODELS_DIR: "$PY_FASTER_RCNN/models"' >> config.yml) to generate config.yml. But if I change it to "experiments/cfgs/faster_rcnn_end2end.yml", it looks OK.
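
A rough sketch of why the two config files behave differently, assuming the config file is merged over built-in defaults and that the default proposal method is selective search. The default and override values below are written from memory of the repo and should be checked against lib/fast_rcnn/config.py and faster_rcnn_end2end.yml; `merge` is an illustrative helper, not the repo's own function.

```python
import copy

# Assumed defaults, mirroring what lib/fast_rcnn/config.py is believed to set.
DEFAULTS = {
    'MODELS_DIR': 'models/pascal_voc',
    'TRAIN': {'PROPOSAL_METHOD': 'selective_search', 'HAS_RPN': False},
}


def merge(overrides, defaults=DEFAULTS):
    """Recursively overlay overrides on a copy of defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(value, merged[key])
        else:
            merged[key] = value
    return merged


# A config.yml that only sets MODELS_DIR leaves the proposal method untouched.
tutorial_cfg = merge({'MODELS_DIR': '$PY_FASTER_RCNN/models'})
# The end2end config overrides it (values assumed, see the note above).
end2end_cfg = merge({'TRAIN': {'PROPOSAL_METHOD': 'gt', 'HAS_RPN': True}})

print(tutorial_cfg['TRAIN']['PROPOSAL_METHOD'])  # selective_search
print(end2end_cfg['TRAIN']['PROPOSAL_METHOD'])   # gt
```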

JohnnyY8 commented 8 years ago

@tiepnh @ednarb29 I can start training now, and it looks close to right. I will check it on the validation set after training finishes. Thanks for your help, guys! Another question is about factory.py, like the following. What does the split mean? If there are ["train", "val", "test"], what are they used for? train is for training, but what are val and test for?

tiepnh commented 8 years ago

@JohnnyY8 : This array points to your image set files. In the code you pasted, there is no image set file for testing, or the same image set is used for both training and testing.

Example for pascal_voc: the script file calls this command at training time:

./tools/train_net.py --gpu ${GPU_ID} \
  --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt \
  --weights data/prdcv_models/${NET}.v2.caffemodel \
  --imdb ${TRAIN_IMDB} \
  --iters ${ITERS} \
  --cfg experiments/cfgs/faster_rcnn_end2end.yml \
  ${EXTRA_ARGS}

TRAIN_IMDB is "voc_2007_trainval", so it loads all images listed in the image set file ".....trainval.txt".

For testing, TEST_IMDB="voc_2007_test" is used, so it loads the images listed in the image set file "....test.txt" to test the trained network.
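
To make the link between an imdb name and its image set file concrete, here is a minimal sketch. It assumes the Pascal VOC layout (ImageSets/Main/<split>.txt); the helper names are illustrative and not functions from the repo.

```python
import os


def split_from_imdb_name(imdb_name):
    """Split a name such as 'voc_2007_trainval' into (dataset, year, split)."""
    dataset, year, split = imdb_name.split('_', 2)
    return dataset, year, split


def image_set_file(data_path, split):
    """Path of the text file listing the image ids of one split."""
    return os.path.join(data_path, 'ImageSets', 'Main', split + '.txt')


def load_image_index(path):
    """Read one image id per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


_, _, split = split_from_imdb_name('voc_2007_trainval')
print(image_set_file('VOC2007', split))
```

So "train", "val" and "test" are simply the names of text files, and each file decides which images belong to that split.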

JohnnyY8 commented 8 years ago

@tiepnh Cool! Your answer is very useful and clear! Thanks so much! That means the ground truth of the PASCAL VOC 2007 test set is under the "Annotations" folder, right? Otherwise, it couldn't compute mAP after training finishes. But I do not have the ground truth of the VID test set, so I use TEST_IMDB="VID_val". Does that mean it will test on the validation set?

JohnnyY8 commented 8 years ago

@tiepnh Hi! I use command to start training:

sudo ./tools/train_net.py --gpu 0 --iters 100000 --weights data/imagenet_models/ZF.v2.caffemodel --imdb VID_train --cfg ./experiments/cfgs/faster_rcnn_end2end.yml --solver models/pascal_voc/ZF/faster_rcnn_end2end/solver.prototxt

but still got the following errors:

Traceback (most recent call last):
  File "./tools/train_net.py", line 112, in
    max_iters=args.max_iters)
  File "/usr/local/caffes/xlw/faster-rcnn-third/tools/../lib/fast_rcnn/train.py", line 155, in train_net
    roidb = filter_roidb(roidb)
  File "/usr/local/caffes/xlw/faster-rcnn-third/tools/../lib/fast_rcnn/train.py", line 145, in filter_roidb
    filtered_roidb = [entry for entry in roidb if is_valid(entry)]
  File "/usr/local/caffes/xlw/faster-rcnn-third/tools/../lib/fast_rcnn/train.py", line 134, in is_valid
    overlaps = entry['max_overlaps']
KeyError: 'max_overlaps'

Is there something wrong ?

tiepnh commented 8 years ago

@JohnnyY8 :

That means the ground truth of the PASCAL VOC 2007 test set is under the "Annotations" folder, right?

Yes. For both the test set and the train set, the ground truth of Pascal VOC is under Annotations.

TEST_IMDB just points to the set of images used for testing. So if you use the same image set for TRAIN_IMDB and TEST_IMDB, it will train and test the network on the same dataset. Secondly, you have to write your own test function. See this tutorial: https://github.com/deboc/py-faster-rcnn/tree/master/lib/datasets

The "max_overlaps" error suggests that your data has no foreground ROI or background ROI. So please check again the py file that reads your dataset.
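
A simplified, pure-Python sketch of the kind of check that filter_roidb performs, to show where a missing 'max_overlaps' key would surface. The thresholds are illustrative defaults rather than values read from your config, and the error message text is an addition for this sketch.

```python
def is_valid(entry, fg_thresh=0.5, bg_hi=0.5, bg_lo=0.0):
    """Keep an entry only if it has at least one usable fg or bg RoI.

    Thresholds are illustrative; the real values come from the config.
    """
    if 'max_overlaps' not in entry:
        raise KeyError("entry has no 'max_overlaps'; the roidb may not have "
                       "been prepared, or the cache may be stale")
    overlaps = entry['max_overlaps']
    has_fg = any(o >= fg_thresh for o in overlaps)
    has_bg = any(bg_lo <= o < bg_hi for o in overlaps)
    return has_fg or has_bg


roidb = [
    {'max_overlaps': [1.0, 1.0]},   # two ground-truth boxes
    {'max_overlaps': []},           # annotation without any object
]
kept = [e for e in roidb if is_valid(e)]
print('Filtered {} roidb entries'.format(len(roidb) - len(kept)))
```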

JohnnyY8 commented 8 years ago

@tiepnh Thank you so much! You are so nice. I have found some bugs and restarted training. Let's wait for the results. Really, thanks for your help!

JohnnyY8 commented 8 years ago

@tiepnh @ednarb29 Hi! I restarted training, but a strange problem occurred. I printed some paths from train.txt, like this: When I look at the information printed in the terminal, I notice that the data has been loaded many times! My teammate and I are pretty sure it has gone through the whole training set at least once. But this information shows it starting from 0000 again. Could you please help me? We have been loading training data for more than 20 hours. Thank you so much!

ednarb29 commented 8 years ago

First, I would suggest that you start training and testing with a very small dataset (100 images and 1k iterations), so that you can debug training and testing quite fast.
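
One way to get such a small dataset is to write a reduced image set file and point a new split at it. A minimal sketch, assuming the image set file holds one image id per line; `write_subset` and the file names are hypothetical.

```python
import random


def write_subset(src_file, dst_file, n=100, seed=0):
    """Copy n randomly chosen image ids from src_file to dst_file."""
    with open(src_file) as f:
        ids = [line.strip() for line in f if line.strip()]
    # A fixed seed keeps the subset reproducible between runs.
    random.Random(seed).shuffle(ids)
    subset = sorted(ids[:n])
    with open(dst_file, 'w') as f:
        f.write('\n'.join(subset) + '\n')
    return subset
```

For example, `write_subset('train.txt', 'train_small.txt', n=100)` would produce a 100-image list that a "train_small" split could load.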

Does the problem occur during creation of the data set or during training?

JohnnyY8 commented 8 years ago

@ednarb29 I am not quite sure. Several times before, loading the data took about 2~4 hours (it was also loaded repeatedly). But this time it is stranger. We did not change any code, we just restarted the training. The time for loading data is very long!

JohnnyY8 commented 8 years ago

@ednarb29 Do you load the data just once after starting training?

ednarb29 commented 8 years ago

I am not sure about that because this kind of problem did not occur for me... If I had problems with loading the data set I just removed the cache file and that solved the problem in most cases because changes on the original data set are not updated in the cache file. Sorry dude.

deboc commented 8 years ago

Hi @JohnnyY8, I completely agree with the idea of ednarb29, you should test with a (very) small dataset at first. Moreover, I'm pretty sure that it's a bad idea to print anything for each data input. That may be the cause of the enormous additional loading time you got.
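
If some progress output is still wanted, a common compromise is to print only every N items. A minimal sketch; `log_progress` and the interval are illustrative choices, not something from the repo.

```python
def log_progress(index, total, every=1000):
    """Return a progress line every `every` items, otherwise None."""
    if index % every == 0 or index == total - 1:
        return 'loaded {}/{}'.format(index + 1, total)
    return None


total = 5000
for i in range(total):
    message = log_progress(i, total)
    if message is not None:
        print(message)
```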

JohnnyY8 commented 8 years ago

@ednarb29 No need to be sorry, I should thank you! I will remove the cache file and restart training! Really, thanks for your help!

JohnnyY8 commented 8 years ago

@deboc That is right. I will try it. Thank you! If I print anything, will that cause a huge loading time?

deboc commented 8 years ago

I just bet it's not negligible. You were saying the loading time had risen from 4h to 20h, right? What did you change besides adding this print?

JohnnyY8 commented 8 years ago

@deboc Oh, I see. We only added the print code. So that is strange to us.

ednarb29 commented 8 years ago

Did removing the print command speed up the process?

And did removing the cache file and rebuilding the database solve your problem with the KeyError: 'max_overlaps'?

JohnnyY8 commented 8 years ago

@ednarb29 I didn't try removing the print command, because I really want to follow the process, and I guess the time it consumes is negligible. And removing the cache file works: my training restarted and got into the iterations. Thanks a lot!

ednarb29 commented 8 years ago

Cool, so if it works fine you can close the issue? =)

JohnnyY8 commented 8 years ago

@ednarb29 Sure, thank you very much!

GeorgiAngelov commented 8 years ago

@deboc , I have a quick question. I get the following error when I executed the following command:

Command: ./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/faster_rcnn_models/VGG16_faster_rcnn_final.caffemodel --imdb inria_train --cfg config.yml

Error:

.....
I0725 04:10:00.437233  3494 net.cpp:816] Ignoring source layer conv4_3
I0725 04:10:00.437252  3494 net.cpp:816] Ignoring source layer relu4_3
I0725 04:10:00.437268  3494 net.cpp:816] Ignoring source layer pool4
I0725 04:10:00.437296  3494 net.cpp:816] Ignoring source layer conv5_1
I0725 04:10:00.437314  3494 net.cpp:816] Ignoring source layer relu5_1
I0725 04:10:00.437331  3494 net.cpp:816] Ignoring source layer conv5_2
I0725 04:10:00.437350  3494 net.cpp:816] Ignoring source layer relu5_2
I0725 04:10:00.437366  3494 net.cpp:816] Ignoring source layer conv5_3
I0725 04:10:00.437384  3494 net.cpp:816] Ignoring source layer relu5_3
I0725 04:10:00.437397  3494 net.cpp:816] Ignoring source layer conv5_3_relu5_3_0_split
I0725 04:10:00.437405  3494 net.cpp:816] Ignoring source layer roi_pool5
F0725 04:10:00.737687  3494 net.cpp:829] Cannot copy param 0 weights from layer 'fc6'; shape mismatch.  Source param shape is 4096 25088 (102760448); target param shape is 4096 18432 (75497472). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.
*** Check failure stack trace: ***

I read that there's basically a difference in the expected size that the network has been setup to expect. The one thing that I can imagine is that I am using the faster-rcnn VGG16 model( data/faster_rcnn_models/VGG16_faster_rcnn_final.caffemodel )? Is it possible to use this model instead of the one you mentioned( data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel ) ?
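
The two shapes in the error message can be reproduced with a little arithmetic. The interpretation below (512 feature channels, with 7x7 RoI pooling on the source side and 6x6 on the target side) is an assumption that fits the numbers, not something stated in the log.

```python
def fc_input_size(channels, pooled_h, pooled_w):
    """Number of inputs a fully connected layer sees after RoI pooling."""
    return channels * pooled_h * pooled_w


source = fc_input_size(512, 7, 7)   # assumed: VGG16-style 7x7 pooling
target = fc_input_size(512, 6, 6)   # assumed: 6x6 pooling in the target net

print(source, 4096 * source)   # 25088 102760448
print(target, 4096 * target)   # 18432 75497472
```

Under that assumption, the saved fc6 weights were trained for a different pooled feature size than the one the INRIA_Person prototxt expects, which is why they cannot be copied.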

P.S. Thank you for that awesome tutorial !

deboc commented 8 years ago

Hi GeorgiAngelov, I see you are using a final faster-rcnn caffemodel as the pretrained network, but those don't have any fc6 layer, hence your issue. The classical way for another dataset would be to use a pretrained caffe classifier for your data, and use its train.prototxt to build a faster-rcnn model. So I suggest you investigate which classifier was used in your pretrained model, and provide this caffemodel (e.g. VGG_CNN_M_1024.v2.caffemodel) instead of the faster-rcnn one in the weights option.

JohnnyY8 commented 8 years ago

@GeorgiAngelov Hi! I think the weights should be set to the ImageNet pretrained model, not the Faster R-CNN final model. Hope it helps.

GeorgiAngelov commented 8 years ago

@deboc, is the VGG_CNN_M_1024.v2.caffemodel considered a pre-trained model ? I am wondering if this model in itself is already capable of classifying objects. My basic idea is that I would like to start training a model with my own data but I would like that model to already be a trained model so I can leverage the weights.

My idea is that you can pretty much start with a trained .caffemodel file such as the VGG16_faster_rcnn_final.caffemodel and then train it even further. It appears that this might not be possible with this model in particular.

My question is: What does the v2 stand for in VGG_CNN_M_1024.v2.caffemodel and can I get a final model from this model to actually use it with tools/demo.py for example?

@JohnnyY8 , thank you for clarifying that. Until now, I was assuming that a model is a model is a model. I did not differentiate between pretrained model and a final model. I guess I am still not clear on the distinction.

JohnnyY8 commented 8 years ago

@GeorgiAngelov If you want to start from the final caffemodel and train further, that may be OK. Just pay attention to differences in network architecture. I also do not know what v2 means. But following the tutorial, I treat it as the pre-trained model when I train Faster R-CNN on my own dataset. And the final caffemodel can be used directly to classify objects.

deboc commented 8 years ago

Some confusion here. Every .caffemodel contains a pretrained model, with the weights of a converged neural network. The ones of faster-rcnn just also happen to be called "final" models.

Before touching faster-rcnn, I suggest you start by getting more used to the caffe deep learning framework. A lot of pre-trained models can be found in the zoo and are ready to use. Most of them are classifiers that can infer an object class from an image. VGG_CNN_M_1024.v2.caffemodel is one of those (sorry, I don't know about the v2 either, but the originals are from there). Indeed, you can finetune a classifier by removing the last layer and adapting it to another dataset. For that, you can carefully change the learning rate of each layer in order to balance between a "start from scratch" policy and a "reuse the former network" policy. Good tutorials about caffe can be found on the Berkeley Vision website.

Now about faster-rcnn. It's a framework for object detection, developed by R. Girshick. It's using the convnet classifier of your choice and the training phase learns how to detect the objects classified by the underlying classifier. That's why you need to reuse or finetune a classifier for your data, before even considering detection (and faster-rcnn).

So :

If your objects are already classified by a converged model from the caffe zoo (e.g. 'aeroplane', 'bicycle', 'bird', 'person', etc for VGG), you can directly use this model to launch a faster-rcnn training
If not, forget faster-rcnn for now and take a look at the caffe tutorials to build your own classifier

vikiboy commented 8 years ago

@JohnnyY8 : Hey, could you share how you managed to solve the "max_overlaps" issue ?

JohnnyY8 commented 8 years ago

@vikiboy Hi, I do not remember it clearly, but it seems there were a few ground-truth xml files that did not contain any objects. I removed them and the corresponding images. Hope it helps.
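
A minimal sketch for finding such files, assuming Pascal VOC style annotations where each object is an <object> element directly under the root. The function names are illustrative.

```python
import os
import xml.etree.ElementTree as ET


def has_objects(xml_path):
    """True if the VOC-style annotation contains at least one <object>."""
    root = ET.parse(xml_path).getroot()
    return root.find('object') is not None


def find_empty_annotations(annotation_dir):
    """Return the sorted file names of annotations without any object."""
    empty = []
    for name in sorted(os.listdir(annotation_dir)):
        if name.endswith('.xml'):
            if not has_objects(os.path.join(annotation_dir, name)):
                empty.append(name)
    return empty
```

The returned names can then be dropped from the image set files, which is gentler than deleting the images themselves.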

JohnnyY8 commented 8 years ago

@vikiboy In addition, please pay attention to the coordinates in ImageNet: they start from 1, not 0. I remember that there are two places that need to be modified. The first is lib/datasets/your_dataset.py. The second is lib/datasets/imdb.py. I am not quite sure I remember correctly, so please try them.
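
A sketch of the coordinate handling this refers to, under the assumption that the dataset class shifts 1-based boxes to 0-based indices and that horizontal flipping mirrors boxes inside the image width. Both helpers are illustrative and are not copied from the repo.

```python
def to_zero_based(box, one_based=True):
    """Convert (xmin, ymin, xmax, ymax) to 0-based pixel indices."""
    offset = 1 if one_based else 0
    converted = tuple(int(v) - offset for v in box)
    if min(converted) < 0:
        # Happens when 0-based annotations are shifted a second time.
        raise ValueError('negative coordinate after conversion: {}'.format(converted))
    return converted


def flip_horizontally(box, width):
    """Mirror a 0-based box inside an image of the given width."""
    xmin, ymin, xmax, ymax = box
    return (width - xmax - 1, ymin, width - xmin - 1, ymax)


box = to_zero_based((1, 1, 10, 10))
print(box, flip_horizontally(box, 20))
```

Subtracting 1 from annotations that are already 0-based yields a negative coordinate, which is the situation the ValueError is meant to catch early.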

miyamon11 commented 7 years ago

Hi, I carried out ednarb29's method, but when I ran ./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml , I got the error below.

Output will be saved to /home/keisan/py-faster-rcnn/output/default/train
Filtered 0 roidb entries: 1228 -> 1228
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1107 12:32:17.155658 12497 io.cpp:36] Check failed: fd != -1 (-1 vs. -1) File not found: ~/py-faster-rcnn/models/INRIA_Person/faster_rcnn_alt_optpt/stage1_rpn_solver60k80k.pt
*** Check failure stack trace: ***

The file "stage1_rpn_solver60k80k.pt" exists in ~/py-faster-rcnn/models/INRIA_Person/faster_rcnn_alt_opt .

What should I do?

JohnnyY8 commented 7 years ago

@miyamon11 Hi: I did not try to train a model with alt_opt. But looking at the error info "~/py-faster-rcnn/models/INRIA_Person/faster_rcnn_alt_optpt/stage1_rpn_solver60k80k.pt", is there a problem here? I mean "optpt".
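
A small sketch for checking a solver path before training starts. It assumes the path is assembled from a models directory, a net name, a stage directory and a file name; the helper names are hypothetical. It also expands "~", since a literal tilde in a path is generally not expanded outside a shell and could be a second reason for a "File not found".

```python
import os


def solver_path(models_dir, net_name, stage_dir, solver_file):
    """Build a solver path, expanding '~' so the file check is meaningful."""
    path = os.path.join(models_dir, net_name, stage_dir, solver_file)
    return os.path.abspath(os.path.expanduser(path))


def check_solver(path):
    """Raise early, with the full path, if the solver file is missing."""
    if not os.path.isfile(path):
        raise IOError('Solver not found: {}'.format(path))
    return path


path = solver_path('~/py-faster-rcnn/models', 'INRIA_Person',
                   'faster_rcnn_alt_opt', 'stage1_rpn_solver60k80k.pt')
print(path)
```

Using os.path.join with separate components also avoids accidental concatenations such as "faster_rcnn_alt_optpt".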

xinleipan commented 7 years ago

I followed this tutorial but got the following errors:

Traceback (most recent call last):
  File "./tools/train_net.py", line 113, in
    max_iters=args.max_iters)
  File "/media/username/DC1A-EA60/git14/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 157, in train_net
    pretrained_model=pretrained_model)
  File "/media/username/DC1A-EA60/git14/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 53, in __init__
    self.solver.net.layers[0].set_roidb(roidb)
  File "/media/username/DC1A-EA60/git14/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 68, in set_roidb
    self._shuffle_roidb_inds()
  File "/media/username/DC1A-EA60/git14/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 35, in _shuffle_roidb_inds
    inds = np.reshape(inds, (-1, 2))
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 224, in reshape
    return reshape(newshape, order=order)
ValueError: total size of new array must be unchanged

Any ideas?

vaklyuenkov commented 7 years ago

inds = np.reshape(inds, (-1, 2)): because the second dimension of the reshape is 2, you should use only an even number of images in the dataset.
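
A pure-Python illustration of the same constraint, without NumPy: grouping indices two by two only works for an even count. `pair_indices` and `drop_to_even` are illustrative helpers, and dropping the last image is just one possible workaround.

```python
def pair_indices(indices):
    """Group indices two by two, as a (-1, 2) reshape would."""
    if len(indices) % 2 != 0:
        raise ValueError('need an even number of images, got {}'.format(len(indices)))
    return [indices[i:i + 2] for i in range(0, len(indices), 2)]


def drop_to_even(indices):
    """Drop the last index if the count is odd."""
    return indices[:len(indices) - (len(indices) % 2)]


print(pair_indices(drop_to_even([0, 1, 2, 3, 4])))  # [[0, 1], [2, 3]]
```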

dantp-ai commented 7 years ago

@GeorgiAngelov The tutorial of @deboc uses the image_net model VGG_CNN_M_1024.v2.caffemodel. You can get it by following the steps here https://github.com/deboc/py-faster-rcnn#download-pre-trained-imagenet-models.

arasharchor commented 7 years ago

@ednarb29

first I would suggest you to start training and testing with a very little data set (100 images and 1k iterations), that you can debug the training and testing quite fast.

Does the problem occur during creation of the data set or during training?

Thanks I had the same problem:

overlaps = entry['max_overlaps']
KeyError: 'max_overlaps'

I deleted the cache file and it is now running.

MyVanitar commented 7 years ago

@ednarb29

What tool should I use to create imdb files?

ArturoDeza commented 7 years ago

@ednarb29 , removing cache file fixed problem for me regarding the max_overlaps

MyVanitar commented 7 years ago

@ArturoDeza What tool/code have you used to make imdb file for training?

ArturoDeza commented 7 years ago

@VanitarNordic , I don't think there's a quick recipe for that. I've been following this setup: https://github.com/smallcorgi/Faster-RCNN_TF You will have to modify some lines of code in factory.py, copy the pascal_voc.py file to your my_dataset.py file, and modify the lines of code regarding the number of training classes. Besides that, you also need to annotate all your images with .xml files.
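
As an illustration of the class-related lines, here is a minimal sketch with a made-up class list. It assumes the convention of keeping a background class at index 0 and of regressing four box values per class; the variable names are hypothetical.

```python
# Hypothetical class list; the background class stays first.
CLASSES = ('__background__', 'person', 'car')

# Class name -> integer label used for the ground-truth boxes.
CLASS_TO_IND = dict(zip(CLASSES, range(len(CLASSES))))

num_classes = len(CLASSES)        # outputs of the classification layer
bbox_outputs = 4 * num_classes    # four box values per class for regression

print(CLASS_TO_IND['car'], num_classes, bbox_outputs)  # 2 3 12
```

The class count includes the background, so a dataset with two object classes needs 3 classification outputs.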

MyVanitar commented 7 years ago

@ArturoDeza Thanks, actually I have annotated files but I'm stuck on imdb creation :-(

ArturoDeza commented 7 years ago

@VanitarNordic What is the error you've been getting? You should create a new issue with the error you get when you run the end2end training script, that way we can be more helpful.

MyVanitar commented 7 years ago

@ArturoDeza No error, but I don't understand one thing: when we have a custom dataset, how is the model told to train on it? The end-to-end training script does not seem to have a dataset input parameter.

roshanpati commented 7 years ago

Hi! I am getting the following error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "./tools/train_faster_rcnn_alt_opt.py", line 129, in train_rpn
    max_iters=max_iters)
  File "/home/siplab/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 160, in train_net
    model_paths = sw.train_model(max_iters)
  File "/home/siplab/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 101, in train_model
    self.solver.step(1)
  File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 144, in forward
    blobs = self._get_next_minibatch()
  File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 63, in _get_next_minibatch
    return get_minibatch(minibatch_db, self._num_classes)
  File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py", line 22, in get_minibatch
    assert(cfg.TRAIN.BATCH_SIZE % num_images == 0), \
ZeroDivisionError: integer division or modulo by zero

Can anyone help me with that?

medhani commented 7 years ago

I'm using the INRIA Person dataset. After running the command below

./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml

I got an error:

  File "./tools/train_faster_rcnn_alt_opt.py", line 62
    print 'Loaded dataset {:s} for training'.format(imdb.name)
    ^
SyntaxError: invalid syntax

Can you please let me know the reason behind this error?

rbgirshick / py-faster-rcnn

How to train Faster R-CNN on my own dataset ? #243