JohnnyY8 opened this issue 8 years ago
@JohnnyY8
Hi, I did the same thing. First, you should work through the code to see which functions are called where, and you should try demo.py. After that, read the section of the README called "Beyond the demo", which explains the basic procedure.
Additionally, you should search the issues in this repo. There are quite a lot of similar issues that ask the same question.
Furthermore, here is a really good guide on "how to train on your own dataset". It helped me a lot.
Finally, I'll sum up the main steps for you:
These are just the main steps I figured out while working with the framework. It will take some time to get into it, and several problems will come up when you use the framework with your own dataset. Most of these problems are already addressed in other issues in this repo.
It might also be very helpful to use a python IDE that supports debugging.
Hope that helps. =)
Hi @ednarb29, thank you sincerely for your answer. I will try it now and hope I can make it work. In addition, the VID dataset has a lot of frames, more than one million. I am not sure whether the code creates a cache file for the VID dataset. Otherwise, will loading the frames take a long time every time? Thank you again!
You can easily check that: the file should be under FRCN_ROOT/data/cache/.
Of course, if this file is huge, I guess even loading the cache file takes some time. Maybe you should debug that. Naively, you can delete the cache file and start training again. That way you can compare the time it takes to create the dataset with the time it takes to load the cache file.
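For that comparison, a small helper can remove the cached files before a run. This is a minimal sketch, assuming the cache lives in FRCN_ROOT/data/cache/ as described above and that the cache files end in .pkl. Look at your own cache directory before deleting anything. The function name clear_roidb_cache is hypothetical and is not part of py-faster-rcnn.

```python
import glob
import os


def clear_roidb_cache(frcn_root, pattern="*.pkl"):
    """Delete cached roidb files under FRCN_ROOT/data/cache/.

    Assumption: the cache files match `pattern` (*.pkl by default).
    Returns the list of removed paths so you can log what was deleted.
    """
    cache_dir = os.path.join(frcn_root, "data", "cache")
    removed = []
    for path in sorted(glob.glob(os.path.join(cache_dir, pattern))):
        os.remove(path)
        removed.append(path)
    return removed
```

Call it with the path of your py-faster-rcnn checkout, for example clear_roidb_cache("/path/to/py-faster-rcnn"), then time the next start of training.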
Hi @ednarb29, I have tried the method you suggested. There are some errors about selective_search that I can't handle, like the following. As I understand it, Faster R-CNN doesn't use selective search, so I'd prefer to comment out the selective search code, such as "self.selective_search_roidb". But maybe that is not the right way to solve it. Could you please give me some suggestions?
@JohnnyY8: Can you paste the configuration information printed in the terminal here? I guess your configuration file still selects selective search as the proposal method.
@tiepnh Hi! You are right. Following the tutorial "https://github.com/deboc/py-faster-rcnn/tree/master/help", I used the command ($ echo 'MODELS_DIR: "$PY_FASTER_RCNN/models"' >> config.yml) to generate config.yml. But if I use "experiments/cfgs/faster_rcnn_end2end.yml" instead, it looks OK.
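The two config files differ in which proposal method training uses. py-faster-rcnn merges the YAML file passed with --cfg into its defaults in lib/fast_rcnn/config.py. A config.yml that only sets MODELS_DIR therefore leaves the proposal method at its default. The sketch below models that merge with plain dicts. As far as I recall, the default TRAIN.PROPOSAL_METHOD is 'selective_search' and faster_rcnn_end2end.yml sets it to 'gt', but verify this in your checkout. merge_cfg is a hypothetical stand-in for the repo's own merge code.

```python
import copy

# Assumed defaults, modeled on (not copied from) lib/fast_rcnn/config.py.
DEFAULTS = {
    "MODELS_DIR": "models",
    "TRAIN": {"HAS_RPN": False, "PROPOSAL_METHOD": "selective_search"},
}


def merge_cfg(overrides, base):
    """Recursively merge `overrides` into a copy of `base` (hypothetical helper)."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_cfg(value, merged[key])
        else:
            merged[key] = value
    return merged


# What the echo command writes: only MODELS_DIR is overridden.
only_models_dir = {"MODELS_DIR": "$PY_FASTER_RCNN/models"}
# Assumed relevant part of experiments/cfgs/faster_rcnn_end2end.yml.
end2end = {"TRAIN": {"HAS_RPN": True, "PROPOSAL_METHOD": "gt"}}

print(merge_cfg(only_models_dir, DEFAULTS)["TRAIN"]["PROPOSAL_METHOD"])  # selective_search
print(merge_cfg(end2end, DEFAULTS)["TRAIN"]["PROPOSAL_METHOD"])  # gt
```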
@tiepnh @ednarb29 I can start training now, and it looks like I'm on the right track. I will check the model on the validation set after training finishes. Thanks for your help, guys! Another question is about factory.py, shown below. What does the split mean? If there are ["train", "val", "test"], what are they used for? train is for training, but what are val and test for?
@JohnnyY8: This array points to your image set files. In the code you pasted, either there is no image set file for testing, or the same image set is used for both training and testing.
Example: for pascal_voc,
the script calls this command for training:
time ./tools/train_net.py --gpu ${GPU_ID} \
  --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt \
  --weights data/prdcv_models/${NET}.v2.caffemodel \
  --imdb ${TRAIN_IMDB} \
  --iters ${ITERS} \
  --cfg experiments/cfgs/faster_rcnn_end2end.yml \
  ${EXTRA_ARGS}
TRAIN_IMDB is "voc_2007_trainval", so the script loads all images listed in the image set file ".....trainval.txt".
For testing, it uses TEST_IMDB="voc_2007_test", so it loads the images listed in the image set file "....test.txt" to test the trained network.
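This sketch shows how a split name maps to an image set file. It is loosely modeled on the pattern in lib/datasets/factory.py and pascal_voc.py. The prefix 'vid', the layout <data_path>/ImageSets/Main/<split>.txt, and all helper names are assumptions for illustration. They are not the repo's API.

```python
import os

# Registry from imdb name (what --imdb receives) to a loader; hypothetical.
_SETS = {}


def load_image_set_index(data_path, split):
    """Read the image IDs listed in <data_path>/ImageSets/Main/<split>.txt."""
    image_set_file = os.path.join(data_path, "ImageSets", "Main", split + ".txt")
    if not os.path.exists(image_set_file):
        raise IOError("Image set file not found: " + image_set_file)
    with open(image_set_file) as f:
        return [line.strip() for line in f if line.strip()]


def register_splits(prefix, data_path, splits=("train", "val", "trainval", "test")):
    """Register one imdb name per split, e.g. 'vid_train', 'vid_val'."""
    for split in splits:
        # The default argument binds the current split (avoids late binding).
        _SETS["{}_{}".format(prefix, split)] = (
            lambda split=split: load_image_set_index(data_path, split))


def get_imdb(name):
    """Look up a registered name and load its image IDs."""
    if name not in _SETS:
        raise KeyError("Unknown dataset: {}".format(name))
    return _SETS[name]()
```

With this layout, TRAIN_IMDB="vid_trainval" reads trainval.txt and TEST_IMDB="vid_val" reads val.txt. A split whose file is missing fails only when that split is requested.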
@tiepnh Cool! Your answer is very useful and clear! Thanks so much! That means the ground truth of the PASCAL VOC 2007 test set is under the "Annotations" folder, right? Otherwise, it couldn't compute mAP after training finishes. But I do not have the ground truth of the VID test set, so I use TEST_IMDB="VID_val". Does that mean it will test on the validation set?
@tiepnh Hi! I use command to start training:
but I still got the following errors:
Traceback (most recent call last):
File "./tools/train_net.py", line 112, in
Is there something wrong?
@JohnnyY8 :
About "the ground truth of the PASCAL VOC 2007 test set is under the Annotations folder, right?": for both the test set and the train set, the Pascal VOC ground truth is under Annotations.
TEST_IMDB just points to the set of images used for testing. So if you use the same image set for TRAIN_IMDB and TEST_IMDB, the network is trained and tested on the same dataset. Second, you have to write your own test function. See this tutorial: https://github.com/deboc/py-faster-rcnn/tree/master/lib/datasets
The "max_overlaps" error suggests that your data has no foreground ROIs or no background ROIs. So please check the .py file you use to read your dataset again.
@tiepnh Thank you so much! You are so nice. I found some bugs and restarted training. Let's wait for the results. Really, thanks for your help!
@tiepnh @ednarb29 Hi! I restarted training, but a strange problem occurred. I printed some of the paths in train.txt, like this: When I look at the printed information in the terminal, I notice that the data has been loaded many times! My teammate and I are pretty sure it has gone through the whole training set at least once, but this output shows it starting from 0000 again. Could you please help me? We have been loading the training data for more than 20 hours. Thank you so much!
First, I would suggest starting training and testing with a very small dataset (100 images and 1k iterations), so that you can debug training and testing quickly.
Does the problem occur while the dataset is being created or during training?
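To follow that suggestion without touching the full image set, you can write a reduced image set file and use it as an extra split. This is a minimal sketch. It assumes one image ID per line, as in the image set files discussed earlier. The helper make_subset and the file names are hypothetical.

```python
import random


def make_subset(src_path, dst_path, n=100, seed=0):
    """Write up to n image IDs sampled from src_path to dst_path, one per line."""
    with open(src_path) as f:
        ids = [line.strip() for line in f if line.strip()]
    rng = random.Random(seed)  # fixed seed so the debug subset is reproducible
    subset = ids if len(ids) <= n else rng.sample(ids, n)
    with open(dst_path, "w") as f:
        f.write("\n".join(subset) + "\n")
    return subset
```

For example, make_subset("<devkit>/ImageSets/Main/train.txt", "<devkit>/ImageSets/Main/train_small.txt"), then train on that split with --iters 1000.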
@ednarb29 I am not quite sure. Several times before, loading the data took about 2~4 hours (and it was also loaded repeatedly). But this time is stranger: we did not change any code, we just restarted the training, and loading the data takes very long!
@ednarb29 Do you load the data only once after starting training?
I am not sure about that, because this kind of problem did not occur for me... When I had problems loading the dataset, I just removed the cache file, and that solved the problem in most cases, because changes to the original dataset are not reflected in the cache file. Sorry, dude.
Hi @JohnnyY8, I completely agree with ednarb29: you should test with a (very) small dataset first. Moreover, I'm pretty sure it's a bad idea to print anything for each data input. That may be the cause of the enormous additional loading time you got.
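If you still want progress output while the data is loading, you can print every N images instead of every image to keep console output small. This is a minimal sketch. log_progress and log_every are hypothetical names and are not part of py-faster-rcnn.

```python
import sys


def log_progress(items, log_every=1000, stream=sys.stdout):
    """Yield items unchanged, writing a progress line every `log_every` items."""
    total = len(items)
    for i, item in enumerate(items, start=1):
        if i % log_every == 0 or i == total:
            stream.write("loaded {}/{}\n".format(i, total))
        yield item
```

Wrap the loop that reads your image list, e.g. for index in log_progress(image_index): ..., instead of printing each path.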
@ednarb29 No need to be sorry, I should thank you! I will remove the cache file and restart training! Really, thanks for your help!
@deboc That is right. I will try it. Thank you! Can printing really cause such a huge loading time?
I'd just bet it's not negligible. You said the loading time had risen from 4h to 20h, right? What did you change besides adding this print?
@deboc Oh, I see. We only added the print statements, so this is strange to us.
Did removing the print command speed up the process?
And did removing the cache file and building the database again solve your problem with the KeyError: 'max_overlaps'?
@ednarb29 I didn't try removing the print command, because I really want to follow the process, and I guess the time it takes is negligible. Removing the cache file worked: training now reaches the iterations. Thanks a lot!
Cool, so if it works fine you can close the issue? =)
@ednarb29 Sure, thank you very much!
@deboc, I have a quick question. I got the following error when I executed the following command:
Command:
./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/faster_rcnn_models/VGG16_faster_rcnn_final.caffemodel --imdb inria_train --cfg config.yml
Error:
.....
I0725 04:10:00.437233 3494 net.cpp:816] Ignoring source layer conv4_3
I0725 04:10:00.437252 3494 net.cpp:816] Ignoring source layer relu4_3
I0725 04:10:00.437268 3494 net.cpp:816] Ignoring source layer pool4
I0725 04:10:00.437296 3494 net.cpp:816] Ignoring source layer conv5_1
I0725 04:10:00.437314 3494 net.cpp:816] Ignoring source layer relu5_1
I0725 04:10:00.437331 3494 net.cpp:816] Ignoring source layer conv5_2
I0725 04:10:00.437350 3494 net.cpp:816] Ignoring source layer relu5_2
I0725 04:10:00.437366 3494 net.cpp:816] Ignoring source layer conv5_3
I0725 04:10:00.437384 3494 net.cpp:816] Ignoring source layer relu5_3
I0725 04:10:00.437397 3494 net.cpp:816] Ignoring source layer conv5_3_relu5_3_0_split
I0725 04:10:00.437405 3494 net.cpp:816] Ignoring source layer roi_pool5
F0725 04:10:00.737687 3494 net.cpp:829] Cannot copy param 0 weights from layer 'fc6'; shape mismatch. Source param shape is 4096 25088 (102760448); target param shape is 4096 18432 (75497472). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.
*** Check failure stack trace: ***
I read that this basically means the network was set up to expect a different size. The one thing I can think of is that I am using the Faster R-CNN VGG16 model (data/faster_rcnn_models/VGG16_faster_rcnn_final.caffemodel). Is it possible to use this model instead of the one you mentioned (data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel)?
P.S. Thank you for that awesome tutorial!
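The numbers in the fatal log line can be unpacked. fc6 is fully connected to the flattened RoI-pooled conv5 features, so its input size is channels x pooled height x pooled width. This check assumes conv5 has 512 channels in both nets, which is true for VGG16 but an assumption for the target prototxt. Under that assumption, the source weights expect a 7x7 pooled grid and the target net a 6x6 one. In other words, the weights and the prototxt come from different architectures.

```python
# fc6 shapes from the Caffe log: (num_outputs, num_inputs).
source_fc6 = (4096, 25088)  # from VGG16_faster_rcnn_final.caffemodel
target_fc6 = (4096, 18432)  # expected by the INRIA_Person prototxt

# fc6 input = channels * pooled_h * pooled_w (assumption: 512 conv5 channels).
channels = 512
assert source_fc6[1] == channels * 7 * 7  # 7x7 pooled grid
assert target_fc6[1] == channels * 6 * 6  # 6x6 pooled grid

# The counts in parentheses in the log are just the products of the two dims.
assert source_fc6[0] * source_fc6[1] == 102760448
assert target_fc6[0] * target_fc6[1] == 75497472
```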
Hi GeorgiAngelov, I see you are using a final Faster R-CNN caffemodel as the pretrained network, but its fc6 layer doesn't match the fc6 your network expects, hence your issue. The classical approach for another dataset is to use a pretrained Caffe classifier for your data and use its train.prototxt to build a Faster R-CNN model. So I suggest you find out which classifier your pretrained model was built on and pass that caffemodel (e.g. VGG_CNN_M_1024.v2.caffemodel) in the --weights option instead of the Faster R-CNN one.
@GeorgiAngelov Hi! I think --weights should point to an ImageNet-pretrained model, not the final Faster R-CNN model. Hope that helps.
@deboc, is VGG_CNN_M_1024.v2.caffemodel considered a pretrained model? I am wondering whether this model by itself can already classify objects. My basic idea is to start training a model with my own data, but I would like that model to already be trained so I can leverage its weights.
My idea is that you can pretty much start with a trained .caffemodel file such as the VGG16_faster_rcnn_final.caffemodel and then train it even further. It appears that this might not be possible with this model in particular.
My question is: What does the v2 stand for in VGG_CNN_M_1024.v2.caffemodel and can I get a final model from this model to actually use it with tools/demo.py for example?
@JohnnyY8 , thank you for clarifying that. Until now, I was assuming that a model is a model is a model. I did not differentiate between pretrained model and a final model. I guess I am still not clear on the distinction.
@GeorgiAngelov If you want to start from a final caffemodel and train it further, that may be OK. Just pay attention to the differences between the network architectures. I also do not know what v2 means, but following the tutorial, I treat it as a pretrained model when I train Faster R-CNN on my own dataset. And the final caffemodel can be used directly to classify objects.
There's some confusion here. Every .caffemodel contains a pretrained model, that is, the weights of a converged neural network. The Faster R-CNN ones just happen to be called "final" models.
Before touching faster-rcnn, I suggest you get more familiar with the Caffe deep learning framework. A lot of pretrained models can be found in the Model Zoo and are ready to use. Most of them are classifiers that infer an object class from an image. VGG_CNN_M_1024.v2.caffemodel is one of those (sorry, I don't know what the v2 means either, but the originals come from there). Indeed, you can finetune a classifier by removing the last layer and adapting it to another dataset. To do that, you can carefully adjust the learning rate of each layer to strike a balance between a "start from scratch" policy and a "reuse the former network" policy. Good tutorials about Caffe can be found on the Berkeley Vision website.
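In Caffe, per-layer control comes from the lr_mult field in each layer's param blocks. Before the solver's lr_policy schedule is applied, a parameter's rate is the solver's base_lr times its lr_mult, and lr_mult: 0 freezes it. The example below illustrates that arithmetic. The layer names and multipliers are made-up examples, not values from any shipped prototxt.

```python
def effective_learning_rates(base_lr, lr_mults):
    """Map layer name -> base_lr * lr_mult (0 means the layer is frozen)."""
    return {name: base_lr * mult for name, mult in lr_mults.items()}


# Hypothetical finetuning setup: frozen early conv, base-rate middle, fast new head.
rates = effective_learning_rates(0.001, {
    "conv1": 0.0,     # "reuse the former network": weights stay fixed
    "conv5": 1.0,     # finetune at the base rate
    "fc8_new": 10.0,  # renamed last layer, learned from scratch
})
for name, lr in sorted(rates.items()):
    print(name, lr)
```

Renaming the last layer matters because Caffe copies weights by layer name. This is the same mechanism the earlier fc6 error message refers to with "rename the layer".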
Now about faster-rcnn. It's an object detection framework developed by R. Girshick. It uses the convnet classifier of your choice, and the training phase learns how to detect the objects that the underlying classifier recognizes. That's why you need to reuse or finetune a classifier for your data before you even consider detection (and faster-rcnn).
So :
@JohnnyY8: Hey, could you share how you managed to solve the "max_overlaps" issue?
@vikiboy Hi, I do not remember it clearly, but it seems that a few ground-truth XML files did not contain any objects. I removed them and the corresponding images. Hope that helps.
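To find such files before training, you can scan the annotation folder for XML files that have no <object> entry. This is a minimal sketch. It assumes Pascal-VOC-style XML, with one <object> per box directly under the root element. find_empty_annotations is a hypothetical helper. After removing the listed files and their images, also remove their IDs from the image set file and delete the cache file so the roidb is rebuilt.

```python
import glob
import os
import xml.etree.ElementTree as ET


def find_empty_annotations(annotation_dir):
    """Return Pascal-VOC-style XML files that contain no <object> element."""
    empty = []
    for path in sorted(glob.glob(os.path.join(annotation_dir, "*.xml"))):
        root = ET.parse(path).getroot()
        if not root.findall("object"):
            empty.append(path)
    return empty
```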
@vikiboy In addition, please pay attention to the ImageNet coordinates: they start from 1, not 0. I remember that two places need to be modified. The first one is lib/datasets/your_dataset.py. The second one is lib/datasets/imdb.py. I am not quite sure I remember correctly, so please try them.
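The offset matters because, as far as I recall, pascal_voc.py subtracts 1 from every coordinate and stores the boxes as uint16. A coordinate that becomes negative after that subtraction cannot be represented and wraps to a huge value. That value later breaks the flipped-box check in imdb.py. Whatever convention your annotations use, it helps to make the offset explicit and fail loudly. This is a minimal sketch. box_from_xml_values and its one_based flag are hypothetical.

```python
import numpy as np


def box_from_xml_values(xmin, ymin, xmax, ymax, one_based=True):
    """Convert annotation coordinates to a 0-based uint16 box [x1, y1, x2, y2].

    one_based=True subtracts 1 (Pascal VOC convention); use one_based=False
    for annotations that already start at 0. Negative results raise instead
    of silently wrapping around in uint16.
    """
    offset = 1 if one_based else 0
    coords = np.array([xmin, ymin, xmax, ymax], dtype=np.float64) - offset
    if (coords < 0).any():
        raise ValueError("negative coordinate after offset: {}".format(coords))
    return coords.astype(np.uint16)
```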
Hi, I followed ednarb29's method, but when I ran
./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml
I got the error below.
Output will be saved to /home/keisan/py-faster-rcnn/output/default/train
Filtered 0 roidb entries: 1228 -> 1228
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1107 12:32:17.155658 12497 io.cpp:36] Check failed: fd != -1 (-1 vs. -1) File not found: ~/py-faster-rcnn/models/INRIA_Person/faster_rcnn_alt_optpt/stage1_rpn_solver60k80k.pt
*** Check failure stack trace: ***
The file "stage1_rpn_solver60k80k.pt" exists in ~/py-faster-rcnn/models/INRIA_Person/faster_rcnn_alt_opt.
What should I do?
@miyamon11 Hi: I have not trained a model with alt_opt. But look at the path in the error message, "~/py-faster-rcnn/models/INRIA_Person/faster_rcnn_alt_optpt/stage1_rpn_solver60k80k.pt": is there a problem there? I mean the "optpt" part.
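The "optpt" looks like two path pieces joined without a separator. Also, the logged path starts with a literal ~, which only the shell expands. C++ file I/O such as Caffe's does not. Building the path with os.path.join and expanding ~ explicitly rules out both problems. This is a minimal sketch. The solver file name comes from the error above, solver_path is a hypothetical helper, and I don't know how your script actually built the path.

```python
import os


def solver_path(models_dir, net_name, solver_file="stage1_rpn_solver60k80k.pt"):
    """Build <models_dir>/<net_name>/faster_rcnn_alt_opt/<solver_file>."""
    models_dir = os.path.expanduser(models_dir)  # turn "~/..." into an absolute path
    return os.path.join(models_dir, net_name, "faster_rcnn_alt_opt", solver_file)
```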
I followed this tutorial but got the following errors:
Traceback (most recent call last):
File "./tools/train_net.py", line 113, in
Any ideas?
The failing line is inds = np.reshape(inds, (-1, 2)). Because the second dimension of the reshape is 2, you should use only an even number of images in your dataset.
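The reshape pairs up image indices, so it only works when the number of indices is even. This sketch shows the constraint and one possible workaround, which drops one leftover index. pair_indices is hypothetical, and dropping an image is a suggestion, not what the repo does.

```python
import numpy as np


def pair_indices(inds, drop_odd=True):
    """Reshape a 1-D index array into rows of 2, like np.reshape(inds, (-1, 2)).

    np.reshape raises ValueError for an odd length; with drop_odd=True the
    last index is dropped first so that the reshape always succeeds.
    """
    inds = np.asarray(inds)
    if inds.size % 2 == 1:
        if not drop_odd:
            raise ValueError("need an even number of images, got {}".format(inds.size))
        inds = inds[:-1]
    return np.reshape(inds, (-1, 2))
```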
@GeorgiAngelov The tutorial by @deboc uses the ImageNet model VGG_CNN_M_1024.v2.caffemodel. You can get it by following the steps here: https://github.com/deboc/py-faster-rcnn#download-pre-trained-imagenet-models
@ednarb29 wrote:
"First, I would suggest starting training and testing with a very small dataset (100 images and 1k iterations), so that you can debug training and testing quickly.
Does the problem occur while the dataset is being created or during training?"
Thanks, I had the same problem:
overlaps = entry['max_overlaps']
KeyError: 'max_overlaps'
I deleted the cache file and it is now running.
@ednarb29
What tool should I use to create imdb files?
@ednarb29 , removing cache file fixed problem for me regarding the max_overlaps
@ArturoDeza What tool/code did you use to create the imdb file for training?
@VanitarNordic, I don't think there's a quick recipe for that. I've been following this setup: https://github.com/smallcorgi/Faster-RCNN_TF
You will have to modify some lines of code in factory.py, copy the pascal_voc.py file to your own my_dataset.py file, and modify the lines of code that define the number of training classes. You also have to annotate all your images with .xml files.
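The part of pascal_voc.py you usually adapt is the class list. It determines the number of classes and maps the <name> tag of each XML object to a label index, with index 0 reserved for background. This is a minimal sketch of that part only. The class list is hypothetical, and this is not a drop-in replacement for the repo's dataset class.

```python
import xml.etree.ElementTree as ET

# Hypothetical class list; '__background__' must stay at index 0.
CLASSES = ('__background__', 'person', 'car')
CLASS_TO_IND = dict(zip(CLASSES, range(len(CLASSES))))
NUM_CLASSES = len(CLASSES)


def load_gt_classes(xml_text):
    """Return the label index of every <object> in a Pascal-VOC-style XML string."""
    root = ET.fromstring(xml_text)
    labels = []
    for obj in root.findall('object'):
        name = obj.find('name').text.lower().strip()
        labels.append(CLASS_TO_IND[name])  # KeyError means an unknown class name
    return labels
```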
@ArturoDeza Thanks. Actually, I have annotated files, but I'm stuck at creating the imdb :-(
@VanitarNordic What is the error you've been getting? You should create a new issue with the error you get when you run the end2end training script, that way we can be more helpful.
@ArturoDeza No, but what I don't understand is this: when we have a custom dataset, how does the model get trained on it? The end-to-end training does not have a dataset input parameter.
Hi! I am getting the following error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "./tools/train_faster_rcnn_alt_opt.py", line 129, in train_rpn
    max_iters=max_iters)
  File "/home/siplab/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 160, in train_net
    model_paths = sw.train_model(max_iters)
  File "/home/siplab/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 101, in train_model
    self.solver.step(1)
  File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 144, in forward
    blobs = self._get_next_minibatch()
  File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 63, in _get_next_minibatch
    return get_minibatch(minibatch_db, self._num_classes)
  File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py", line 22, in get_minibatch
    assert(cfg.TRAIN.BATCH_SIZE % num_images == 0), \
ZeroDivisionError: integer division or modulo by zero
Can anyone help me with that?
I'm using the INRIA Person dataset. After running the command below
./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml
I got an error:
File "./tools/train_faster_rcnn_alt_opt.py", line 62
print 'Loaded dataset {:s}
for training'.format(imdb.name)
^
SyntaxError: invalid syntax
Can you please tell me the reason for this error?
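No one answered this in the thread, so here are two guesses. First, py-faster-rcnn is Python 2 code (the other tracebacks here come from /usr/lib/python2.7). In Python 2, print 'text' is a valid statement, but in Python 3 it is a SyntaxError. Second, the pasted output shows the string literal split across two lines, which is a SyntaxError in either version, so check that line 62 was not broken while editing. If you must run under Python 3, use the function-call form. With a single argument, it prints the same thing in Python 2.7 and 3. report_dataset is a hypothetical wrapper for illustration.

```python
def report_dataset(name):
    # Single-line string literal; print(...) with one argument works in 2.7 and 3.
    print('Loaded dataset {:s} for training'.format(name))


report_dataset('inria_train')
```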
Hi everyone: I want to train Faster R-CNN on my own dataset. Because Faster R-CNN does not use the selective search method, I commented out the selective search code. However, there are still some errors about roidb and so on. Can anybody help me? I am not quite sure what I should do to train Faster R-CNN. It is a little complicated for me. Thanks so much!