Py-Faster-Rcnn using Resnet

hoticevijay commented 8 years ago

Based on https://github.com/rbgirshick/py-faster-rcnn/issues/62 I am trying to train my own dataset using resnet+py-faster-rcnn (using @siddharthm83 train.txt). I am getting the following error.

I0321 07:29:44.037149 1892 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet.caffemodel
I0321 07:29:44.240974 1892 net.cpp:816] Ignoring source layer fc1000
I0321 07:29:44.241065 1892 net.cpp:816] Ignoring source layer prob
Solving...
F0321 07:29:45.412804 1892 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
* Check failure stack trace: *
Aborted (core dumped)

I am using an AWS instance. I was able to train resnet-50 (without fast-rcnn) on the same instance with the same dataset, but when I tried py-faster-rcnn I got this error. I know it is probably due to insufficient GPU memory, so I changed the batch size in deploy.prototxt (iter_size: 1), but I still get the error. Can someone help me out?

abhirevan commented 8 years ago

Experiment with different batch sizes in the yml file or in config.py
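Besides the batch size, the input scale is another memory knob in the same config. The following is a minimal sketch, assuming the `TRAIN.SCALES` / `TRAIN.MAX_SIZE` semantics of `lib/fast_rcnn/config.py` (the shorter side is resized to the scale and the longer side is capped at the max size). The smaller values used below are illustrative only, not recommended settings, and the function names are made up for this example:

```python
def resized_shape(height, width, target_size=600, max_size=1000):
    """Return (height, width) after the shorter-side resize with a longer-side cap."""
    short_side = min(height, width)
    long_side = max(height, width)
    scale = float(target_size) / short_side
    # Cap the scale so the longer side never exceeds max_size.
    if round(scale * long_side) > max_size:
        scale = float(max_size) / long_side
    return int(round(height * scale)), int(round(width * scale))


def relative_pixels(height, width, target_size, max_size):
    """Input pixel count relative to the 600/1000 setting assumed as default."""
    h0, w0 = resized_shape(height, width)
    h1, w1 = resized_shape(height, width, target_size, max_size)
    return (h1 * w1) / float(h0 * w0)


if __name__ == "__main__":
    print(resized_shape(480, 640))             # assumed default scale
    print(resized_shape(480, 640, 400, 667))   # hypothetical smaller scale
    print(relative_pixels(480, 640, 400, 667))
```

Activation memory in the convolutional layers grows roughly with the number of input pixels, so a smaller scale is one way to trade accuracy for memory.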

happyharrycn commented 8 years ago

ResNet 101 on Caffe requires >10G of GPU memory for an input with VGA resolution (640*480) during training (even when all conv1_x/conv2_x/conv3_x layers are fixed). A couple of memory optimizations can be done easily, though.

- Always compile Caffe with cuDNN to avoid the internal buffer for conv layers.
- Merge Batchnorm + Scale into a single Scale layer. The current implementation of Batchnorm in Caffe takes way too much memory. Or, more radically, you can fold conv + batchnorm + scale into a single conv, at the cost of decreased detection performance (as you can then no longer freeze the batchnorm+scale layers).
- Use the in-place eltwise sum from the PR here
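The Batchnorm + Scale merge mentioned above can be sketched per channel as follows. This is a minimal pure-Python sketch, not the code anyone in this thread used. It assumes frozen (test-time) statistics, Caffe's convention that the stored mean and variance blobs are divided by a stored scale factor, and an epsilon of 1e-5; the function names are made up for illustration:

```python
import math


def merge_bn_into_scale(mean_blob, var_blob, scale_factor, gamma, beta, eps=1e-5):
    """Fold frozen BatchNorm statistics into Scale-layer parameters.

    BatchNorm (test time): x_hat = (x - mean) / sqrt(var + eps)
    Scale:                 y     = gamma * x_hat + beta
    Merged Scale:          y     = new_gamma * x + new_beta
    """
    # Assumption: stored blobs are running sums that must be normalized.
    inv = 0.0 if scale_factor == 0 else 1.0 / scale_factor
    new_gamma, new_beta = [], []
    for m, v, g, b in zip(mean_blob, var_blob, gamma, beta):
        mean = m * inv
        std = math.sqrt(v * inv + eps)
        new_gamma.append(g / std)
        new_beta.append(b - g * mean / std)
    return new_gamma, new_beta


def bn_then_scale(x, mean, var, gamma, beta, eps=1e-5):
    """Reference: BatchNorm followed by Scale for one value of one channel."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


if __name__ == "__main__":
    g, b = merge_bn_into_scale([2.0, -1.0], [8.0, 0.5], 2.0, [1.5, 0.3], [0.1, -0.2])
    x = 0.7
    print(g[0] * x + b[0], bn_then_scale(x, 1.0, 4.0, 1.5, 0.1))
```

The merged layer is a plain per-channel affine transform, so it only matches the original pair when the BatchNorm statistics are frozen.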

A more fundamental solution is to allow Caffe to reuse the gradients (diff) for each blob. One can safely overwrite the diff of a blob once the weights of all layers involving that blob have been updated. And that's the way to train ResNet 101 using a batch size of 32 on 12G memory as mentioned in the original paper.

kl2005ad commented 8 years ago

@happyharrycn I found something interesting when training resnet + faster rcnn on my own dataset. If I fix all batchNorm+scale layers on conv1 ~ conv4 and only allow updates on the conv layers, the resulting model is far from what the paper claims. If I allow batchNorm+scale to update, it gets much better performance (close to vgg16). But faster rcnn only uses one image per batch, so it is not supposed to update batchNorm+scale properly. What am I missing?

happyharrycn commented 8 years ago

@kl2005ad ResNet for detection has been reproduced at multiple sites, and I am not sure why you are getting worse performance based on your description here. The last time I tried on VOC, it worked better than they claimed in the paper :) Here are some implementation details I used for training.

- Freeze all batchNorm+Scale layers (from conv1_x ~ conv5_x)
- Optional: freeze conv1_x to conv3_x to save some memory / time
- Put a ROI pooling layer at the end of conv4_x, use conv5_x as the classifier (similar to the FC layers in VGG16) and attach two prediction branches (softmax with loss for classification and smooth L1 loss for box regression) after the average pooling.
- You might want to change the ROI pooling from 14 * 14 -> 7 * 7 and increase the resolution in conv5_x (change the downsample layers by setting their stride to 1). This is not exactly equivalent to the original paper but helps to detect smaller objects.
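The spatial arithmetic behind the last point can be checked with the standard convolution output-size formula. This is a small sketch, assuming the downsampling layer in conv5_x is a 1x1 convolution with stride 2 and no padding (verify this against your own prototxt); the helper names are made up for illustration:

```python
def conv_out_size(in_size, kernel, stride, pad=0):
    """Caffe-style convolution output size along one spatial dimension."""
    return (in_size + 2 * pad - kernel) // stride + 1


def conv5_output(roi_pool_size, conv5_stride):
    """Spatial size after the single downsampling 1x1 conv assumed in conv5_x."""
    return conv_out_size(roi_pool_size, kernel=1, stride=conv5_stride)


if __name__ == "__main__":
    print(conv5_output(14, 2))  # 14x14 ROI pooling, stride-2 downsample
    print(conv5_output(7, 1))   # 7x7 ROI pooling, stride set to 1
```

Under these assumptions both configurations hand a 7x7 map to the average pooling, which is why the pooling size and the stride are changed together.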

banxiaduhuo commented 8 years ago

@kl2005ad @happyharrycn Hi! I am trying to train resnet50 with faster-rcnn, and I got a very low result on voc2007, about 0.47, even lower than the ZF model's 0.62. What result did you get on voc2007? I didn't freeze the batchNorm+Scale layers.

Is my solver correct?

base_lr: 0.001
lr_policy: "multistep"
gamma: 0.1
stepvalue: 300000
stepvalue: 500000
display: 20
momentum: 0.9
weight_decay: 0.0001
snapshot: 0

Thank you very much!
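For reference, the learning rate that a `multistep` solver like the one above produces can be sketched in a few lines. This is a minimal sketch of the policy as documented for Caffe (the rate is multiplied by `gamma` each time a `stepvalue` is reached); the function name is made up for illustration:

```python
def multistep_lr(iteration, base_lr, gamma, stepvalues):
    """Learning rate under a multistep policy at a given iteration."""
    steps_passed = sum(1 for s in stepvalues if iteration >= s)
    return base_lr * gamma ** steps_passed


if __name__ == "__main__":
    for it in (0, 299999, 300000, 500000):
        print(it, multistep_lr(it, 0.001, 0.1, [300000, 500000]))
```

With these values the rate stays at 0.001 for the first 300K iterations. The solver as posted lists no `max_iter`, so the total training length is not visible from it.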

happyharrycn commented 8 years ago

I was getting mAP ~0.73-0.74 on VOC07 test when using ResNet101 (trained on VOC07 trainval) with 60K iterations. Training details can be found in my previous post in this thread. From a quick look at your solver file, I think you probably used too many iterations (500K is way too many).

banxiaduhuo commented 8 years ago

@happyharrycn Thank you~ I got mAP 0.65 on VOC2007 using ResNet50 with 70k iterations, following your implementation details to fix all batchNorm+scale and conv_x layers. I will try ResNet101. It seems BN must be fixed when fine-tuning, right? And how do I merge Batchnorm + Scale and let Caffe reuse the gradients for each blob? Can you give me some guidance?

c149028 commented 8 years ago

@banxiaduhuo, maybe a BatchNorm layer acts as a Scale layer during test time, and we can merge two consecutive Scale layers into one. @happyharrycn, thank you for sharing your knowledge. It's really helpful! I have a question about reusing the diff blobs. Does that mean we need to modify caffe so the backprop (gradients computation and parameters update) processes layer-by-layer?

ice-pice commented 8 years ago

@happyharrycn I am trying to fine-tune resnet-50 and faster-rcnn for the COCO dataset as described in Kaiming's paper, using a learning rate of 0.001 for 240K iterations and 0.0001 for the next 80K iterations (using the provided end2end training). It appears that this is far too many iterations, because the val AP score starts decreasing from iteration 150K onwards. Can you share some insights on how many iterations are required to train on the 80K images of the COCO dataset? Thanks!

happyharrycn commented 8 years ago

@ice-pice I think 240K + 80K is actually not enough iterations for training on COCO. I have used 500K iterations for 120K images (COCO train + val). Have you tried keeping the training running for the full 320K iterations and checking whether the AP keeps decreasing after 150K?

ice-pice commented 8 years ago

@happyharrycn I was skipping the average pooling from ResNet, which was leading to the poor results after 150K iterations that I mentioned. I found my mistake when I generated the model using https://github.com/XiaozhiChen/resnet-generator. I'd presume that without average pooling there are too many parameters between the last convolutional layer and the first fully connected layer, and these are difficult to calibrate by fine-tuning.
Following are my network configuration and results. Can you please point out which of these could explain why my AP scores are below the reported ones?
- Model : ResNet-50 + faster-rcnn
- Train/Test set : COCO train/validation set
- Iterations : 320K (240K with lr = 0.001 and 80K with lr = 0.0001)
- Mini-batch : 1 image generating 256 proposals (as mentioned in faster-rcnn paper)
- Detection network initialization : ResNet-50 Imagenet trained weights
- RPN network initialization : Random initialization
- Results : AP (IoU=0.5) scores at different iterations for val set

I want to note that my 320K iterations process 320K images. In the 500K iterations you mentioned, do you process 500K images or 500*8K?

Thanks!
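The iteration counts in this exchange are easier to compare as passes over the data. This is a back-of-the-envelope sketch, assuming one image per iteration (as stated above for the 320K run) and ignoring any flipped-image augmentation. Whether the 500K run also used one image per iteration is exactly the open question, so the last line is purely hypothetical:

```python
def epochs(iterations, images_per_iteration, dataset_size):
    """Number of passes over the dataset for a given training schedule."""
    return iterations * images_per_iteration / float(dataset_size)


if __name__ == "__main__":
    print(epochs(320000, 1, 80000))    # 320K iterations on ~80K COCO train images
    print(epochs(500000, 1, 120000))   # 500K iterations on ~120K train + val images
    print(epochs(500000, 8, 120000))   # hypothetical: 8 images per iteration
```

If both runs use one image per iteration, they make a similar number of passes over their respective datasets, so the difference would lie elsewhere.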

CrossLee1 commented 8 years ago

@ice-pice I also tried to train resnet50+faster rcnn on the COCO dataset. However, training is very slow, about 4s / iter, and the loss does not seem to decrease at all.

What is your training speed? Could you share your log file so I can see the change of loss? Thanks~

ice-pice commented 8 years ago

@CrossLee1 It takes around 1s/iter for me to train resnet50 + faster-rcnn on an NVidia TitanX. 4s/iter seems like too much; it could be happening because you are adding unnecessary parameters to the architecture. You can validate your prototxt by comparing it with a generated prototxt from https://github.com/XiaozhiChen/resnet-generator.

From my observation change of loss is not reflective of convergence in this case because of a mini-batch size of 1. After 50K iterations or so, the loss value fluctuates in the same interval until 320K. Link to log for the model I have trained in my previous comment: https://drive.google.com/file/d/0B4AOlDvVIP8RMUxQS0M5dWJEYjg/view?usp=sharing

I am still making changes, and if I reach the baseline I can share the prototxt with you if you'd like. Cheers!

CrossLee1 commented 8 years ago

@ice-pice Thanks for your reply~ I will use the generated prototxt from https://github.com/XiaozhiChen/resnet-generator and try again.

Wishing you good results~

CrossLee1 commented 8 years ago

@ice-pice Wonder if you have succeeded in the training of resnet50 + py-faster-rcnn and reached the baseline? Hope to get your results~

cervantes-loves-ai commented 8 years ago

How can I solve this?

WARNING: Logging before InitGoogleLogging() is written to STDERR
W0718 17:01:06.813407 26562 _caffe.cpp:122] DEPRECATION WARNING - deprecated use of Python interface
W0718 17:01:06.813459 26562 _caffe.cpp:123] Use this instead (with the named "weights" parameter):
W0718 17:01:06.813477 26562 _caffe.cpp:125] Net('/home/rvlab/Documents/fast-rcnn-master/models/VGG16/test.prototxt', 1, weights='/home/rvlab/Documents/fast-rcnn-master/data/fast_rcnn_models/vgg16_fast_rcnn_iter_40000.caffemodel')
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 392:21: Message type "caffe.LayerParameter" has no field named "roi_pooling_param".
F0718 17:01:06.815565 26562 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /home/rvlab/Documents/fast-rcnn-master/models/VGG16/test.prototxt
* Check failure stack trace: *
Aborted (core dumped)

ice-pice commented 8 years ago

@CrossLee1 With resnet50 + py-faster-rcnn, I am able to achieve 45% mAP. My guess is that if I use resnet101, I can reach closer to the 48% baseline.

CrossLee1 commented 8 years ago

@ice-pice glad to hear that~ Could you provide your train_val.prototxt for reference or your email so we can discuss it in detail~ Thanks a lot~

CrossLee1 commented 8 years ago

@sarkeribrahim did you run this step?

Build the Cython modules:

`cd $FRCN_ROOT/lib && make`

ice-pice commented 8 years ago

@CrossLee1 My resnet-50 + faster-rcnn prototxt.

yjn870 commented 8 years ago

@ice-pice Can I get the test.prototxt file? The test.prototxt produced by the resnet generator is not compatible.

zimenglan-sysu-512 commented 8 years ago

@happyharrycn can you share your train&test prototxt of "resnet + faster rcnn"? thanks.

banxiaduhuo commented 8 years ago

@ice-pice The ResNet50 and ResNet101 models that I trained both reach close to 44.4 mAP. How about yours? I wonder if you know how to implement the methods from He's paper, such as box refinement, context learning and multi-scale testing. My implementations do not seem to work well. Thank you so much!

ice-pice commented 8 years ago

@banxiaduhuo I did implement box refinement strategy and it gave me a 1.3% boost as compared to 2% mentioned in the paper. Did not get a chance to try the other 2 strategies. You should read the details very carefully in the paper, I practically followed everything line by line. Hope it helps!

ice-pice commented 8 years ago

@yjn870 : You need to remove the topmost data layer as it is not required while testing. Take some hints from https://github.com/rbgirshick/py-faster-rcnn/blob/master/models/coco/VGG16/faster_rcnn_end2end/test.prototxt

Remove all the layers which are unnecessary while testing.

rajiv235 commented 8 years ago

Hello. I am trying to run Resnet101 with Faster RCNN on an AWS 4GB K520 GPU. I realized that this GPU memory won't be enough, and I got the same error:

I0321 07:29:44.037149 1892 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet.caffemodel
I0321 07:29:44.240974 1892 net.cpp:816] Ignoring source layer fc1000
I0321 07:29:44.241065 1892 net.cpp:816] Ignoring source layer prob
Solving...
F0321 07:29:45.412804 1892 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
* Check failure stack trace: *
Aborted (core dumped)

I wanted to ask whether an AWS g2.8xlarge instance (4 GPUs with 4GB each) would do the job. Has anyone tried that?

Thanks. Any help will be appreciated.

EthanReid commented 8 years ago

Hi everyone. I am having trouble with the makefile and with finding ./tools/demo.py. For the makefile it says no targets found. I installed it in cuda/fast-rcnn/lib and cuda. When I try to run ./tools/demo.py it says no directory found.

zimenglan-sysu-512 commented 8 years ago

@ice-pice @banxiaduhuo can you share how you do bbox refinement?

zimenglan-sysu-512 commented 8 years ago

@happyharrycn @banxiaduhuo I tried resnet50 with 80k iterations and got 73.59% mAP on pascal voc 07.

spandanagella commented 8 years ago

Did anyone try ResNet-50 or a deeper model with COCO classes using py-faster-rcnn?

KeyKy commented 8 years ago

I tried the ResNet-50 prototxt from ice-pice and the ResNet-50 prototxt from siddharthm83, with base_lr=0.001 (step=300000) and total_iters=490000. However, I only get mAP 0.265 (IoU=0.5) on coco. Does anyone have a ResNet-50+py-faster-rcnn model pretrained on coco?

@ice-pice @banxiaduhuo I cannot reproduce your result. Could you help me?

Eniac-Xie commented 7 years ago

@spandanagella I released an implementation (prototxt file and model weights) of ResNet-101 based faster-rcnn; check this repo

spandanagella commented 7 years ago

@Eniac-Xie Thanks. I am looking for ResNet model trained on COCO object categories. Looks like you have resnet based faster-rcnn for just PASCAL-VOC.

onkarganjewar commented 7 years ago

@KeyKy @ice-pice @banxiaduhuo @Eniac-Xie @zimenglan-sysu-512

I'm trying to train the ResNet-50 model on PASCAL VOC 2007 trainval dataset. I've followed the comments in issue #62. So, I'm using this command to start the training

./tools/train_net.py --gpu 1 --weights data/imagenet_models/ResNet-50-model.caffemodel --imdb voc_2007_trainval --cfg experiments/cfgs/faster_rcnn_end2end.yml --solver models/ResNet-50/faster_rcnn_end2end/solver.prototxt

I'm using the solver/train prototxt files from @twtygqyy's repo

However, I'm getting this error:

Normalizing targets done
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0222 13:59:58.538053 23076 solver.cpp:54] Initializing solver from parameters:
test_iter: 100
test_interval: 1000
base_lr: 0.0001
display: 100
max_iter: 200000
lr_policy: "multistep"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0001
stepsize: 20000
snapshot: 10000
snapshot_prefix: "resnet50_train"
solver_mode: GPU
net: "models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt"
test_initialization: false
I0222 13:59:58.538121 23076 solver.cpp:96] Creating training net from net file: models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 74:26: Message type "caffe.LayerParameter" has no field named "batch_norm_param".
F0222 13:59:58.538242 23076 upgrade_proto.cpp:928] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt
*** Check failure stack trace: ***
Aborted (core dumped)

I'm on the latest commit of faster-rcnn branch of caffe-fast-rcnn

Pardon my lack of knowledge, but would you guys mind helping me resolve this error, please? Appreciate it. Thanks.
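The parse error above says that the Caffe build being loaded does not define `batch_norm_param` on `LayerParameter`, which usually means the fork predates the BatchNorm layer. One way to check a build from Python is to inspect the protobuf descriptor. This is a minimal sketch, assuming pycaffe is importable as `caffe` and exposes `caffe.proto.caffe_pb2`; the helper name is made up for illustration:

```python
def has_layer_field(layer_param_cls, field_name):
    """True if the protobuf message class defines the given field."""
    return field_name in layer_param_cls.DESCRIPTOR.fields_by_name


if __name__ == "__main__":
    try:
        from caffe.proto import caffe_pb2
    except ImportError:
        print("pycaffe is not importable from this environment")
    else:
        # Both field names are taken from the error messages in this thread.
        for name in ("batch_norm_param", "roi_pooling_param"):
            print(name, has_layer_field(caffe_pb2.LayerParameter, name))
```

If a field prints `False`, the prototxt uses a layer parameter that this particular Caffe build does not know about, and a different fork or a rebuild is needed.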

joyivan commented 7 years ago

make it

twtygqyy commented 7 years ago

Here is the caffe-fast-rcnn with upstream caffe https://github.com/twtygqyy/caffe-fast-rcnn-upstream

646677064 commented 7 years ago

@onkarganjewar @twtygqyy @Eniac-Xie @happyharrycn I tried R-FCN + ResNet-101 (from Jifeng Dai / Orpine, https://github.com/Orpine/py-R-FCN). Why does R-FCN-ohem take 5G of GPU memory, while faster-rcnn+resnet-101-bn-scale-merged-ohem (from @Eniac-Xie) takes 11G? I don't know what makes the difference. Can somebody help, please?

murphypei commented 7 years ago

@646677064 Because of the fully connected layers; they hold most of the parameters.

Dareschoels commented 7 years ago

I'm using the MATLAB 2017a student version with a GTX 1060 GPU (6 GB). I have a few questions related to MATLAB and hope to get the answers I need. Thanks.

1. Are there any special requirements if I want to build my own set of images to retrain ResNet or AlexNet, and which one is better?
2. When I retrain a network for my purpose, what is the optimal number of epochs (I guess that depends on my dataset)?
3. How much GPU memory does the Faster RCNN object detector require, and do I have to label images manually or is there a faster way?
4. For implementing a Faster RCNN detector on video for real-time detection, any tips about what tool to use?

whmin commented 7 years ago

@646677064 I have tried faster-rcnn+resnet-50-bn-scale-merged-ohem (from @Eniac-Xie), but there is an error when I run "./experiments/scripts/faster_rcnn_end2end.sh 0 ResNet-50 pascal_voc", like this: (screenshot from 2017-11-08 21-08-43)

Do you know how to solve it? Can you share your faster-rcnn+resnet-101-bn-scale-merged-ohem (from @Eniac-Xie) test.prototxt file with me? Thank you very much!!!

nnop commented 6 years ago

Have you found the reason for the slow training problem? I have the same issue: about 4s/iter on a Titan X. @CrossLee1

YoungMagic commented 6 years ago

Same thing happened to me! Any idea yet? @nnop

mantou22 commented 6 years ago

Excuse me. After training my own model, I ran demo.py with it to detect objects in images. When the image is very large (5000 x 3000 pixels), the output is entirely white, including the image itself. If the image is not too large, there is no problem. What is the reason?

rbgirshick / py-faster-rcnn

Py-Faster-Rcnn using Resnet #122