Open hoticevijay opened 8 years ago
Experiment with different batch sizes in the yml file or in config.py
ResNet 101 on Caffe requires >10G GPU memory for an input with VGA resolution (640*480) during training (even when you fix all conv1_x/conv2_x/conv3_x layers). A couple of memory optimizations can be done easily though.
A more fundamental solution is to allow Caffe to reuse the gradients (diff) for each blob. One can safely overwrite the diff of a blob once the weights of all layers that use the blob have been updated. And that's the way to train ResNet 101 using a batch size of 32 on 12G memory as mentioned in the original paper.
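A rough way to see why reusing diff buffers helps is sketched below. This is not Caffe code, and it assumes a plain chain of layers (ResNet's skip connections are ignored), so that during the backward pass of one layer only the diff of its top blob and the diff of its bottom blob need to be alive. The function names and blob sizes are made up for illustration.

```python
def diff_memory_without_reuse(blob_sizes):
    """Every blob keeps its own diff buffer for the whole iteration."""
    return sum(blob_sizes)


def diff_memory_with_reuse(blob_sizes):
    """Peak diff memory when a diff buffer is recycled as soon as it is consumed.

    Assumption: a linear chain, where the layer between blob i and blob i + 1
    reads the diff of blob i + 1 and writes the diff of blob i, so at most two
    neighbouring diffs are alive at the same time.
    """
    if len(blob_sizes) < 2:
        return sum(blob_sizes)
    return max(a + b for a, b in zip(blob_sizes, blob_sizes[1:]))


# Hypothetical activation sizes (in MB) for a five-blob chain.
sizes_mb = [400, 300, 200, 100, 50]
print(diff_memory_without_reuse(sizes_mb))  # 1050
print(diff_memory_with_reuse(sizes_mb))     # 700
```

In this toy model the saving grows with depth, because the sum keeps growing while the peak is bounded by the two largest neighbouring blobs.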
@happyharrycn I found something interesting when training resnet + faster rcnn on my own dataset. If I fix all batchNorm+scale layers on conv1 ~ conv4 and only allow updates on conv layers, the resulting model is far from what the paper claims. If I allow batchNorm+scale to update, it gets much better performance (close to vgg16). But faster rcnn only uses one image per batch, so it should not be able to update batchNorm+scale properly. What am I missing?
@kl2005ad ResNet for detection has been reproduced at multiple sites, and I am not sure why you are getting worse performance based on your description here. The last time I tried on VOC, it worked better than they claimed in the paper :) Here are some implementation details I used for training.
@kl2005ad @happyharrycn Hi! I try to train resnet50 with faster-rcnn. And I got a very low result on voc2007, about 0.47, even lower than ZF model's 0.62. What's the result you got on voc2007? I didn't freeze the batchNorm+Scale layers.
Is my solver correct?
base_lr: 0.001
lr_policy: "multistep"
gamma: 0.1
stepvalue: 300000
stepvalue: 500000
display: 20
momentum: 0.9
weight_decay: 0.0001
snapshot: 0
Thank you very much!
I was getting mAP ~0.73-0.74 on VOC07 test when using ResNet101 (trained on VOC07 trainval) with 60K iterations. Training details can be found in my previous post in this thread. By a quick look at your solver file, I think you probably had too many iterations (500K is way too much).
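To make the schedule concrete, here is a minimal sketch of how a "multistep" learning-rate policy behaves: the rate is multiplied by gamma each time the iteration count reaches a stepvalue. The function name is hypothetical; the numbers are the ones from the solver quoted above.

```python
def multistep_lr(iteration, base_lr, gamma, stepvalues):
    """Learning rate after multiplying by gamma once per stepvalue reached."""
    steps_passed = sum(1 for s in stepvalues if iteration >= s)
    return base_lr * gamma ** steps_passed


# Values taken from the solver posted in this thread.
for it in (0, 60000, 299999, 300000, 500000):
    print(it, multistep_lr(it, 0.001, 0.1, [300000, 500000]))
```

With these stepvalues the rate stays at 0.001 for the first 300K iterations, so a 60K or 70K run never reaches the first drop unless the stepvalues are lowered to match.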
@happyharrycn Thank you~ I got mAP 0.65 on VOC2007 using ResNet50 with 70k iterations, following your implementation details to fix all batchNorm+scale and conv_x. I will try ResNet101. It seems BN must be fixed when fine-tuning, right? And how do I merge BatchNorm + Scale and let Caffe reuse the gradients for each blob? Can you give me some guidance?
@banxiaduhuo, maybe a BatchNorm layer acts as a Scale layer during test time, and we can merge two consecutive Scale layers into one. @happyharrycn, thank you for sharing your knowledge. It's really helpful! I have a question about reusing the diff blobs. Does that mean we need to modify caffe so the backprop (gradients computation and parameters update) processes layer-by-layer?
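The merge idea can be written out as a small sketch, assuming the usual test-time form y = gamma * (x - mean) / sqrt(var + eps) + beta, which folds into a single per-channel y = a * x + b. The helper names and the statistics below are made up, and this is plain Python rather than a Caffe net-surgery script. Caffe's BatchNorm layer is understood to keep its running mean and variance together with a separate scale-factor blob, so treat any normalization of the stored values as a step to verify against your Caffe version.

```python
import math


def merge_bn_scale(mean, var, gamma, beta, eps=1e-5):
    """Fold BatchNorm (test mode) followed by Scale into y = a * x + b per channel."""
    a = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    b = [bt - ai * m for bt, ai, m in zip(beta, a, mean)]
    return a, b


def bn_then_scale(x, mean, var, gamma, beta, eps=1e-5):
    """Reference: apply the two layers one after the other for a single channel."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


# Hypothetical per-channel statistics for two channels.
mean, var = [0.5, -1.0], [4.0, 0.25]
gamma, beta = [2.0, 0.5], [0.1, -0.3]
a, b = merge_bn_scale(mean, var, gamma, beta)

x = 3.0
for c in range(2):
    reference = bn_then_scale(x, mean[c], var[c], gamma[c], beta[c])
    assert abs(a[c] * x + b[c] - reference) < 1e-9  # merged form matches
```

Because the merged layer has fixed coefficients, it only makes sense when the BatchNorm statistics are frozen, which matches the fine-tuning setup discussed above.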
@happyharrycn I am trying to fine-tune resnet-50 and faster-rcnn for the COCO dataset as mentioned in Kaiming's paper, using a learning rate of 0.001 for 240K iterations and 0.0001 for the next 80K iterations (using the provided end2end training). It appears that this number of iterations is way too much because the val AP score starts decreasing from iteration 150K onwards. Can you share some insights on how many iterations are required to train on 80K images of the COCO dataset? Thanks!
@ice-pice I think 240K + 80K is actually not enough iterations for training on COCO. I have used 500K iterations for 120K images (COCO train + val). Have you tried to keep the training running for the full 320K iterations and check whether the AP keeps decreasing after 150K?
Following is my network configuration and results. Can you please point out what might be causing my AP scores to fall below the reported ones?
I want to note that my 320K iterations process 320K images. In the 500K iterations you mentioned, do you process 500K images or 500*8K?
Thanks!
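The iteration counts in this exchange are easier to compare as passes over the dataset. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the 8 images per iteration in the last line is only the possibility raised in the question above, not a confirmed setting.

```python
def epochs_seen(iterations, images_per_iteration, dataset_size):
    """Number of passes over the dataset for a given iteration budget."""
    return iterations * images_per_iteration / float(dataset_size)


# 320K iterations, 1 image each, on 80K COCO train images.
print(epochs_seen(320000, 1, 80000))  # 4.0

# 500K iterations on 120K images (train + val), under two assumptions.
print(epochs_seen(500000, 1, 120000))  # one image per iteration
print(epochs_seen(500000, 8, 120000))  # hypothetical: 8 images per iteration
```

If the 500K-iteration run used one image per iteration, the two schedules see a similar number of epochs; if it used 8, they differ by almost an order of magnitude, which is why the question matters.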
@ice-pice I also tried to train resnet50+faster rcnn on the COCO dataset. However, the training speed is very slow, about 4s / iter, and the loss does not seem to decrease at all.
What is your training speed? Could you share your log file so I can see the change of loss? Thanks~
@CrossLee1 It takes around 1s/iter for me to train resnet50 + faster-rcnn on an NVidia TitanX. 4s/iter seems like too much; it could be happening because you are adding unnecessary parameters to the architecture. You can validate your prototxt by comparing it with a generated prototxt from https://github.com/XiaozhiChen/resnet-generator.
From my observation, the change of loss is not reflective of convergence in this case because of the mini-batch size of 1. After 50K iterations or so, the loss value fluctuates in the same interval until 320K. Link to the log for the model I trained in my previous comment: https://drive.google.com/file/d/0B4AOlDvVIP8RMUxQS0M5dWJEYjg/view?usp=sharing
I am still making changes, and if I reach the baseline, I can share the prototxt with you if you'd like. Cheers!
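Since a per-iteration loss with a mini-batch of 1 is noisy, one common way to read the trend is to smooth it with a moving average over the values parsed from the training log. The sketch below uses made-up loss values and a hypothetical helper name; Caffe's solver also has an `average_loss` setting that is meant to smooth the displayed loss, though its availability should be checked against the Caffe version in use.

```python
from collections import deque


def moving_average(values, window):
    """Yield the mean of the last `window` values seen so far."""
    recent = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(recent) == window:
            total -= recent[0]  # this value is about to be pushed out
        recent.append(v)
        total += v
        yield total / len(recent)


losses = [1.2, 0.4, 1.9, 0.3, 1.1, 0.5]  # made-up per-iteration losses
print(list(moving_average(losses, 3)))
```

A window of a few hundred iterations is usually needed before any trend shows up in a log like the one linked above.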
@ice-pice Thanks for your reply~ I will use the generated prototxt from https://github.com/XiaozhiChen/resnet-generator and try again.
Wishing you good results~
@ice-pice Wonder if you have succeeded in the training of resnet50 + py-faster-rcnn and reached the baseline? Hope to get your results~
How to solve this?
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0718 17:01:06.813407 26562 _caffe.cpp:122] DEPRECATION WARNING - deprecated use of Python interface
W0718 17:01:06.813459 26562 _caffe.cpp:123] Use this instead (with the named "weights" parameter):
W0718 17:01:06.813477 26562 _caffe.cpp:125] Net('/home/rvlab/Documents/fast-rcnn-master/models/VGG16/test.prototxt', 1, weights='/home/rvlab/Documents/fast-rcnn-master/data/fast_rcnn_models/vgg16_fast_rcnn_iter_40000.caffemodel')
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 392:21: Message type "caffe.LayerParameter" has no field named "roi_pooling_param".
F0718 17:01:06.815565 26562 upgrade_proto.cpp:79] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /home/rvlab/Documents/fast-rcnn-master/models/VGG16/test.prototxt
*** Check failure stack trace: ***
Aborted (core dumped)
@CrossLee1 With resnet50 + py-faster-rcnn, I am able to achieve 45% mAP. My guess is that if I use resnet101, I can reach closer to the 48% baseline.
@ice-pice glad to hear that~ Could you provide your train_val.prototxt for reference or your email so we can discuss it in detail~ Thanks a lot~
@sarkeribrahim did you run this step?
Build the Cython modules:
`cd $FRCN_ROOT/lib`
`make`
@ice-pice Can I get your test.prototxt file? The test.prototxt generated by resnet-generator is not compatible.
@happyharrycn can you share your train&test prototxt of "resnet + faster rcnn"? thanks.
@ice-pice The ResNet50 and ResNet101 models that I trained both reach close to 44.4 mAP. How about yours? I wonder if you know how to implement the methods from He's paper, such as box refinement, context learning and multi-scale testing. My implementation does not seem to work well. Thank you so much!
@banxiaduhuo I did implement box refinement strategy and it gave me a 1.3% boost as compared to 2% mentioned in the paper. Did not get a chance to try the other 2 strategies. You should read the details very carefully in the paper, I practically followed everything line by line. Hope it helps!
@yjn870 : You need to remove the topmost data layer as it is not required while testing. Take some hints from https://github.com/rbgirshick/py-faster-rcnn/blob/master/models/coco/VGG16/faster_rcnn_end2end/test.prototxt
Remove all the layers which are unnecessary while testing.
Hello. I am trying to run ResNet101 with Faster RCNN on an AWS K520 GPU with 4 GB of memory. I realized that this GPU memory won't be enough and got the same error.
I0321 07:29:44.037149 1892 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet.caffemodel
I0321 07:29:44.240974 1892 net.cpp:816] Ignoring source layer fc1000
I0321 07:29:44.241065 1892 net.cpp:816] Ignoring source layer prob
Solving...
F0321 07:29:45.412804 1892 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
Aborted (core dumped)
I wanted to ask whether an AWS g2.8xlarge instance (4 GPUs with 4 GB each) would do the job, and whether anyone has tried that.
Thanks. Any help will be appreciated.
Hi everyone. I am having trouble with the makefile and with finding ./tools/demo.py. For the makefile it says no targets found. I installed it in cuda/fast-rcnn/lib and cuda. When I try to run ./tools/demo.py it says no directory found.
@ice-pice @banxiaduhuo can you share how you do bbox refinement?
@happyharrycn @banxiaduhuo I tried resnet50 with 80k iterations and got 73.59% mAP on pascal voc 07.
Did anyone try ResNet-50 or higher depth with COCO classes using pyfaster-rcnn?
I tried the ResNet-50 prototxt from ice-pice and the ResNet-50 prototxt from siddharthm83, with base_lr=0.001 (step=300000) and total_iters=490000. However, I only get mAP 0.265 (IoU=0.5) on COCO. Does anyone have a ResNet-50 + py-faster-rcnn model pretrained on COCO?
@ice-pice @banxiaduhuo I cannot reproduce your result. Could you help me?
@spandanagella I released an implementation (prototxt file and model weights) of ResNet-101 based faster-rcnn, check this repo
@Eniac-Xie Thanks. I am looking for ResNet model trained on COCO object categories. Looks like you have resnet based faster-rcnn for just PASCAL-VOC.
@KeyKy @ice-pice @banxiaduhuo @Eniac-Xie @zimenglan-sysu-512
I'm trying to train the ResNet-50 model on PASCAL VOC 2007 trainval dataset. I've followed the comments in issue #62. So, I'm using this command to start the training
./tools/train_net.py --gpu 1 --weights data/imagenet_models/ResNet-50-model.caffemodel --imdb voc_2007_trainval --cfg experiments/cfgs/faster_rcnn_end2end.yml --solver models/ResNet-50/faster_rcnn_end2end/solver.prototxt
I'm using the solver/train prototxt files from @twtygqyy repo
However, I'm getting this error:
Normalizing targets done
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0222 13:59:58.538053 23076 solver.cpp:54] Initializing solver from parameters:
test_iter: 100
test_interval: 1000
base_lr: 0.0001
display: 100
max_iter: 200000
lr_policy: "multistep"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0001
stepsize: 20000
snapshot: 10000
snapshot_prefix: "resnet50_train"
solver_mode: GPU
net: "models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt"
test_initialization: false
I0222 13:59:58.538121 23076 solver.cpp:96] Creating training net from net file: models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt
[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 74:26: Message type "caffe.LayerParameter" has no field named "batch_norm_param".
F0222 13:59:58.538242 23076 upgrade_proto.cpp:928] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: models/ResNet-50/faster_rcnn_end2end/ResNet-50-train_val.prototxt
*** Check failure stack trace: ***
Aborted (core dumped)
I'm on the latest commit of the faster-rcnn branch of caffe-fast-rcnn.
Pardon my lack of knowledge, but would you guys mind helping me resolve this error, please? Appreciate it. Thanks.
make it
Here is the caffe-fast-rcnn with upstream caffe https://github.com/twtygqyy/caffe-fast-rcnn-upstream
@onkarganjewar @twtygqyy @Eniac-Xie @happyharrycn I tried R-FCN + ResNet-101 (from jifeng dai Orpine https://github.com/Orpine/py-R-FCN). Why does R-FCN-ohem take 5G of GPU memory, while faster-rcnn+resnet-101-bn-scale-merged-ohem (from @Eniac-Xie) takes 11G? I don't know what makes the difference. Could somebody please explain?
@646677064 Because of the fully connected layers; they hold most of the parameters.
I'm using the matlab 2017a student version with a gtx 1060 GPU (6 GB). I have a few questions related to Matlab and hope I will get the answers I need, thanks. 1. Are there any special requirements if I want to build my own set of images to re-train ResNet or AlexNet, and which one is better?
@646677064 I have tried faster-rcnn+resnet-50-bn-scale-merged-ohem (from @Eniac-Xie), but there is an error when I run "./experiments/scripts/faster_rcnn_end2end.sh 0 ResNet-50 pascal_voc", like this:
Do you know how to solve it? Can you share your faster-rcnn+resnet-101-bn-scale-merged-ohem (from @Eniac-Xie) test.prototxt file with me? Thank you very much!!!
Have you found the cause of the slow training problem? I met the same issue: about 4s/iter on a Titan X. @CrossLee1
Same thing happened to me! Any idea yet? @nnop
Excuse me. After training my own model, I ran demo.py with it to detect objects in images. When the image is very large (5000 x 3000 pixels), the output is entirely white, including the image itself. If the image is not too large, there is no problem. What is the reason?
Based on https://github.com/rbgirshick/py-faster-rcnn/issues/62 I am trying to train my own dataset using resnet+py-faster-rcnn (using @siddharthm83 train.txt). I am getting the following error.
I0321 07:29:44.037149 1892 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/resnet.caffemodel
I0321 07:29:44.240974 1892 net.cpp:816] Ignoring source layer fc1000
I0321 07:29:44.241065 1892 net.cpp:816] Ignoring source layer prob
Solving...
F0321 07:29:45.412804 1892 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
Aborted (core dumped)
I am using an AWS instance. I was able to train resnet-50 (without fast-rcnn) on the same instance with the same dataset. But when I try py-faster-rcnn, I get this error. I know this error could possibly be due to insufficient memory, so I changed the batch size in deploy.prototxt (iter_size: 1). But I am still getting the error. Can someone help me out?
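For out-of-memory errors like this one, the input resolution usually matters more than the batch setting, since detection training here already runs on one image per iteration. The sketch below mirrors the short-side/long-side resizing rule that py-faster-rcnn is understood to apply; the function names are hypothetical, and the defaults of 600 and 1000 are assumed to correspond to TRAIN.SCALES and TRAIN.MAX_SIZE in config.py. Note also that `iter_size` in Caffe is a solver setting for gradient accumulation, so setting it to 1 is not expected to reduce memory.

```python
def resized_shape(height, width, target_size=600, max_size=1000):
    """Scale so the short side equals target_size, capped so the long side <= max_size."""
    short_side, long_side = min(height, width), max(height, width)
    scale = float(target_size) / short_side
    if round(scale * long_side) > max_size:
        scale = float(max_size) / long_side
    return int(round(height * scale)), int(round(width * scale))


def relative_activation_cost(shape, reference=(600, 1000)):
    """Pixel count relative to a reference input; conv activations grow roughly with it."""
    return (shape[0] * shape[1]) / float(reference[0] * reference[1])


print(resized_shape(480, 640))  # (600, 800)

# Hypothetical smaller setting to trade accuracy for memory.
small = resized_shape(480, 640, target_size=400, max_size=667)
print(small, relative_activation_cost(small))
```

Lowering the two values in the yml file or in config.py shrinks every feature map, which is the kind of experiment suggested at the top of this thread.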