Check failed: error == cudaSuccess (2 vs. 0) out of memory

unsky / FPN

Feature Pyramid Networks for Object Detection

524 stars, 263 forks

Check failed: error == cudaSuccess (2 vs. 0) out of memory #34

Closed by hmjbuaa 6 years ago

hmjbuaa commented 6 years ago

I hit the following error when using FPN to train on my own dataset:

    Loading pretrained model weights from data/pretrained_model/ResNet50.v2.caffemodel
    I0126 16:08:01.051177 10326 net.cpp:816] Ignoring source layer pool5
    I0126 16:08:01.051213 10326 net.cpp:816] Ignoring source layer fc1000
    I0126 16:08:01.051214 10326 net.cpp:816] Ignoring source layer prob
    Solving...
    F0126 16:08:02.152665 10326 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
    Check failure stack trace:
    ./experiments/scripts/FP_Net_end2end.sh: line 57: 10326 Aborted (core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${PT_DIR}/${NET}/FP_Net_end2end/solver.prototxt --weights data/pretrained_model/ResNet50.v2.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/FP_Net_end2end.yml ${EXTRA_ARGS}

I tried reducing C.TRAIN.IMS_PER_BATCH and C.TRAIN.BATCH_SIZE in lib/fast_rcnn/config.py, but it didn't help. How can I reduce the batch size to lower the memory usage?

unsky commented 6 years ago

@hmjbuaa Reduce the image size, as in https://github.com/unsky/FPN/issues/33
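To see why a smaller image size lowers memory use, here is a minimal sketch of the usual Fast R-CNN style resize rule: the shorter side is scaled to a target size unless that would push the longer side past a maximum. The function names, the 480x640 input, and the two (target, max) pairs are illustrative assumptions, not code or values taken from this repository. Activation memory grows roughly with the number of pixels, so the pixel ratio gives a rough idea of the saving.

```python
def rescaled_size(height, width, target_size, max_size):
    """Return (new_height, new_width, scale) for a Fast R-CNN style resize.

    The shorter side is scaled to target_size unless that would push the
    longer side past max_size, in which case the longer side is capped.
    """
    short_side = min(height, width)
    long_side = max(height, width)
    scale = float(target_size) / short_side
    if round(scale * long_side) > max_size:
        scale = float(max_size) / long_side
    return int(round(height * scale)), int(round(width * scale)), scale


def pixel_ratio(height, width, old, new):
    """Ratio of resized pixel counts between two (target_size, max_size) settings."""
    old_h, old_w, _ = rescaled_size(height, width, *old)
    new_h, new_w, _ = rescaled_size(height, width, *new)
    return (new_h * new_w) / float(old_h * old_w)


if __name__ == "__main__":
    # Hypothetical 480x640 input and hypothetical size settings.
    print(rescaled_size(480, 640, 768, 1280))
    print(rescaled_size(480, 640, 448, 512))
    print(pixel_ratio(480, 640, (768, 1280), (448, 512)))
```

If the out-of-memory error persists after shrinking both values, it is worth confirming that the training script actually reads the edited settings, since a separate cfg file passed with --cfg may override config.py.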

jjy201314 commented 6 years ago

When I train, I run into a problem:

    Traceback (most recent call last):
      File "/home/hncs/liuwei/FPN1/tools/../lib/rpn/proposal_layer.py", line 14, in <module>
        from fast_rcnn.nms_wrapper import nms
      File "/home/hncs/liuwei/FPN1/tools/../lib/fast_rcnn/nms_wrapper.py", line 9, in <module>
        from nms.gpu_nms import gpu_nms
    ImportError: No module named gpu_nms
    Traceback (most recent call last):
      File "./tools/train_net.py", line 113, in <module>
        max_iters=args.max_iters)
      File "/home/hncs/liuwei/FPN1/tools/../lib/fast_rcnn/train.py", line 143, in train_net
        pretrained_model=pretrained_model)
      File "/home/hncs/liuwei/FPN1/tools/../lib/fast_rcnn/train.py", line 46, in __init__
        self.solver = caffe.SGDSolver(solver_prototxt)
    SystemError: NULL result without error in PyObject_Call

But I am running on the GPU and the GPU ID is correct, so I don't know what's wrong. Thank you very much! @hmjbuaa @unsky

unsky commented 6 years ago

Build the code under lib first.
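A quick way to confirm that the build step worked is to try importing the compiled modules before starting training. The sketch below is a generic check, not part of this repository; the helper name is hypothetical, and the module name is the one from the ImportError in this thread. It assumes lib is already on sys.path when it runs.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Module name taken from the ImportError in this thread.
    compiled = ["nms.gpu_nms"]
    for name in missing_modules(compiled):
        print("not importable (was lib built?):", name)
```

An empty result means every listed module was found; any name that is printed points at an extension that still needs to be compiled.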

jjy201314 commented 6 years ago

Solved, thanks @unsky

jjy201314 commented 6 years ago

But now I hit a roi_data_layer problem:

    Traceback (most recent call last):
      File "./tools/train_net.py", line 113, in <module>
        max_iters=args.max_iters)
      File "/home/hncs/liuwei/FPN1/tools/../lib/fast_rcnn/train.py", line 146, in train_net
        model_paths = sw.train_model(max_iters)
      File "/home/hncs/liuwei/FPN1/tools/../lib/fast_rcnn/train.py", line 86, in train_model
        self.solver.step(1)
      File "/home/hncs/liuwei/FPN1/tools/../lib/roi_data_layer/layer.py", line 144, in forward
        blobs = self._get_next_minibatch()
      File "/home/hncs/liuwei/FPN1/tools/../lib/roi_data_layer/layer.py", line 63, in _get_next_minibatch
        return get_minibatch(minibatch_db, self._num_classes)
      File "/home/hncs/liuwei/FPN1/tools/../lib/roi_data_layer/minibatch.py", line 29, in get_minibatch
        im_blob, im_scales = _get_image_blob(roidb, random_scale_inds)
      File "/home/hncs/liuwei/FPN1/tools/../lib/roi_data_layer/minibatch.py", line 142, in _get_image_blob
        cfg.TRAIN.MAX_SIZE,cfg.TRAIN.IMAGE_STRIDE)
    TypeError: prep_im_for_blob() takes exactly 4 arguments (5 given)

I don't know how to solve this. @unsky

unsky commented 6 years ago

Are you using my code?

jjy201314 commented 6 years ago

Yes.

unsky commented 6 years ago

Try pulling the latest code, and delete the old copy.

jjy201314 commented 6 years ago

I just re-downloaded the code from https://github.com/unsky/FPN and recompiled it, but I hit the same problem.

unsky commented 6 years ago

Are you sure the code in lib/utils is mine?

jjy201314 commented 6 years ago

Oh my, it isn't. I had a build problem at the time, so I used the faster-rcnn version instead. I've switched it back now. Thank you, thank you!
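The TypeError in this thread is what a mixed-up lib/utils looks like: the caller passes five arguments, but the function that was actually imported takes four. A small check like the one below can catch such a mismatch before training starts. This is a sketch under assumptions: the two stand-in functions and their parameter names are hypothetical, and only the argument counts (4 versus 5) come from the error message.

```python
import inspect


def accepts_n_positional(func, n):
    """Return True if func can be called with exactly n positional arguments."""
    try:
        inspect.signature(func).bind(*([None] * n))
    except TypeError:
        return False
    return True


# Hypothetical stand-ins for the two versions of prep_im_for_blob; the
# parameter names are assumptions, only the argument counts come from the error.
def prep_four(im, pixel_means, target_size, max_size):
    return im


def prep_five(im, pixel_means, target_size, max_size, stride):
    return im


if __name__ == "__main__":
    for func in (prep_four, prep_five):
        print(func.__name__, "accepts 5 arguments:", accepts_n_positional(func, 5))
```

In a real checkout the same helper could be pointed at the imported prep_im_for_blob itself, and printing inspect.getsourcefile for that function shows which copy of the file Python actually loaded.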

zqdeepbluesky commented 6 years ago

@hmjbuaa @unsky Hi, I ran into the same problem as you did. How did you fix it? I tried changing config.py to reduce the batch size from 256 to 2, but it didn't work. I also tried changing the image size from [768,1280] to [448,512], and that didn't work either. I don't know how to fix this. Can you help me, please? Thanks so much!