ruoqianguo / cascade-rcnn_Pytorch

An implementation of Cascade R-CNN: Delving into High Quality Object Detection.
MIT License
436 stars 107 forks

cudaCheckError() failed : an illegal memory access was encountered #10

Open KevinQian97 opened 6 years ago

KevinQian97 commented 6 years ago

Hi, thanks for your code! I used it for training and that succeeded. However, when testing I ran into a weird error:

CUDA_VISIBLE_DEVICES=0,1,2,3 python test_net.py exp_name --cascade --cuda --mGPUs
"TiTanX" 09:48 09-9月-1
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/sparse/lil.py:16: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from . import _csparsetools
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/sparse/csgraph/__init__.py:167: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._shortest_path import shortest_path, floyd_warshall, dijkstra,\
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/sparse/csgraph/_validation.py:5: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._tools import csgraph_to_dense, csgraph_from_dense,\
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/sparse/csgraph/__init__.py:169: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._traversal import breadth_first_order, depth_first_order, \
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/sparse/csgraph/__init__.py:171: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._min_spanning_tree import minimum_spanning_tree
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/sparse/csgraph/__init__.py:172: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._reordering import reverse_cuthill_mckee, maximum_bipartite_matching, \
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/linalg/basic.py:17: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._solve_toeplitz import levinson
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/linalg/__init__.py:191: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._decomp_update import *
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/special/__init__.py:640: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._ufuncs import *
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/special/_ellip_harm.py:7: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._ellip_harm_2 import _ellipsoid, _ellipsoid_norm
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/optimize/_numdiff.py:8: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from ._group_columns import group_dense, group_sparse
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/interpolate/_bsplines.py:9: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from . import _bspl
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/spatial/__init__.py:94: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from .ckdtree import *
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/spatial/__init__.py:95: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from .qhull import *
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/spatial/_spherical_voronoi.py:18: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from . import _voronoi
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/spatial/distance.py:121: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from . import _hausdorff
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/io/matlab/mio4.py:18: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from .mio_utils import squeeze_element, chars_to_strings
/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/scipy/io/matlab/mio5.py:98: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  from .mio5_utils import VarReader5
Called with args:
Namespace(batch_size=1, cascade=True, cfg_file='cfgs/res101.yml', checkepoch=7, checkpoint=6310, checksession=1, class_agnostic=False, cuda=True, dataset='pascal_voc', exp_name='exp_name', large_scale=False, load_dir='models', mGPUs=True, net='detnet59', parallel_type=0, set_cfgs=None, soft_nms=False, vis=False)
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [4, 8, 16, 32],
 'CROP_RESIZE_WITH_MAX_POOL': False,
 'CUDA': False,
 'DATA_DIR': '/DATACENTER2/qyj/cascade-rcnn_Pytorch-master/data',
 'DEDUP_BOXES': 0.0625,
 'DETNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'EPS': 1e-14,
 'EXP_DIR': 'res101',
 'FEAT_STRIDE': [16],
 'FPN_ANCHOR_SCALES': [32, 64, 128, 256, 512],
 'FPN_ANCHOR_STRIDE': 1,
 'FPN_FEAT_STRIDES': [4, 8, 16, 16, 16],
 'GPU_ID': 0,
 'HAS_MASK': True,
 'MATLAB': 'matlab',
 'MAX_NUM_GT_BOXES': 20,
 'MOBILENET': {'DEPTH_MULTIPLIER': 1.0, 'FIXED_LAYERS': 5, 'REGU_DEPTH': False, 'WEIGHT_DECAY': 4e-05},
 'PIXEL_MEANS': array([[[0.485, 0.456, 0.406]]]),
 'PIXEL_STDS': array([[[0.229, 0.224, 0.225]]]),
 'POOLING_MODE': 'align',
 'POOLING_SIZE': 14,
 'RESNET': {'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/DATACENTER2/qyj/cascade-rcnn_Pytorch-master',
 'TEST': {'BBOX_REG': True, 'HAS_RPN': True, 'MAX_SIZE': 1000, 'MODE': 'nms', 'NMS': 0.3, 'PROPOSAL_METHOD': 'gt', 'RPN_MIN_SIZE': 16, 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 300, 'RPN_PRE_NMS_TOP_N': 6000, 'RPN_TOP_N': 5000, 'SCALES': [600], 'SOFT_NMS_METHOD': 1, 'SVM': False},
 'TRAIN': {'ASPECT_CROPPING': False, 'ASPECT_GROUPING': False, 'BATCH_SIZE': 128, 'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0], 'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2], 'BBOX_NORMALIZE_TARGETS': True, 'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True, 'BBOX_REG': True, 'BBOX_THRESH': 0.5, 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'BIAS_DECAY': False, 'BN_TRAIN': False, 'DISPLAY': 20, 'DOUBLE_BIAS': False, 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'FG_THRESH_2ND': 0.6, 'FG_THRESH_3RD': 0.7, 'GAMMA': 0.1, 'HAS_RPN': True, 'IMS_PER_BATCH': 1, 'LEARNING_RATE': 0.001, 'MAX_SIZE': 1000, 'MOMENTUM': 0.9, 'PROPOSAL_METHOD': 'gt', 'RPN_BATCHSIZE': 256, 'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0], 'RPN_CLOBBER_POSITIVES': False, 'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 8, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POSITIVE_WEIGHT': -1.0, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'SCALES': [600], 'SNAPSHOT_ITERS': 5000, 'SNAPSHOT_KEPT': 3, 'SNAPSHOT_PREFIX': 'res101_faster_rcnn', 'STEPSIZE': [30000], 'SUMMARY_INTERVAL': 180, 'TRIM_HEIGHT': 600, 'TRIM_WIDTH': 600, 'TRUNCATED': False, 'USE_ALL_GT': True, 'USE_FLIPPED': True, 'USE_GT': False, 'WEIGHT_DECAY': 0.0001},
 'USE_GPU_NMS': True}
Loaded dataset voc_2007_test for training
Set proposal method: gt
Preparing training data...
voc_2007_test gt roidb loaded from /DATACENTER2/qyj/cascade-rcnn_Pytorch-master/data/cache/voc_2007_test_gt_roidb.pkl
done
3462 roidb entries
load checkpoint models/detnet59/pascal_voc/exp_name/fpn_1_7_6310.pth
load model successfully!
cudaCheckError() failed : an illegal memory access was encountered

That is the report after setting os.environ['CUDA_LAUNCH_BLOCKING'] = '1' to locate the real place that triggered the cudaCheckError(). Without it, the error is:

3462 roidb entries
load checkpoint models/detnet59/pascal_voc/exp_name/fpn_1_7_6310.pth
load model successfully!
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generated/../THCReduceAll.cuh line=339 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "test_net.py", line 246, in <module>
    ret = fpn(im_data, im_info, gt_boxes, num_boxes)
  File "/home/zhiqi.cheng/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/DATACENTER2/qyj/cascade-rcnn_Pytorch-master/lib/model/fpn/cascade/fpn.py", line 316, in forward
    roi_pool_feat = self._PyramidRoI_Feat(mrcnn_feature_maps, rois, im_info)
  File "/DATACENTER2/qyj/cascade-rcnn_Pytorch-master/lib/model/fpn/cascade/fpn.py", line 135, in _PyramidRoI_Feat
    if (roi_level == l).sum() == 0:
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513363039688/work/torch/lib/THC/generated/../THCReduceAll.cuh:339
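
For anyone trying to reproduce the two reports, here is a minimal sketch of the debugging setup described in this comment. CUDA kernels are launched asynchronously, so the Python line named in a traceback (here the `.sum()` in `_PyramidRoI_Feat`) is often only where an earlier fault gets noticed. The helper name `run_and_sync` is hypothetical and not part of this repository; it is an assumption for illustration only.

```python
import os

# Set before the first CUDA call; doing it before importing torch is the safe choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def run_and_sync(step_name, fn, *args, **kwargs):
    """Call fn, then wait for the GPU so a pending CUDA fault is raised here.

    Hypothetical helper for narrowing down which step triggers the error.
    """
    out = fn(*args, **kwargs)
    try:
        import torch
    except ImportError:
        return out  # no PyTorch available: nothing to synchronize
    if torch.cuda.is_available():
        try:
            torch.cuda.synchronize()
        except RuntimeError as err:
            raise RuntimeError(
                "CUDA error surfaced after '{}': {}".format(step_name, err)
            )
    return out


# Stand-in call; in test_net.py one might wrap the forward pass instead.
print(run_and_sync("demo", lambda a, b: a + b, 1, 2))
```

Wrapping successive steps of the forward pass this way should narrow down which call actually faults, though it does not by itself explain why the access is illegal.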

hcx1231 commented 6 years ago

I also have the same problem.

qq184861643 commented 5 years ago

@hcx1231 @KevinQian97 @guoruoqian Hi guys, I've got the same problem. Have you solved it? Could you please tell me how?

chenghuaijun commented 5 years ago

@KevinQian97 @hcx1231 @qq184861643 @guoruoqian I also have the same problem. Have you solved it?

LcenArthas commented 5 years ago

@chenghuaijun @KevinQian97 @hcx1231 @qq184861643 @guoruoqian I also have the same problem. Have you solved it?

huihuiustc commented 5 years ago

I also have the same problem. Have you solved it?

KevinQian97 commented 5 years ago

Personally, when I use a single GPU for training, the problem seems to go away.

huihuiustc commented 5 years ago

It happens when testing.

KevinQian97 commented 5 years ago

Sorry, I made a typo.

huihuiustc notifications@github.com wrote on Sunday, May 26, 2019 at 8:57 PM:

It happens when testing
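
Reading the exchange above as "the problem seems to go away with a single GPU at test time", a minimal sketch of that workaround could look like this. It only builds the command and environment; the script name and flags are copied from the failing command in the original report, minus `--mGPUs`. The function name `single_gpu_test_command` and the choice of GPU `0` are assumptions for illustration.

```python
import os


def single_gpu_test_command(exp_name, gpu_id="0"):
    """Build the test command and environment for a one-GPU run (not executed here)."""
    env = dict(os.environ)
    # Expose only one device to the process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # Same flags as the failing command in the report, without --mGPUs.
    cmd = ["python", "test_net.py", exp_name, "--cascade", "--cuda"]
    return cmd, env


cmd, env = single_gpu_test_command("exp_name")
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"] + " " + " ".join(cmd))
```

From the repository root, the pair could then be passed to `subprocess.call(cmd, env=env)`. Whether this avoids the error for every setup is not confirmed in this thread.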

E-Dreamer-LQ commented 5 years ago

@chenghuaijun @KevinQian97 @hcx1231 @qq184861643 @guoruoqian I also have the same problem. Have you solved it? The code can only be trained, it can't be tested!!!

linquanxu commented 5 years ago

I also have the same problem. Have you solved it?

Jacky-gsq commented 4 years ago

@linquanxu @qq184861643 @hcx1231 @huihuiustc @chenghuaijun Maybe a little late. I recently used this code and had the same problem, but I solved it. You can try the following to solve it:

49

herrickli commented 3 years ago

Same error, help~