mjiUST / SurfaceNet

2017 ICCV, SurfaceNet: An End-to-end 3D Neural Network for Multiview Stereopsis

CUDNN_STATUS_INTERNAL_ERROR while running main.py #4

Open Rubikplayer opened 6 years ago

Rubikplayer commented 6 years ago

Hi, thanks for the earlier feedback in the other thread. After setting up CUDA 8.0 / cuDNN 5.1 and Theano 0.9, I can run part of main.py, but I still get an error when the patch2embedding() function executes in the early rejection stage.

More specifically:

Traceback (most recent call last):
  File "./main.py", line 27, in <module>
    save_npz_file_path = main_reconstruct.reconstruction(datasetFolder, _model, imgNamePattern, poseNamePattern, outputFolder, N_viewPairs4inference, resol, BB, viewList)
  File "/home/ICT2000/tli/Workspace/SurfaceNet/main_reconstruct.py", line 77, in reconstruction
    cubeCenter_hw = np.stack([img_h_cubesCenter, img_w_cubesCenter], axis=0))    # (N_cubes, N_views, D_embedding), (N_cubes, N_views)
  File "./utils/earlyRejection.py", line 31, in patch2embedding
    patches_embedding[:,:] = patch2embedding_fn(patch_allBlack)[0] # don't use np.repeat (out of memory)
  File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
RuntimeError: error doing operation: CUDNN_STATUS_INTERNAL_ERROR
Apply node that caused the error: GpuDnnConv{algo='small', inplace=False}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode=(1, 1), subsample=(1, 1), conv_mode='cross', precision='float32'}.0, Cast{float32}.0, Cast{float32}.0)
Toposort index: 276
Inputs types: [GpuArrayType<None>(float32, (False, False, False, False)), GpuArrayType<None>(float32, (False, False, False, False)), GpuArrayType<None>(float32, (False, False, False, False)), <theano.gof.type.CDataType object at 0x7fbd6848bc90>, Scalar(float32), Scalar(float32)]
Inputs shapes: [(1, 3, 64, 64), (64, 3, 3, 3), (1, 64, 64, 64), 'No shapes', (), ()]
Inputs strides: [(49152, 16384, 256, 4), (108, 36, 12, 4), (1048576, 16384, 256, 4), 'No strides', (), ()]
Inputs values: ['not shown', 'not shown', 'not shown', <capsule object NULL at 0x7fbb43bd10c0>, 1.0, 0.0]
Inputs type_num: [11, 11, 11, '', 11, 11]
Outputs clients: [[HostFromGpu(gpuarray)(GpuDnnConv{algo='small', inplace=False}.0)]]

The detailed error log is here: err_log.txt

I have tried:

None has worked so far.

Have you seen this type of error before? Or is my machine set up incorrectly? I noticed you have a params.py that specifies all the parameters. Some people have mentioned that this error can result from a lack of memory (link), and it seems your code does some batch processing.

Info on my setup:

My ~/.theanorc:






If you have any suggestions, please let me know! Thanks for your help and support!


After I removed the other versions of cuDNN (https://groups.google.com/forum/#!topic/theano-users/w4M3Xy0ec60), the error changed to the following:

Traceback (most recent call last):
  File "./main.py", line 27, in <module>
    save_npz_file_path = main_reconstruct.reconstruction(datasetFolder, _model, imgNamePattern, poseNamePattern, outputFolder, N_viewPairs4inference, resol, BB, viewList)
  File "/home/ICT2000/tli/Workspace/SurfaceNet/main_reconstruct.py", line 77, in reconstruction
    cubeCenter_hw = np.stack([img_h_cubesCenter, img_w_cubesCenter], axis=0))    # (N_cubes, N_views, D_embedding), (N_cubes, N_views)
  File "./utils/earlyRejection.py", line 48, in patch2embedding
    _patches_embedding_inScope[_batch] = patch2embedding_fn(_patches_preprocessed[_batch])     # (N_batch, 3/1, patchSize, patchSize) --> (N_batch, D_embedding). similarityNet: patch --> embedding
  File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/compile/function_module.py", line 898, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/ICT2000/tli/.conda/envs/SurfaceNet/lib/python2.7/site-packages/theano/compile/function_module.py", line 884, in __call__
    self.fn() if output_subset is None else\
  File "pygpu/gpuarray.pyx", line 676, in pygpu.gpuarray.pygpu_empty
  File "pygpu/gpuarray.pyx", line 290, in pygpu.gpuarray.array_empty
pygpu.gpuarray.GpuArrayException: cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Apply node that caused the error: GpuDnnConv{algo='small', inplace=False}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty{dtype='float32', context_name=None}.0, GpuDnnConvDesc{border_mode=(1, 1), subsample=(1, 1), conv_mode='cross', precision='float32'}.0, Cast{float32}.0, Cast{float32}.0)
Toposort index: 276
Inputs types: [GpuArrayType<None>(float32, (False, False, False, False)), GpuArrayType<None>(float32, (False, False, False, False)), GpuArrayType<None>(float32, (False, False, False, False)), <theano.gof.type.CDataType object at 0x7f703a019c90>, Scalar(float32), Scalar(float32)]
Inputs shapes: [(1100, 3, 64, 64), (64, 3, 3, 3), (1100, 64, 64, 64), 'No shapes', (), ()]
Inputs strides: [(49152, 16384, 256, 4), (108, 36, 12, 4), (1048576, 16384, 256, 4), 'No strides', (), ()]
Inputs values: ['not shown', 'not shown', 'not shown', <capsule object NULL at 0x7f6e133930c0>, 1.0, 0.0]
Inputs type_num: [11, 11, 11, '', 11, 11]
Outputs clients: [[HostFromGpu(gpuarray)(GpuDnnConv{algo='small', inplace=False}.0)]]

mjiUST commented 6 years ago

@Rubikplayer The updated error log reports: pygpu.gpuarray.GpuArrayException: cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory. Can you either change cnmem=0.75 to cnmem=0.95 in .theanorc, or change __GPUMemoryGB = 11 to a safer value, say __GPUMemoryGB = 6, in params.py? Then let's see what it prints out.

Also, for the Theano installation, please refer to https://github.com/mjiUST/SurfaceNet/issues/3#issuecomment-371688429
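
To put a number on the out-of-memory error, the input shapes printed in the second traceback are enough to estimate the size of the buffers for the failing convolution. The sketch below is only back-of-the-envelope arithmetic: the helper names `tensor_bytes` and `max_batch` are hypothetical (they are not part of SurfaceNet or Theano), and the estimate covers a single layer's input and output, not the whole network or cuDNN workspace.

```python
from functools import reduce
from operator import mul

FLOAT32_BYTES = 4  # the traceback reports float32 tensors


def tensor_bytes(shape, itemsize=FLOAT32_BYTES):
    """Bytes needed for one dense tensor of the given shape (hypothetical helper)."""
    return reduce(mul, shape, 1) * itemsize


def max_batch(per_sample_shape, budget_bytes, itemsize=FLOAT32_BYTES):
    """Largest batch whose tensor of per-sample shape fits in budget_bytes."""
    return budget_bytes // tensor_bytes(per_sample_shape, itemsize)


# Shapes copied from "Inputs shapes" in the out-of-memory traceback.
conv_input = (1100, 3, 64, 64)
conv_output = (1100, 64, 64, 64)

in_bytes = tensor_bytes(conv_input)
out_bytes = tensor_bytes(conv_output)

print("input  buffer: %.1f MiB" % (in_bytes / 2.0 ** 20))
print("output buffer: %.1f MiB" % (out_bytes / 2.0 ** 20))

# Assumption for illustration: 1 GiB is left for this one output buffer.
print("batch that fits in 1 GiB:", max_batch(conv_output[1:], 2 ** 30))
```

Each sample of the output takes 64 * 64 * 64 * 4 bytes, which is exactly 1 MiB and agrees with the first stride (1048576) in the traceback, so a batch of 1100 needs about 1.1 GB for this one buffer. Lowering the batch size (which __GPUMemoryGB presumably controls, though that is an assumption here) reduces this linearly.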

Rubikplayer commented 6 years ago

@mjiUST The code seems to be running after I set gpuarray.preallocate=0.8 (and commented out the cnmem line: #cnmem=0.75). This was before I saw your feedback; I will try your suggested values a bit later.

May I confirm two things with you:

According to the Theano docs (link), gpuarray.preallocate seems to be designed for the new GPU backend, and cnmem for the old one. Since we are using version 0.9, I suppose I should set cnmem instead of gpuarray.preallocate? If so, then what I set did not actually impose any limit.

My setting change: __GPUMemoryGB = 11 and __cube_D = 32. Also, my GPU (1080 Ti) should be slower than Titan X.

Thanks for the help!!
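
On the cnmem versus gpuarray.preallocate question: in Theano the choice follows the `device` setting rather than the version number, with `device = cuda*` selecting the new gpuarray backend (which reads `gpuarray.preallocate`) and `device = gpu*` selecting the old one (which reads `lib.cnmem`). The tracebacks above mention pygpu.gpuarray and GpuArrayType, which belong to the new backend. The sketch below only illustrates that rule with the standard library; the sample `.theanorc` contents and the function `effective_preallocation` are hypothetical, since the actual file was not preserved in this thread.

```python
import configparser  # Python 3; the thread's environment is Python 2.7

# Hypothetical .theanorc contents, for illustration only.
SAMPLE_THEANORC = """
[global]
device = cuda
floatX = float32

[gpuarray]
preallocate = 0.8

[lib]
cnmem = 0.75
"""


def effective_preallocation(rc_text):
    """Return (option_name, value) for the preallocation flag the selected backend reads."""
    parser = configparser.ConfigParser()
    parser.read_string(rc_text)
    device = parser.get("global", "device", fallback="cpu")
    if device.startswith("cuda"):
        # New gpuarray backend: lib.cnmem is not consulted.
        return "gpuarray.preallocate", parser.get("gpuarray", "preallocate", fallback=None)
    if device.startswith("gpu"):
        # Old CUDA backend: gpuarray.preallocate is not consulted.
        return "lib.cnmem", parser.get("lib", "cnmem", fallback=None)
    return None, None


print(effective_preallocation(SAMPLE_THEANORC))
```

With a file like the sample, only the `[gpuarray]` entry would matter, so setting gpuarray.preallocate=0.8 does impose a preallocation on the new backend.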

mjiUST commented 6 years ago

@Rubikplayer Thanks for your feedback. It's great to know the code is running.

Rubikplayer commented 6 years ago

@mjiUST Thanks for the suggestion! Setting optimizer=fast_run does indeed accelerate the process, but with __cube_D = 64 I still get an out-of-memory error. I've sent an email to your school address with more detailed questions.

Thanks again!
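
One likely reason __cube_D = 64 runs out of memory where 32 did not is plain cubic scaling. The sketch below assumes that per-cube memory is proportional to the number of voxels, cube_D ** 3; that is an assumption about how the network's 3D tensors scale, not something verified against the SurfaceNet code, and both function names are hypothetical.

```python
def voxel_count(cube_d):
    """Number of voxels in a cube with side length cube_d."""
    return cube_d ** 3


def relative_volume(new_d, old_d):
    """How many times more voxels a cube of side new_d has than one of side old_d."""
    return voxel_count(new_d) / float(voxel_count(old_d))


ratio = relative_volume(64, 32)
print("voxels per cube grow by a factor of %.0f" % ratio)
```

Under that assumption, doubling the side length multiplies per-cube memory by 8, so the number of cubes processed per batch would need to shrink by roughly the same factor to stay within the same GPU memory.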