mdaiter / openMVG

openMVG with a LATCH descriptor, an ORB descriptor, DEEP descriptors from the cvpr15compare repo, a PNNet/Torch loader, and a GPU-based L2 matcher integrated

SIGSEGV when running GPU matcher #12

Closed donlk closed 8 years ago

donlk commented 8 years ago

So running GPU_LATCH matcher on LATCH descriptors results in the following error:

```
Thread 1 "openMVG_main_Co" received signal SIGSEGV, Segmentation fault.
0x00007fffab57391c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) where
#0  0x00007fffab57391c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007fffab61681e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fffab70aecf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffab617bd1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fffab3951a2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fffab397a85 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fffab6927c2 in cuMemcpyHtoDAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x000000000139adac in cudart::driverHelper::memcpyAsyncDispatch(void*, void const*, unsigned long, cudaMemcpyKind, CUstream_st*, bool) ()
#8  0x00000000013747fb in cudart::cudaApiMemcpyAsyncCommon(void*, void const*, unsigned long, cudaMemcpyKind, CUstream_st*, bool) ()
#9  0x00000000013b00f8 in cudaMemcpyAsync ()
#10 0x00000000012843a9 in LatchBitMatcher::match (this=0x7fffffffbdf0, h_descriptors1=0x7fffafa5d010, h_descriptors2=0x7fffcc494010, numKP0=30720, numKP1=30720)
    at /mnt/linuxdata/Development/work/projects/sfmrecon/3rdparty/openMVG/src/openMVG/matching_image_collection/gpu/LatchBitMatcher.cpp:57
#11 0x000000000127e69f in openMVG::matching_image_collection::GPU_Matcher_Regions_AllInMemory::Match (this=0x1d0f470, sfm_data=..., regions_provider=std::shared_ptr (count 1, weak 0) 0x1d0f6b0, pairs=std::set with 666 elements = {...}, map_PutativesMatches=std::map with 0 elements)
    at /mnt/linuxdata/Development/work/projects/sfmrecon/3rdparty/openMVG/src/openMVG/matching_image_collection/GPU_Matcher_Regions_AllInMemory.cpp:81
#12 0x0000000000f36073 in main (argc=1, argv=0x7fffffffd678)
    at /mnt/linuxdata/Development/work/projects/sfmrecon/3rdparty/openMVG/src/software/SfM/main_ComputeMatches.cpp:352
```

It's not running in parallel, to my knowledge.

mdaiter commented 8 years ago

Hey @donlk, it seems like this is a cudaMemcpyAsync error, which is strange because the cudaMemset call beforehand seems to succeed. That points to one of two possible issues:

  1. Can you check the size of h_descriptors1? If the system can't allocate that much memory in a single contiguous block, could you try running `ulimit -s unlimited` on the command line beforehand?
  2. Can you check the total memory of your GPU?

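For what it's worth, a standalone repro with an explicit check on every runtime call would show exactly which copy fails instead of segfaulting inside libcuda later. A minimal sketch using the standard CUDA runtime API (the buffer sizes are illustrative, not taken from the repo's wrapper code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line context as soon as any runtime call fails.
#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
              cudaGetErrorString(err), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

int main() {
  // Suggestion 2: check the total/free memory of the GPU up front.
  size_t free_b = 0, total_b = 0;
  CUDA_CHECK(cudaMemGetInfo(&free_b, &total_b));
  printf("GPU memory: %zu MiB free / %zu MiB total\n",
         free_b >> 20, total_b >> 20);

  // The failing pattern (memset then async host-to-device copy),
  // with checks added; ~30k keypoints of 512-bit descriptors.
  const size_t bytes = 30720u * 64u;
  void* d_buf = nullptr;
  CUDA_CHECK(cudaMalloc(&d_buf, bytes));
  CUDA_CHECK(cudaMemset(d_buf, 0, bytes));
  // The host buffer must stay valid until the async copy completes.
  void* h_buf = calloc(bytes, 1);
  CUDA_CHECK(cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaDeviceSynchronize());
  free(h_buf);
  CUDA_CHECK(cudaFree(d_buf));
  return 0;
}
```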
donlk commented 8 years ago

This is getting stranger by the minute. Yesterday I had it crashing all the time; now it runs just fine. Although I'm only getting a single match for every image pair, even though more than 10,000 features are detected.

I have 4 GB of VRAM (precisely, it's 3.5 + 0.5 on a GTX 970).

csp256 commented 8 years ago

I developed it on a 970M, which is a less powerful card. The memory required should be << 1 GB. You would see errors before kernel launch if alloc failed. Assuming you are checking for them, Matthew?

Where are the matches located? Is it in the first or last few keypoints?

donlk commented 8 years ago

Ok, I've run a few more tests. Right now I'm getting the following results quite consistently:

- LATCH_UNSIGNED + GPU_LATCH = 1 match for every pair, no crash
- LATCH_UNSIGNED + GPUBRUTEFORCEL2 = 0 matches for every pair, no crash
- LATCH_BINARY + GPU_LATCH = the aforementioned crash
- LATCH_BINARY + GPUBRUTEFORCEL2 = 0 matches for every pair, no crash

csp256 commented 8 years ago

1 match for every pair meaning 1 match for every pair of frames?

0 matches for L2 is to be expected and is not a bug. LATCH is not designed to work under the L2 norm.

@mdaiter what is the difference between LATCH_UNSIGNED and LATCH_BINARY? If I remember correctly I only used unsigned ints in my original implementation.

@donlk I feel stupid asking this, but do you know what the endianness of your machine is? You have a reasonably modern Intel or AMD CPU, right?

donlk commented 8 years ago

I have an i7 4770, which I think can deal with both big- and little-endian compiled code. I did not pass any specific flags that would mess with the byte order, though; here are my compile flags straight from cmake: `-D_FORCE_INLINES -march=native --std=c++11 -fopenmp -O3 -fPIC`. The interesting part here is `-march=native`, of course; I'm trying to find out what other flags it triggers on Haswell.

Update: here they are:

```
/usr/lib/gcc/x86_64-linux-gnu/5/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu - -march=haswell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-clwb -mno-pcommit -mno-mwaitx --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=8192 -mtune=haswell -fstack-protector-strong -Wformat -Wformat-security
```

donlk commented 8 years ago

You think the endianness interferes with the data that gets copied to the GPU?

csp256 commented 8 years ago

No. Endianness issues are vanishingly unlikely. I was just checking that there wasn't any sort of weird issues that might have arisen in case @mdaiter did something silly like use a bool array CPU side and transfer it to the GPU.

He is certainly asleep right now, but I have barely even glanced at his wrapper code for the kernel.

mdaiter commented 8 years ago

OH! I know what's up. You can't do LATCH_BINARY + GPU_LATCH. The LATCH matcher @csp256 designed was intended only for scalar descriptors, not binary ones. If you use GPU_LATCH, you need to use it with LATCH_UNSIGNED. Otherwise, you need to use a binary matcher.

donlk commented 8 years ago

Ok, but we still have the issue of the low (or nonexistent) match count. Ideas?

mdaiter commented 8 years ago

@donlk what's the image set?

donlk commented 8 years ago

I use my own. They are certainly not the source of the issue, as I've fed it image sets with resolutions from full HD to 4K, and tens of thousands of features were successfully detected. Isn't it related to some CUDA 8-specific implementation?

mdaiter commented 8 years ago

@donlk what arch did you compile this for? If you compiled for sm_60, that could be your problem.

donlk commented 8 years ago

I compiled it with compute_30 + sm_30 and with compute_53 + sm_53 as well. Both worked fine detection-wise; the only difference was in feature count.

donlk commented 8 years ago

Are you sure there's absolutely no chance this is caused by some code part being CUDA 8 specific? I could install version 8 if necessary.

csp256 commented 8 years ago

I don't even have CUDA 8 (I use 7.5), so the CUDA code itself is not 8-specific. Maybe the wrapper code is, but I sincerely doubt that.

When you set compute_xx and sm_xx, isn't one supposed to represent your physical architecture? That might be your problem.

donlk commented 8 years ago

YES! Setting it to level 52 solved the issue (it seems it has to match the highest compute capability the card supports). I had no idea it needed to match my architecture exactly. Thank you! Forgive my ignorance.
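For anyone hitting the same thing: the GTX 970 is compute capability 5.2, so nvcc has to target sm_52. A sketch of the flag combination (the file name here is a placeholder; the deviceQuery sample shipped with the CUDA toolkit prints your card's capability):

```shell
# Target the card's real architecture: compute capability 5.2 for a GTX 970.
nvcc -gencode arch=compute_52,code=sm_52 -c LatchBitMatcher.cu -o LatchBitMatcher.o
```

Targeting a higher architecture than the card supports (e.g. sm_60 on a 970) produces a binary the driver cannot run, which matches the symptoms above.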