Closed · donlk closed this issue 8 years ago
Hey @donlk,

It seems like this is a `cudaMemcpyAsync` error, which is strange because the `cudaMemset` call beforehand seems successful. This leads to one of two issues: how large is `h_descriptors1`? If it can't allocate that much memory in a single contiguous block, could you try running `ulimit -s unlimited` on the command line beforehand?

This is getting stranger by the minute. Yesterday I had it crashing all the time; now it runs just fine. Although I'm only getting a single match for every image pair, even though more than 10,000 features are detected.
I have 4 GB of VRAM (precisely, it's 3.5 + 0.5 on a GTX 970).
I developed it on a 970M, which is a less powerful card. The memory required should be << 1 GB. You would see errors before kernel launch if the alloc failed. Assuming you are checking for them, Matthew?
Where are the matches located? Are they in the first or last few keypoints?
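For reference, here is a minimal sketch of the kind of error checking meant here, assuming a `CUDA_CHECK`-style wrapper around every runtime call. The macro name is illustrative, and status codes are plain `int`s (0 standing in for `cudaSuccess`) so the sketch is self-contained; in real code you would include `cuda_runtime.h` and report `cudaError_t` via `cudaGetErrorString`.

```cpp
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call, e.g. CUDA_CHECK(cudaMalloc(&p, n)) or
// CUDA_CHECK(cudaMemcpyAsync(...)), so a failed alloc is reported at its
// call site instead of surfacing later as a mysterious async memcpy error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        const int cuda_check_err = (call);                            \
        if (cuda_check_err != 0) {                                    \
            std::fprintf(stderr, "CUDA error %d at %s:%d\n",          \
                         cuda_check_err, __FILE__, __LINE__);         \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)
```

With a wrapper like this, an allocation failure aborts with a file and line number well before the kernel launch, which is what the comment above is relying on.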
Ok, I've run a few more tests. Right now I'm getting the following results quite consistently:

- `LATCH_UNSIGNED` + `GPU_LATCH` = 1 match for every pair, no crash
- `LATCH_UNSIGNED` + `GPUBRUTEFORCEL2` = 0 matches for every pair, no crash
- `LATCH_BINARY` + `GPU_LATCH` = the aforementioned crash
- `LATCH_BINARY` + `GPUBRUTEFORCEL2` = 0 matches for every pair, no crash
1 match for every pair, meaning 1 match for every pair of frames?
0 matches for L2 is to be expected and is not a bug. LATCH is not designed to work under the L2 norm.
@mdaiter what is the difference between `LATCH_UNSIGNED` and `LATCH_BINARY`? If I remember correctly, I only used unsigned ints in my original implementation.
@donlk I feel stupid asking this, but do you know what the endianness of your machine is? You have a reasonably modern Intel or AMD CPU, right?
I have an i7 4770, which I think can deal with both big- and little-endian compiled code. I did not pass any specific flags that would mess with the byte order, though; here are my compile flags, straight from CMake:
`-D_FORCE_INLINES -march=native --std=c++11 -fopenmp -O3 -fPIC`
The interesting part here is `-march=native`, of course; I'm trying to find out what other flags it triggers on Haswell.
Update: here they are:

```
/usr/lib/gcc/x86_64-linux-gnu/5/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu - -march=haswell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mrtm -mhle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-clwb -mno-pcommit -mno-mwaitx --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=8192 -mtune=haswell -fstack-protector-strong -Wformat -Wformat-security
```
You think the endianness interferes with the data that gets copied to the GPU?
No. Endianness issues are vanishingly unlikely. I was just checking that there wasn't some weird issue that might have arisen in case @mdaiter did something silly, like using a bool array CPU-side and transferring it to the GPU.
He is certainly asleep right now, but I have barely even glanced at his wrapper code for the kernel.
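For what it's worth, the endianness question can be settled at runtime with a few lines; this is just a sketch of the standard check (x86/x86-64 CPUs like the i7 4770 are always little-endian, and `-march=native` does not change byte order):

```cpp
#include <cstdint>
#include <cstring>

// Returns true on a little-endian machine by inspecting the
// lowest-addressed byte of a 32-bit integer with the value 1.
bool isLittleEndian() {
    const std::uint32_t value = 1;
    std::uint8_t firstByte;
    std::memcpy(&firstByte, &value, 1);  // read the byte at the lowest address
    return firstByte == 1;
}
```

On any Intel or AMD desktop CPU this returns true, which is why endianness can be ruled out here.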
OH! I know what's up.
You can't do `LATCH_BINARY` + `GPU_LATCH`. The LATCH matcher @csp256 designed was intended only for scalar points, not binary points. If you use `GPU_LATCH`, you need to use it with `LATCH_UNSIGNED`. Otherwise, you need to use a binary matcher.
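To make the binary-matcher point concrete: binary descriptors are compared under the Hamming norm, i.e. the popcount of the XOR of the two bit strings, not under L2. A minimal sketch, assuming a 512-bit descriptor stored as 16 x 32-bit words (the layout is illustrative, not openMVG's actual storage):

```cpp
#include <cstdint>

// Hamming distance between two binary descriptors stored as 16 x 32-bit
// words (512 bits total, an assumed layout). Interpreting these bit
// patterns as coordinates under L2 is meaningless, which is why an L2
// matcher finds zero matches on them.
int hammingDistance(const std::uint32_t* a, const std::uint32_t* b) {
    int dist = 0;
    for (int i = 0; i < 16; ++i)
        dist += __builtin_popcount(a[i] ^ b[i]);  // GCC/Clang intrinsic
    return dist;
}
```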
Ok, but we still have the issue of a low (or nonexistent) match count. Ideas?
@donlk what's the image set?
I use my own. They are certainly not the source of the issue, as I've fed it image sets with resolutions from full HD to 4K, and tens of thousands of features were successfully detected. Isn't it related to some CUDA 8-specific implementation?
@donlk what arch did you compile this for? If you compiled for `sm_60`, that could be your problem.
I compiled it with `compute_30 + sm_30` and `compute_53 + sm_53` as well. Both worked fine detection-wise; the only difference was in feature count.
Are you sure there's absolutely no chance this is caused by some part of the code being CUDA 8-specific? I could install version 8 if necessary.
I don't even have CUDA 8 (I use 7.5), so the CUDA code itself is not 8-specific. Maybe the wrapper code, but I sincerely doubt that.
When you set `compute_xx` and `sm_xx`, isn't one supposed to represent your physical architecture? That might be your problem.
YES! Setting it to level 52 solved the issue (it seems it has to match the highest supported CUDA level). I had no idea it needed to match my architecture exactly. Thank you! Forgive my ignorance.
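For anyone who lands here with the same symptom: the gencode settings must name the physical GPU's compute capability, and a GTX 970 is compute capability 5.2. A sketch of the fix in the old FindCUDA-style CMake of this era (the exact variable name is an assumption about the build setup):

```cmake
# Target the physical GPU: GTX 970 = compute capability 5.2.
# Building only for compute_30/sm_30, etc. led to the mismatch above.
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_52,code=sm_52)
```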
So, running the `GPU_LATCH` matcher on LATCH descriptors results in the following error:

```
Thread 1 "openMVG_main_Co" received signal SIGSEGV, Segmentation fault.
0x00007fffab57391c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) wher
#0  0x00007fffab57391c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007fffab61681e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fffab70aecf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffab617bd1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fffab3951a2 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fffab397a85 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fffab6927c2 in cuMemcpyHtoDAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x000000000139adac in cudart::driverHelper::memcpyAsyncDispatch(void*, void const*, unsigned long, cudaMemcpyKind, CUstream_st*, bool) ()
#8  0x00000000013747fb in cudart::cudaApiMemcpyAsyncCommon(void*, void const*, unsigned long, cudaMemcpyKind, CUstream_st*, bool) ()
#9  0x00000000013b00f8 in cudaMemcpyAsync ()
#10 0x00000000012843a9 in LatchBitMatcher::match (this=0x7fffffffbdf0, h_descriptors1=0x7fffafa5d010, h_descriptors2=0x7fffcc494010, numKP0=30720, numKP1=30720) at /mnt/linuxdata/Development/work/projects/sfmrecon/3rdparty/openMVG/src/openMVG/matching_image_collection/gpu/LatchBitMatcher.cpp:57
#11 0x000000000127e69f in openMVG::matching_image_collection::GPU_Matcher_Regions_AllInMemory::Match (
#12 0x0000000000f36073 in main (argc=1, argv=0x7fffffffd678)
```

It's not running in parallel, to my knowledge.