response map fusion implementation

meiqua commented 4 years ago

Motivation

According to the Halide paper, fusion can speed up the creation of the response map considerably. However, configuring Halide is not easy, and our response map doesn't need most of Halide's features anyway. So implementing a simple tile-based fusion method is preferred. This is also what OpenCV 4 does.

Related issues

Current works

Currently, a simple tile-based fusion pipeline is implemented, and gaussian / sobel / mag / phase / hist / spread ... are finished and tested. Refer to the fusion by hand branch for more info. The basic idea is to implement tile-based fusion only and to do Halide's compilation work by hand... Though it is not as fancy as Halide, it simplifies the job a lot and is easy to use too.
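To make the idea concrete, here is a minimal sketch of tile-based fusion for a two-stage 3x3 box sum (a 1x3 horizontal pass followed by a 3x1 vertical pass). It is not the code from the fusion by hand branch: `Image`, `hsum_row`, `box3x3_unfused`, and `box3x3_fused` are hypothetical names, and the clamped border handling is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal image type, used only for this sketch.
struct Image {
    int rows, cols;
    std::vector<int32_t> data;
    Image(int r, int c) : rows(r), cols(c), data(static_cast<size_t>(r) * c, 0) {}
    int32_t& at(int r, int c) { return data[static_cast<size_t>(r) * cols + c]; }
    int32_t at(int r, int c) const { return data[static_cast<size_t>(r) * cols + c]; }
};

// Stage 1 for one row: horizontal 1x3 sum with clamped borders.
static void hsum_row(const Image& in, int r, int32_t* out) {
    for (int c = 0; c < in.cols; ++c) {
        int left = std::max(c - 1, 0), right = std::min(c + 1, in.cols - 1);
        out[c] = in.at(r, left) + in.at(r, c) + in.at(r, right);
    }
}

// Reference: each stage runs over the whole image, with a full-size intermediate.
Image box3x3_unfused(const Image& in) {
    Image tmp(in.rows, in.cols), out(in.rows, in.cols);
    for (int r = 0; r < in.rows; ++r)
        hsum_row(in, r, &tmp.data[static_cast<size_t>(r) * in.cols]);
    for (int r = 0; r < in.rows; ++r) {
        int up = std::max(r - 1, 0), down = std::min(r + 1, in.rows - 1);
        for (int c = 0; c < in.cols; ++c)
            out.at(r, c) = tmp.at(up, c) + tmp.at(r, c) + tmp.at(down, c);
    }
    return out;
}

// Fused: for each tile of output rows, compute only the stage-1 rows that the
// tile needs (the tile plus a one-row halo) into a small buffer, then run
// stage 2 straight from that buffer. tile_rows must be >= 1.
Image box3x3_fused(const Image& in, int tile_rows) {
    Image out(in.rows, in.cols);
    std::vector<int32_t> buf(static_cast<size_t>(tile_rows + 2) * in.cols);
    auto row = [&](int local) { return &buf[static_cast<size_t>(local) * in.cols]; };
    for (int r0 = 0; r0 < in.rows; r0 += tile_rows) {
        int r1 = std::min(r0 + tile_rows, in.rows);
        int lo = std::max(r0 - 1, 0), hi = std::min(r1 + 1, in.rows);  // halo rows [lo, hi)
        for (int r = lo; r < hi; ++r) hsum_row(in, r, row(r - lo));
        for (int r = r0; r < r1; ++r) {
            int up = std::max(r - 1, 0), down = std::min(r + 1, in.rows - 1);
            for (int c = 0; c < in.cols; ++c)
                out.at(r, c) = row(up - lo)[c] + row(r - lo)[c] + row(down - lo)[c];
        }
    }
    return out;
}
```

The fused version recomputes the halo rows shared by neighbouring tiles, but its intermediate buffer holds only `tile_rows + 2` rows, so it can stay in cache instead of being written out as a full image between stages.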

Results and TODOs

It is roughly 10x faster than using OpenCV. We will use it to create the response map in the future.

See test_fusion.cpp for more examples. Any discussion, tests, or improvements are welcome!

Update

All tests now pass and the match function can be used as usual! The full response-map pipeline is about 6x faster, and images no longer need to be cropped to a multiple of 16 as before.

Update

RGB images are now supported too, by running cvtColor first. After investigating many solutions, we found that using OpenCV is the cleanest way... Compared with a gray image, cvtColor costs only ~5% more.

DennisLiu-elogic commented 4 years ago

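meiqua, it's me again.

I tried the hand-written fusion right away, and it breaks here.

The image is the same one with lots of hearts.

In VS, the enhanced instruction set option is set to SSE2.

With AVX2 selected, it breaks in a different place.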

meiqua commented 4 years ago

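Usually that's because MIPP doesn't implement some functions for certain instruction sets. Does it run normally if you turn SIMD off first?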

DennisLiu-elogic commented 4 years ago

Now it breaks somewhere else.

meiqua commented 4 years ago

What's the error?

DennisLiu-elogic commented 4 years ago

I found that op_row was given the wrong value. This is after changing it to 5.

meiqua commented 4 years ago

Is this from running the latest code as-is? I'll find a Windows laptop and try it.

DennisLiu-elogic commented 4 years ago

fusion by hand branch

It's this fusion.h. I changed a few places where pointers use gauss_size so that VS would compile it. MIPP also comes from there.

meiqua commented 4 years ago

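@DennisLiu-elogic I tried it: with a vector at the gauss_size spot and SIMD turned off, it runs fine.
With SIMD on, everything except AVX2 fails... I'll look into filling in the functions that MIPP leaves undefined.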

DennisLiu-elogic commented 4 years ago

That's odd. Could changing int32_t parent_buf_ptr [gauss_size] to int32_t parent_buf_ptr [5] make it fail even with SIMD off...?

aemior commented 4 years ago

@meiqua Are there plans to update fusion for RGB images soon?

meiqua commented 4 years ago

@aemior I plan to solve this SIMD problem first and then do the RGB2GRAY fusion. Full RGB fusion is a bit troublesome and doesn't feel very necessary.

aemior commented 4 years ago

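@meiqua OK. I'm building an RGB pipeline on my side. For detection in natural scenes with different kinds of targets, RGB should improve accuracy; for industrial scenes it is indeed unnecessary.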

mangoeffect commented 4 years ago

Hello, I tested fusion a bit and it doesn't compile in VS. Is it OK to define an array like this? Also, the -Wno-sign-compare flag is not valid in VS.

meiqua commented 4 years ago

@mangosroom The VS compiler doesn't support variable-length arrays. The new commit changes it to a vector, which works.
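For readers hitting the same compile error, here is a minimal sketch of the kind of change described: a run-time-sized scratch buffer declared as a std::vector instead of a variable-length array. The function `weighted_sum` and the buffer name `parent_buf` are hypothetical, not the repository's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: multiplies a window of samples by a kernel whose size is
// only known at run time, using a scratch buffer of that size.
int32_t weighted_sum(const std::vector<int32_t>& src, const std::vector<int32_t>& kernel) {
    const std::size_t gauss_size = kernel.size();

    // int32_t parent_buf[gauss_size];           // VLA: a GCC/Clang extension, rejected by MSVC
    std::vector<int32_t> parent_buf(gauss_size);  // standard C++, accepted by every compiler

    for (std::size_t i = 0; i < gauss_size; ++i) parent_buf[i] = src[i] * kernel[i];

    int32_t acc = 0;
    for (int32_t v : parent_buf) acc += v;
    return acc;
}
```

When the size is a true compile-time constant, a `constexpr` size with `std::array` avoids the heap allocation; the vector form is the general fix when the size depends on a run-time parameter.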

mangoeffect commented 4 years ago

Yes, that's how I changed it too. Algorithm-level code is best written in standard C++.

meiqua commented 4 years ago

@DennisLiu-elogic SSE2 should work now. In my earlier tests, SSE4 and AVX2 worked.

DennisLiu-elogic commented 4 years ago

Could you show me the changes to line2Dup.h / .cpp?

DennisLiu-elogic commented 4 years ago

It breaks here.

The roi's x and y are both -4. Calling .ptr () like that is bound to fail, isn't it?

meiqua commented 4 years ago

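Those two files weren't changed. The change is in MIPP, which now has mul, abs, and cvt<int16_t,int32_t>.
Apart from this debug assert there is no problem, because there is a range check afterwards. You can move that statement after the range check, or just use in.at(r, c).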

DennisLiu-elogic commented 4 years ago

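I didn't notice that there was a check afterwards... Still, having this if inside the for loop in copyToBound wastes a bit of time. Couldn't out be filled with 0 first and then filled in according to the roi? Or is there something I've missed?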

meiqua commented 4 years ago

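Filling with 0 first is slower than this, because it adds an extra write pass. But this isn't the hot path, so the difference in time is small.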
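The two approaches discussed above can be sketched side by side. This is an illustration only, not the repository's copyToBound: `Roi`, `copy_checked`, and `copy_zero_then_fill` are hypothetical names, and a flat row-major `std::vector<int>` stands in for the image. In both variants the source index is computed only after the range has been checked, which is the point about the debug assert.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Roi { int x, y, w, h; };  // hypothetical; x and y may be negative

// Variant A: one pass over the output; each pixel is either copied or set to 0.
std::vector<int> copy_checked(const std::vector<int>& in, int rows, int cols, Roi roi) {
    std::vector<int> out(static_cast<std::size_t>(roi.w) * roi.h);
    for (int r = 0; r < roi.h; ++r) {
        for (int c = 0; c < roi.w; ++c) {
            int sr = roi.y + r, sc = roi.x + c;
            bool inside = sr >= 0 && sr < rows && sc >= 0 && sc < cols;
            // The source is only touched after the range check.
            out[static_cast<std::size_t>(r) * roi.w + c] =
                inside ? in[static_cast<std::size_t>(sr) * cols + sc] : 0;
        }
    }
    return out;
}

// Variant B: zero-fill the output, then copy only the part of the roi that
// overlaps the image. The overlap is written twice (once as 0, once as data).
std::vector<int> copy_zero_then_fill(const std::vector<int>& in, int rows, int cols, Roi roi) {
    std::vector<int> out(static_cast<std::size_t>(roi.w) * roi.h, 0);
    int r_begin = std::max(0, -roi.y), r_end = std::min(roi.h, rows - roi.y);
    int c_begin = std::max(0, -roi.x), c_end = std::min(roi.w, cols - roi.x);
    for (int r = r_begin; r < r_end; ++r)
        for (int c = c_begin; c < c_end; ++c)
            out[static_cast<std::size_t>(r) * roi.w + c] =
                in[static_cast<std::size_t>(roi.y + r) * cols + (roi.x + c)];
    return out;
}
```

Both return the same buffer; which one is faster depends on the compiler and on how much of the roi lies outside the image, so it is worth measuring only if this ever becomes a hot path.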

DennisLiu-elogic commented 4 years ago

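I updated to the fusion branch's line2Dup.h / .cpp. Going through the original matching flow without SIMD, the out_header array goes out of bounds.

With SIMD:

test_fusion.cpp runs without problems ---- Correction: this is test_fusion.cpp with use_simd=true set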

meiqua commented 4 years ago

If use_simd = true but SIMD isn't configured, it will indeed fail. With use_simd = false it runs fine for me. Are you using the latest code?

DennisLiu-elogic commented 4 years ago

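I wasn't clear: SSE2 was enabled in the Visual Studio compiler options the whole time. The only thing I changed was use_simd.

So it is actually test_fusion that reports an error with use_simd=true and SSE2 enabled in the compiler. With use_simd=false and SSE2 enabled in the compiler, it runs normally.

fusion.h is indeed the new code.

The new line2Dup.h / .cpp were tested with angle_test() from the original test.cpp. That part hasn't been updated; I'll try it tomorrow.

-----0609 I checked angle_test (): only the template-rotation part (use_rot) was updated, and what I have is already the new code.

-- With use_simd=false and SIMD also turned off in the compiler: at the Gaussian node, the size of out_header is wrong when r=8. It is fine for every other value of r.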

meiqua commented 4 years ago

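It does go out of bounds; a condition should be added. It only ran normally before because the out-of-bounds value happened not to be used, and the compiler doesn't do bounds checking.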

DennisLiu-elogic commented 4 years ago

With the check added, this one is fine.

But with use_simd=true and SSE2 enabled in the compiler, it still reports an error.

In update_simd (), when dxint16.r = 0

Test image: https://drive.google.com/file/d/1FTuiw5dEgCmpNi3bnPTc8QwAmcVS0zFu/view?usp=sharing

meiqua commented 4 years ago

What's the error?

DennisLiu-elogic commented 4 years ago

Going by the call stack, the order is like this: line 748

meiqua commented 4 years ago

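It looks like low is undefined, but it is actually defined here. This should happen when use_simd=true while SSE2 is not configured. Are you sure SSE2 is on? You can run mipp_test() to check.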

DennisLiu-elogic commented 4 years ago

It turns out that enabling SSE2 has no effect on my machine; only AVX2 does... Why is that?

meiqua commented 4 years ago

MIPP enters the SSE branch through the macros here. I'm not sure whether the VS compiler defines them.

XuleiTao commented 4 years ago

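With VS I can also only use AVX2, but my CPU doesn't support the AVX instruction set. How can I use MIPP in that case? MIPP's documentation says it supports SSE.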

meiqua commented 4 years ago

Is it the same problem as above: SSE is enabled but MIPP doesn't enter the SSE branch?

meiqua commented 4 years ago

I searched, and that really is the case:

According to their documentation (msdn.microsoft.com/en-us/library/b0084kay.aspx), Visual Studio doesn’t set the __SSEn__ macros (but they do set __AVX__ and __AVX2__). – Stephen Canon May 22 '14 at 15:27

Typical, I suppose - everybody else defines the __SSEn__ macros, but not Microsoft. – Paul R May 22 '14 at 15:39

Try this branch and see whether it's fixed.

XuleiTao commented 4 years ago

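It still doesn't seem to work. I'm compiling for x86 here. The description in VS says that SSE and SSE2 are only available when building for the x86 architecture.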

meiqua commented 4 years ago

That doesn't matter much. When SSE2 is set, the SSE macro should be defined as well. I've changed that; can you try again?
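A minimal sketch of this kind of workaround follows; it is not necessarily what the branch does. MSVC reports the /arch setting through `_M_IX86_FP` on 32-bit x86 (1 for SSE, 2 for SSE2 or higher) and defines `_M_X64` on x64, where SSE2 is always available. The helpers `simd_branch` and `sse_macros_consistent` are hypothetical and exist only to show which branch a header keyed on the GCC-style macros would take.

```cpp
#include <cassert>
#include <string>

// Map MSVC's own predefined macros onto the GCC/Clang-style ones.
// Note: names with double underscores are reserved for the implementation;
// defining them is a common compatibility workaround, not standard practice.
#if defined(_MSC_VER) && !defined(__clang__)
#  if defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#    ifndef __SSE2__
#      define __SSE2__ 1
#    endif
#  endif
#  if defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#    ifndef __SSE__
#      define __SSE__ 1
#    endif
#  endif
#endif

// SSE2 implies SSE, so make sure the SSE macro is present whenever SSE2 is.
#if defined(__SSE2__) && !defined(__SSE__)
#  define __SSE__ 1
#endif

// Hypothetical helper: reports which branch macro-based dispatch would take.
std::string simd_branch() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}

// Hypothetical helper: true unless SSE2 is visible without SSE.
bool sse_macros_consistent() {
#if defined(__SSE2__) && !defined(__SSE__)
    return false;
#else
    return true;
#endif
}
```

Printing `simd_branch()` at start-up is a quick way to confirm that the compiler option actually reaches the SIMD code path.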

XuleiTao commented 4 years ago

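It works now, great. However, in my tests MIPP doesn't make a noticeable difference in VS. In all tests the templates have 128 feature points.

Code without MIPP: I added two OpenMP statements at gradient spreading and gradient response, which saved about 10 ms (130 ms -> 120 ms). In matchClass I used the parallel snippet you provided, which saved about 20 ms (30 ms -> 10 ms). Image: 2 MP (1600x1200); CPU: i7-6700; VS2015
The master code with MIPP, AVX2 enabled: gradient response takes about 110-120 ms and matching about 10 ms. Image: 2 MP (1600x1200); CPU: i7-6700; VS2015

On Linux, however, this runs fast: with padding=500 and more than 2 MP, the total time is about 80 ms. CPU: i7-8700. With the same parameters, VS2015 with AVX2 takes about 150 ms.

Then, with VS2017, AVX2, CPU: i5-6300, the same master code, and padding=500, it takes about 280 ms.

CPU: i3, VS2017, image: 2 MP. I compared the code with MIPP against the code without it. With MIPP and SSE2 enabled it takes about 300-400 ms; without MIPP it also takes about 300 ms, slightly faster on average.

Then the fusion code: (1) 2 MP image, VS2017, SSE2, CPU: i3: about 100-110 ms with AVX2 on or off. (2) 2 MP image, VS2015, CPU: i7-6700: about 80 ms with AVX2 on or off.

The environments are a bit mixed, but in VS MIPP gives almost no speedup, while on Linux the speedup is clear. According to the MIPP notes, do I need to upgrade to VS2019? "On msvc 14.10 (Microsoft Visual Studio 2017), the performances are reduced compared to the other compilers, the compiler is not able to fully inline all the MIPP methods. This has been fixed on msvc 14.21 (Microsoft Visual Studio 2019) and now you can expect high performances."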
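For reference, adding an OpenMP statement to a per-row stage, as mentioned above for gradient spreading and response, typically looks like the sketch below. It is an assumption-laden illustration rather than the commenter's code: `spread_rows` is a hypothetical stand-in that ORs each pixel with its T-1 right-hand neighbours.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-row stage: dst(r, c) = OR of src(r, c .. c+T-1), clipped at the row end.
void spread_rows(const std::vector<uint8_t>& src, std::vector<uint8_t>& dst,
                 int rows, int cols, int T) {
    // Rows are independent, so the iterations can be shared among threads.
    // Without -fopenmp (GCC/Clang) or /openmp (MSVC) the pragma is ignored
    // and the loop simply runs serially.
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            uint8_t v = 0;
            for (int k = 0; k < T && c + k < cols; ++k)
                v |= src[static_cast<std::size_t>(r) * cols + c + k];
            dst[static_cast<std::size_t>(r) * cols + c] = v;
        }
    }
}
```

The loop index is a signed `int` because the OpenMP 2.0 support in MSVC requires a signed loop variable.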

meiqua commented 4 years ago

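MIPP shouldn't give much of a speedup over the original SSE implementation; it was added so the code can run on ARM. Being somewhat faster on Linux is plausible: first, OpenCV's speed may differ between versions and build options; second, inlining may be done better, as mentioned here.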

XuleiTao commented 4 years ago

I see. The fusion code takes about 70-80 ms on a 2 MP image; CPU: i7, OpenCV: 3.4.6. Is that normal?

meiqua commented 4 years ago

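That's not normal; it takes 20 ms for me on Ubuntu 16.04 with an i7. You can change this line to false to turn MIPP off first and see whether it is an inlining problem. With MIPP off I get about 40 ms.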

XuleiTao commented 4 years ago

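With the bundled image and padding=500, it also takes 20 ms on Ubuntu 16.04 with an i7, and about 50 ms with MIPP off. In VS2015 it takes about 60-70 ms with MIPP on or off. The CPU this time is an i7-8700, which is a bit faster than the 80 ms on the earlier i7-6700. Could it be a VS problem? Do I need to upgrade VS?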

meiqua commented 4 years ago

It looks that way. Since the fusion code doesn't call OpenCV, the likely cause is insufficient compiler optimization.

XuleiTao commented 4 years ago

OK, I'll find a machine with VS2017 installed and try it later. Thanks a lot.

XuleiTao commented 4 years ago

VS2017 does improve the speed. It seems VS2015 doesn't support MIPP well either.

zzqusst commented 2 years ago

For multiple template instances within a single image, you need to add cv_dnn_nms::NMSBoxes, set the overlap threshold, and then run ICP registration.
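As a rough picture of what that non-maximum suppression step does, here is a self-contained greedy NMS sketch. It is a stand-in for illustration, not the implementation or signature of cv_dnn_nms::NMSBoxes: `Box`, `iou`, and `nms` are hypothetical names, and the thresholds in the usage note are arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Box { float x, y, w, h; };  // hypothetical axis-aligned match rectangle

// Intersection over union of two boxes.
static float iou(const Box& a, const Box& b) {
    float x1 = std::max(a.x, b.x), y1 = std::max(a.y, b.y);
    float x2 = std::min(a.x + a.w, b.x + b.w), y2 = std::min(a.y + a.h, b.y + b.h);
    float inter = std::max(0.f, x2 - x1) * std::max(0.f, y2 - y1);
    float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Greedy NMS: visit boxes from highest to lowest score and keep a box only if
// it does not overlap an already kept box by more than iou_thresh.
// Returns the indices of the kept boxes, best score first.
std::vector<int> nms(const std::vector<Box>& boxes, const std::vector<float>& scores,
                     float score_thresh, float iou_thresh) {
    std::vector<int> order;
    for (int i = 0; i < static_cast<int>(boxes.size()); ++i)
        if (scores[i] >= score_thresh) order.push_back(i);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return scores[a] > scores[b]; });

    std::vector<int> keep;
    for (int idx : order) {
        bool suppressed = false;
        for (int k : keep) {
            if (iou(boxes[idx], boxes[k]) > iou_thresh) { suppressed = true; break; }
        }
        if (!suppressed) keep.push_back(idx);
    }
    return keep;
}
```

For example, with boxes built from each match's position and template size and the similarity as the score, `nms(boxes, scores, 90.f, 0.5f)` would leave one match per object instance, and each survivor can then be refined with ICP.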

wiekern commented 2 years ago

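Test image: 1200x1200
Training: padding=100, angles in [-60,60] at one-degree steps for 121 templates in total, a single scale (line2Dup::Detector detector(128, {4});)
Test: padding=250, top-1 only, stride=16
CPU: Intel Xeon E3-1270, supports the AVX2 instruction set
OS: Win11
Build environment: Qt Creator (I added a #pragma message print after #define SSE2 and can see that this branch is taken at compile time, so SSE2 is enabled), Qt_6_2_4_MinGW_64; the default release build enables O2 optimization (the build output shows g++ -c -fno-keep-inline-dllexport -O2)
Branch: fusion_fix_memo

The timings are below, roughly 250 ms in total, which is far from the 70-80 ms for a 2 MP image mentioned above. Which setting did I get wrong? Any advice is appreciated, thanks!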

----------thread 1---------
bgr2gray: 2.3253ms
gauss1x5: 8.8146ms
gauss5x1: 8.4448ms
sobel1x3_sxx_syx: 1.5282ms
sobel3x1_sxy_syy: 1.4955ms
mag_phase_quant1x1: 15.6051ms
hist3x3: 47.7554ms
spread1xn: 0.595ms
spreadnx1: 1.6778ms
response1x1: 1.9622ms
linearizeTxT: 17.3772ms
-----------------------------------------
fusion time
elasped time:0.114451s

match time
elasped time:0.138171s

wiekern commented 2 years ago

I ran the test program on the fusion_by_hand branch; the results are below. The first fusion time printed is much longer than the second:

MIPP tests
----------

Instr. type:       SSE
Instr. full type:  SSE3
Instr. version:    3
Instr. size:       128 bits
Instr. lanes:      1
64-bit support:    yes
Byte/word support: yes
in this SIMD, int8 max is not inplemented by MIPP
in this SIMD, int8 shuff is not inplemented by MIPP
----------

test img size: 2356800

fusion time
elasped time:0.100045s

fusion time
elasped time:0.0262209s

match time
elasped time:0.027269s

match total time
elasped time:0.156801s

matches.size(): 7

match.template_id: 340
match.similarity: 100

DennisLiu1993 commented 2 years ago

@wiekern @zzqusst @XuleiTao You can take a look at my GitHub. It has an alternative to shape-based matching that can replace it in some application areas: https://github.com/DennisLiu1993/Fastest_Image_Pattern_Matching

meiqua / shape_based_matching