Open JulienFleuret opened 5 years ago
Label: performance
> code executed in debug mode
> Optimize in release mode only.
You missed the point. I did do the optimization: I followed points 1-3 directly and I optimized the code using manual instructions, e.g. v_float32, #if CV_ENABLE_UNROLL, etc. Concerning "Optimize in release mode only": I passed -Ofast only in release mode. The processing times are indicative and, except for the one I gave in release mode, they only concern the modifications I proposed in comparison with the original code.
@JSharp4273 you should create a pull request. Makes it easier to review the changes and to integrate them.
System information (version)
Detailed description
Hello, I recently had to work with both the VGG and BoostDesc image descriptors. I had to compute descriptors for a huge number of keypoints (several hundreds of thousands) for each image of PASCAL VOC 2012; after one week my algorithm crashed, and I observed that only about one hundred images had been processed in seven days.
So I started looking into whether I could improve the code. Here are some observations:
In both files the keypoint descriptor computation makes a copy of the keypoints. If the keypoints were stored in the former cv::Vector data structure this copy would have cost nothing (it was reference-counted), but with a std::vector data structure it is a deep copy.
Is it possible to change the arguments of the class ComputeVGGInvoker from:

to:
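The from/to snippets did not survive above. As a minimal sketch of the intended change (a reduced, hypothetical ComputeVGGInvoker with a stand-in KeyPoint type, not the library's actual class): holding the keypoints by const reference instead of by value removes the deep copy.

```cpp
#include <cassert>
#include <vector>

// Stand-in for cv::KeyPoint, only to keep the sketch self-contained.
struct KeyPoint { float x, y, size; };

// Hypothetical reduced ComputeVGGInvoker: storing the keypoints as a
// const reference instead of a std::vector member avoids the deep copy
// made when the invoker is constructed.
struct ComputeVGGInvoker
{
    explicit ComputeVGGInvoker(const std::vector<KeyPoint>& kpts)
        : keypoints(kpts) {}            // binds to the caller's vector, no copy

    const std::vector<KeyPoint>& keypoints;
};
```

A reference member is safe in this pattern because cv::parallel_for_ is synchronous, so the caller's vector outlives the invoker.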
In the operator() override of the class ComputeVGGInvoker, the one-before-last operation reshapes the descriptor into a matrix with 4096 rows and 1 column:

Desc = Desc.reshape( 1, (int)Desc.total() );

The operation is immediately followed by a matrix multiplication of the transpose of the result with the transpose of the member variable Proj:

descriptors->row( k ) = Desc.t() * Proj.t();

The transposition of Desc in the last line can be avoided by reshaping the descriptor into a matrix of one row and 4096 columns. Another observation is that the variable Proj is always used transposed and never as-is, so it could be transposed once up front. So the two last lines should become:
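The proposed replacement lines did not survive above. As a plain-array sketch of the algebra (projectRow, the row-major layout and the D x N shape of Proj are assumptions, and the vectors stand in for cv::Mat): once the descriptor is a 1 x N row, the projected row is just desc * Proj^T, with no per-keypoint transpose of Desc.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-array sketch of the proposed change: with the descriptor reshaped
// to one row of N values, descriptors->row(k) can be computed as
// desc * Proj^T directly, without transposing desc first.
// proj is D x N, row-major; the result is the 1 x D projected descriptor.
std::vector<float> projectRow(const std::vector<float>& desc,   // 1 x N
                              const std::vector<float>& proj,   // D x N
                              std::size_t D)
{
    const std::size_t N = desc.size();
    std::vector<float> out(D, 0.f);
    for (std::size_t d = 0; d < D; ++d)
        for (std::size_t n = 0; n < N; ++n)
            out[d] += desc[n] * proj[d * N + n];   // row(desc) . row_d(Proj)
    return out;
}
```

Indexing the rows of Proj directly, as here, has the same effect as pre-transposing it once: the per-keypoint Proj.t() disappears as well.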
I also observed that in the function get_desc there are a lot of transpositions that can be avoided by simply saving the data in the patch at the transposed place (x,y rather than y,x). Also, the function get_patch evaluates 4096 times a condition that can be determined before the loop even starts, which is a lot of useless work. I propose to rewrite the get_patch function as:

// sample 64x64 patch from image given keypoint
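The rest of the proposed rewrite was lost above. As a hedged stand-alone sketch of the hoisting idea (the Image type, the clamping border policy and the signature are assumptions, not the library's actual get_patch):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in image type, just to keep the sketch self-contained.
struct Image {
    int rows, cols;
    std::vector<float> data;
    float at(int y, int x) const { return data[y * cols + x]; }
};

// Hedged sketch of the hoisting idea: decide once, before the loops, whether
// the 64x64 window lies fully inside the image, instead of testing a border
// condition for each of the 4096 pixels. Pixels are stored at the transposed
// place (x, y) as proposed above.
void get_patch(const Image& img, std::vector<float>& patch, int px, int py)
{
    const int P = 64;
    patch.assign(P * P, 0.f);
    const bool inside = px >= 0 && py >= 0 &&
                        px + P <= img.cols && py + P <= img.rows;
    if (inside) {
        // fast path: no per-pixel test at all
        for (int y = 0; y < P; ++y)
            for (int x = 0; x < P; ++x)
                patch[x * P + y] = img.at(py + y, px + x);
    } else {
        // slow path: clamp coordinates at the image border
        for (int y = 0; y < P; ++y)
            for (int x = 0; x < P; ++x) {
                const int sy = std::max(0, std::min(py + y, img.rows - 1));
                const int sx = std::max(0, std::min(px + x, img.cols - 1));
                patch[x * P + y] = img.at(sy, sx);
            }
    }
}
```

The common case (a keypoint away from the border) then runs a completely branch-free inner loop.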
In the function get_desc, the computation of the variable GAngleRatio can be done in two steps. The first step simply computes the atan2 pixel-wise; the second step then computes GMag, GAngleRatio, w1, w2, Bin1 and Bin2 in one loop. The advantage of doing it this way is that, in the first step, some unrolling can be done to help the compiler invoke the auto-vectorizer on architectures that have vectorized trigonometric instructions, and otherwise it helps the compiler optimize cache usage. The second benefit is that the second step can be vectorized manually, with everything aligned.

The order of the last loops should also be inverted. With anglebins as the first loop and the length of the patch as the second loop, every element of w1, w2, GMagT, Bin1T and Bin2T is evaluated anglebins (8) times. By inverting the two loops, each element of each variable is evaluated only once.

So I propose to modify the function get_desc to the following:

Steps to reproduce
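The proposed get_desc rewrite did not survive above. As a stand-alone sketch of its two ideas (accumulate_hist, the gradient inputs and the bin arithmetic are assumptions modeled on the description, not the library's actual code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hedged sketch: step 1 computes only atan2, pixel-wise, in its own tight
// loop; step 2 derives the magnitude (GMag), the angle ratio (GAngleRatio),
// the two bin indices (Bin1, Bin2) and their weights (w1, w2) in one pass,
// with the pixel loop outermost so each element is computed exactly once
// instead of once per angle bin.
void accumulate_hist(const std::vector<float>& gx,
                     const std::vector<float>& gy,
                     int anglebins, std::vector<float>& hist)
{
    const float pi = 3.14159265358979f;
    const std::size_t n = gx.size();

    // step 1: trigonometry only
    std::vector<float> angle(n);
    for (std::size_t i = 0; i < n; ++i)
        angle[i] = std::atan2(gy[i], gx[i]);

    // step 2: everything else, pixels in the outer (and only) loop
    hist.assign(anglebins, 0.f);
    for (std::size_t i = 0; i < n; ++i) {
        float mag   = std::sqrt(gx[i] * gx[i] + gy[i] * gy[i]);
        float ratio = (angle[i] + pi) / (2.f * pi);   // in [0, 1]
        float pos   = ratio * anglebins;
        int   lo    = static_cast<int>(pos);
        float w2    = pos - lo;                       // weight of upper bin
        float w1    = 1.f - w2;                       // weight of lower bin
        int   bin1  = lo % anglebins;
        int   bin2  = (bin1 + 1) % anglebins;
        hist[bin1] += w1 * mag;
        hist[bin2] += w2 * mag;
    }
}
```

With the bin loop outermost instead, every magnitude, ratio and weight above would be recomputed anglebins times; inverting the loops is what removes that redundancy.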
During the tests I conducted: without modification, the vgg call from the locally compiled library took 30 s to process the keypoints; the code executed in debug mode before any modification took around 80 s; after modifications it took between 39 and 40 s in debug mode and around 27 s in release mode.
I did not provide ximgproc_improved.h because of some work currently in progress on it. For the following example it only contains the declaration of the class "fastVGG", which is identical to that of VGG. The image is from Pascal VOC; the mask is an adaptation of the data from Pascal Part.