komrad36 / LATCH

Fastest CPU implementation of the LATCH 512-bit binary feature descriptor; fully scale- and rotation-invariant
MIT License
34 stars 12 forks source link

Descriptor output not as expected? #4

Closed kentsommer closed 7 years ago

kentsommer commented 7 years ago

Perhaps I'm doing something wrong, but the output from latch seems to be odd / incorrect.

Terminal output (first two desc):

0000000000000000000000000000000000000000111010101010111010000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000100000
0000000000000000000000000000000000000000000000000000000010110001
0110001101101001011001100110100101110100011100100110010101000011
0000000001110100011100110110100101001100011001010111010001100001
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000

0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000

Visualization (imshow output keypoints are detected consistently): viz

This is a minimal reproduction example (uses KFAST for keypoint, and LATCH for descriptor):

#define VC_EXTRALEAN
#define WIN32_LEAN_AND_MEAN

#include <iostream>
#include <chrono>
#include <opencv2/opencv.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>
#include <bitset>

#include "KFAST.h"
#include "LATCH.h"

int main() {
    // -------- Video Capture ----------
    cv::VideoCapture cap(0); // open the default camera
    if(!cap.isOpened()) {
        std::cout << "Could not open camera... " << std::endl;
        return -1;
    }
    while(true)
    {
        cv::Mat color_frame;
        cv::Mat image;
        cv::Mat kps_frame;
        // get a new frame from camera
        cap >> color_frame;
        // convert to grayscale
        cv::cvtColor(color_frame, image, cv::COLOR_BGR2GRAY);

        // START KFAST
        // ------------- Configuration ------------
        constexpr bool nonmax_suppress = true;
        constexpr auto thresh = 40;
        constexpr bool KFAST_multithread = true;
        // --------------------------------
        std::vector<Keypoint> KFAST_kps;
        std::vector<cv::KeyPoint> cv_kps;
        KFAST<KFAST_multithread, nonmax_suppress>(image.data, image.cols, image.rows, static_cast<int>(image.step), KFAST_kps, thresh);
        for (const auto& kp : KFAST_kps) cv_kps.emplace_back(static_cast<float>(kp.x), static_cast<float>(kp.y), 0.0f, 0.0f, static_cast<float>(kp.score));
        // END KFAST

        // START LATCH
        // ------------- Configuration ------------
        constexpr bool multithread = true;
        // --------------------------------
        std::vector<KeyPoint> kps;
        for (auto&& kp : cv_kps) kps.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);
        uint64_t* desc = new uint64_t[8 * kps.size()];
        LATCH<multithread>(image.data, image.cols, image.rows, static_cast<int>(image.step), kps, desc);

        for (int i = 0; i < 8; ++i) std::cout << std::bitset<64>(desc[i]) << std::endl;
        std::cout << "\n" << std::endl;
        // END LATCH
        cv::drawKeypoints(image, cv_kps, kps_frame, {255.0, 0.0, 0.0});
        // --------------------------------

        cv::namedWindow("KFL", CV_WINDOW_NORMAL);
        cv::imshow("KFL", kps_frame);
        if(cv::waitKey(30) >= 0) break;
    }
}
komrad36 commented 7 years ago

Hi, thanks for providing such comprehensive information! I'll take a look at this as soon as I get a chance, hopefully within 24 hours.

kentsommer commented 7 years ago

@komrad36

Seems it is due to some issue with the way I'm using KFAST perhaps...

If I swap out KFAST for ORB, everything seems to work fine (first two outputs, the next frames are all similar). That being said... I'd still like to use KFAST if at all possible, I can get a roughly 2x speedup per frame which for my intended application would be hugely beneficial.

0011100010000010011100010100100110001101100000011110000111110001
0011111110101000010010011001011011000001111110100010101010111010
1110001101111000110100001100000101001000101001010101011110000000
0001101010100100000011000000000001100101000000011001010001110001
0101111100100110100010000010000000010110100001010101100001101000
0110001000111110010100010000010100001001001111101000000000000001
1110101001000110001000000110010001001010010000000000011010101010
0000011100110000111000110000000111001111010000000010010011100000

1101111010000010111100100110110110000001001000000011111001110010
0011100001101110000001110100000010001011001100011011001000111110
1001010111111100010010011110110000000101100000010000001011101001
0110101010010010111010110001101000001001000100100010010001011100
1101010001010011100010011001111101110101111000000111001000111011
0111111000001000101001011111101100011101001000011000001111100010
1100000000000000000000011010011010010110001100001110011001110100
0111100110110010101110011111000111001111000011010110000000010011

I made the following modification to achieve the output shown above:


        // // START KFAST
        // // ------------- Configuration ------------
        // constexpr bool nonmax_suppress = true;
        // constexpr auto thresh = 40;
        // constexpr bool KFAST_multithread = true;
        // // --------------------------------
        // std::vector<Keypoint> KFAST_kps;
        // std::vector<cv::KeyPoint> cv_kps;
        // KFAST<KFAST_multithread, nonmax_suppress>(image.data, image.cols, image.rows, static_cast<int>(image.step), KFAST_kps, thresh);
        // for (const auto& kp : KFAST_kps) cv_kps.emplace_back(static_cast<float>(kp.x), static_cast<float>(kp.y), 0.0f, 0.0f, static_cast<float>(kp.score));
        // // END KFAST

        // START ORB
        constexpr int numkps = 5000;
        cv::Ptr<cv::ORB> orb = cv::ORB::create(numkps, 1.2f, 8, 31, 0, 2, cv::ORB::HARRIS_SCORE, 31, 20);
        std::vector<cv::KeyPoint> cv_kps;
        orb->detect(image, cv_kps);
        // END ORB
kentsommer commented 7 years ago

@komrad36

I've fixed the issue. I think it had to do with how I was converting the keypoints into the cv::KeyPoint style. Using the small modifications you made to KFAST for KORAL I was able to get it up and running:

So in the end this is now working correctly :+1: :

#define VC_EXTRALEAN
#define WIN32_LEAN_AND_MEAN

#include <iostream>
#include <chrono>
#include <opencv2/opencv.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>
#include <bitset>

#include "KFAST.h"
#include "LATCH.h"

int main() {
    // -------- Video Capture ----------
    cv::VideoCapture cap(0); // open the default camera
    if(!cap.isOpened()) {
        std::cout << "Could not open camera... " << std::endl;
        return -1;
    }
    while(true)
    {
        cv::Mat color_frame;
        cv::Mat image;
        cv::Mat kps_frame;
        // get a new frame from camera
        cap >> color_frame;
        // convert to grayscale
        cv::cvtColor(color_frame, image, cv::COLOR_BGR2GRAY);

        // START KFAST
        // ------------- Configuration ------------
        constexpr bool nonmax_suppress = true;
        constexpr auto thresh = 40;
        constexpr bool KFAST_multithread = true;
        // --------------------------------
        std::vector<Keypoint> KFAST_kps;
        std::vector<cv::KeyPoint> cv_kps;
        KFAST<KFAST_multithread, nonmax_suppress>(image.data, image.cols, image.rows, static_cast<int>(image.step), KFAST_kps, thresh);
        for (const auto& kp : KFAST_kps) cv_kps.emplace_back(static_cast<float>(kp.x), static_cast<float>(kp.y), 14.0f, 180.0f/3.1415926535f*kp.angle, static_cast<float>(kp.score));
        // END KFAST

        // START LATCH
        // ------------- Configuration ------------
        constexpr bool multithread = true;
        // --------------------------------
        std::vector<KeyPoint> kps;
        for (auto&& kp : cv_kps) kps.emplace_back(kp.pt.x, kp.pt.y, kp.size, kp.angle * 3.14159265f / 180.0f);
        uint64_t* desc = new uint64_t[8 * kps.size()];
        LATCH<multithread>(image.data, image.cols, image.rows, static_cast<int>(image.step), kps, desc);

        for (int i = 0; i < 8; ++i) std::cout << std::bitset<64>(desc[i]) << std::endl;
        std::cout << "\n" << std::endl;
        // END LATCH
        cv::drawKeypoints(image, cv_kps, kps_frame, {255.0, 0.0, 0.0});
        // --------------------------------

        cv::namedWindow("KFL", CV_WINDOW_NORMAL);
        cv::imshow("KFL", kps_frame);
        if(cv::waitKey(30) >= 0) break;
    }
}
komrad36 commented 7 years ago

You beat me to it! Thanks for sharing your findings, and it's super cool that you used KORAL to assist in debugging. It looks like the original code was creating OpenCV keypoints with a feature size of 0, and that was confusing LATCH, which was trying to scale the keypoints accordingly.

I really should unify the KFAST and LATCH keypoints... I definitely don't want you to have to copy the vector for use in LATCH. But I'm glad it's working and it's fast! Please feel free to reach out with any other issues or feature requests. The goal is absolutely to be super fast on AVX2-capable processors. Note that the upright version (ULATCH) is faster if you don't need the rotation invariance, and the CUDA versions (CLATCH, CUDAK2NN) are way faster still if you have an NVIDIA GPU.

kentsommer commented 7 years ago

@komrad36

Actually I did run into something I was hoping you might know off the top of your head.

The LATCH descriptors are set as uint64_t* _desc = new uint64_t[8 * l_kps.size()]; but I'd like to output std::vector<uint64_t> desc instead.

I tried std::vector<uint64_t> tdesc(_desc, _desc + sizeof _desc / sizeof _desc[0]);

(If it is helpful I can put up a repo of the example with everything ready to build)

However it keeps causing a segfault... I also tried a standard memcpy but that seems to segfault as well. Unfortunately c++ is not my strongest language so perhaps I'm overlooking something simple.

Thank you a ton for all of your work on efficient implementations of all of these algorithms though. I'm honestly really surprised I'd never seen your implementations until a few days ago as they are by far the fastest I've seen anywhere.

komrad36 commented 7 years ago

Ah. Okay, you have a couple options. You could just copy the output array into a vector. That looks like what you are trying to do above, but it seems like an unnecessary copy, so I don't recommend it. For completeness, though, your approach (using the iterator constructor) is almost correct. The problem is that when you do sizeof(desc), desc is a pointer, not an array, so sizeof(desc) is 4 or 8 (on a 32- or 64-bit system, respectively), and sizeof(desc[0]) is 8, so you'll create a vector with 0 or 1 elements, respectively. Try:

std::vector<uint64_t> tdesc(desc, desc + 8 * l_kps.size());

However, the faster approach might be to create a vector beforehand, nudge it into becoming large enough, then retrieve its underlying pointer and pass that to LATCH:

std::vector<uint64_t> tdesc(8 * l_kps.size());
// pass LATCH reinterpret_cast<uint64_t*>(tdesc.data()) instead of desc

Or equivalently:

std::vector<uint64_t> tdesc;
tdesc.resize(8 * l_kps.size());
// pass LATCH reinterpret_cast<uint64_t*>(tdesc.data()) instead of desc

This can even be optimized further, although I'd say just not using a vector is easier. The remaining optimization is that the vector constructor (or resize) zero-initializes all those integers, and to avoid that you can do the rather ugly:

struct uninitialized_uint64_t {
    uint64_t x;

    uninitialized_uint64_t() {}
};

std::vector<uninitialized_uint64_t> tdesc(8 * l_kps.size());
// pass LATCH reinterpret_cast<uint64_t*>(tdesc.data()) instead of desc

Hope that helps! I typed this on my phone so please excuse if some of the code isn't verbatim.

kentsommer commented 7 years ago

@komrad36

Thank you a bunch for all of the information and detailed explanations!

I've tried the above mentioned approaches and it seems like it still causes a segfault if I try to use the vectorized descriptor for anything:

Sample code:

std::vector<uint64_t> tdesc;
tdesc.resize(8 * l_kps.size());
LATCH<multithread>(image.data, image.cols, image.rows, static_cast<int>(image.step), l_kps, reinterpret_cast<uint64_t*>(tdesc.data()));

std::cout << "tdesc" << std::endl;
for (int i = 0; i < 8; ++i) std::cout << std::bitset<64>(tdesc[i]) << std::endl;
std::cout << "\n\n" << std::endl;

Valgrind output:

==18784== Memcheck, a memory error detector
==18784== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==18784== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==18784== Command: ./KFL
==18784== 
tdesc
==18784== Invalid read of size 8
==18784==    at 0x405463: main (in /home/kent/Documents/features/kfl/KFL)
==18784==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==18784== 
==18784== 
==18784== Process terminating with default action of signal 11 (SIGSEGV)
==18784==  Access not within mapped region at address 0x0
==18784==    at 0x405463: main (in /home/kent/Documents/features/kfl/KFL)
==18784==  If you believe this happened as a result of a stack
==18784==  overflow in your program's main thread (unlikely but
==18784==  possible), you can try to increase the size of the
==18784==  main thread stack using the --main-stacksize= flag.
==18784==  The main thread stack size used in this run was 8388608.
==18784== 
==18784== HEAP SUMMARY:
==18784==     in use at exit: 19,267,206 bytes in 1,551 blocks
==18784==   total heap usage: 2,336 allocs, 785 frees, 21,188,104 bytes allocated
==18784== 
==18784== LEAK SUMMARY:
==18784==    definitely lost: 0 bytes in 0 blocks
==18784==    indirectly lost: 0 bytes in 0 blocks
==18784==      possibly lost: 18,817,143 bytes in 128 blocks
==18784==    still reachable: 450,063 bytes in 1,423 blocks
==18784==                       of which reachable via heuristic:
==18784==                         newarray           : 1,536 bytes in 16 blocks
==18784==         suppressed: 0 bytes in 0 blocks
==18784== Rerun with --leak-check=full to see details of leaked memory
==18784== 
==18784== For counts of detected and suppressed errors, rerun with: -v
==18784== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Killed
komrad36 commented 7 years ago

Interesting. Do you still get the problem if you comment out the LATCH call, so it's basically just creating a vector and then printing?

kentsommer commented 7 years ago

Odd, yes I do. I guess it is some issue on my end then :+1:

komrad36 commented 7 years ago

Is the kps vector empty? That is the only thing I can think of that would cause the prints to segfault.

kentsommer commented 7 years ago

Somehow all of them return 0 for kps.size(). A little baffling considering I can pass them out and draw them fine. Anyways, this gives me plenty to go on, thank you a ton for your help and patience!

komrad36 commented 7 years ago

No problem! Feel free to reach out with anything else. Follows and/or stars on interesting repos would be much appreciated!