komrad36 / ULATCH

Fastest CPU implementation of the LATCH 512-bit binary feature descriptor for computer vision (upright)
MIT License

neon code #1

Open kingvision opened 7 years ago

kingvision commented 7 years ago

Hi, do you plan to develop a NEON-based version of the code?

Thanks

komrad36 commented 7 years ago

Hi there! I don't have anything ARM to test on :( I'm an x86 and amd64 programmer. I write my SIMD stuff by writing the assembly I want and then contorting intrinsics until most compilers produce more-or-less acceptable equivalents - a weird and backward MO, but welcome to my life...

I would love to try to learn ARM and NEON and make an equivalent, but it wouldn't be fast since I don't have any hardware to test on, and I'd probably have to NEON-ify my whole computer vision pipeline before it would be a useful addition. So I'm not sure I can commit to that right now.

Are you familiar with NEON intrinsics? If you want to try to write a NEON version, I'm definitely available to explain what everything in the code does if it's ever unclear, or to help look for ways to restructure steps if some of the AVX doesn't quite match up with NEON. From memory, the only thing that might be tricky is VPMOVZXBD, a.k.a. _mm256_cvtepu8_epi32(). ULATCH isn't very complicated. LATCH is a lot worse, as is KFAST. We can walk through it if you'd like. Just let me know!

Thanks!
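
For anyone attempting the port, here is a rough sketch of how the VPMOVZXBD step mentioned above might map to NEON. It is not taken from ULATCH: the helper name widen_u8_to_u32 is hypothetical, and the code assumes the step only needs to zero-extend 8 unsigned bytes into 8 32-bit lanes. NEON has no single 8-bit to 32-bit widening instruction, so the sketch widens in two stages (8 to 16 bits, then 16 to 32 bits). A scalar fallback is included so it builds on any target.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper, not part of ULATCH: zero-extend 8 unsigned bytes
// at src into 8 unsigned 32-bit values at dst.
static void widen_u8_to_u32(const uint8_t* src, uint32_t* dst) {
    assert(src != nullptr && dst != nullptr);
#if defined(__AVX2__)
    // Load the low 8 bytes, then VPMOVZXBD widens them to 8 x 32-bit lanes.
    const __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    const __m256i wide = _mm256_cvtepu8_epi32(bytes);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), wide);
#elif defined(__ARM_NEON)
    // Stage 1: 8 x u8 -> 8 x u16.
    const uint8x8_t bytes = vld1_u8(src);
    const uint16x8_t w16 = vmovl_u8(bytes);
    // Stage 2: each half of 4 x u16 -> 4 x u32, stored as two 128-bit vectors.
    vst1q_u32(dst, vmovl_u16(vget_low_u16(w16)));
    vst1q_u32(dst + 4, vmovl_u16(vget_high_u16(w16)));
#else
    // Portable scalar fallback with the same result.
    for (int i = 0; i < 8; ++i) {
        dst[i] = src[i];
    }
#endif
}
```

Because NEON registers are 128 bits wide, the result occupies two uint32x4_t vectors instead of one 256-bit AVX2 register, so any code consuming the widened values would need to process both halves.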