andrewtran1995 opened this issue 7 years ago
Hi,
You're correct that the CUDA optimization targets thresholding and filtering. That was low-hanging fruit and pretty straightforward to implement. The rest would be somewhat to considerably more complex, and isn't something I can put any time into at the moment.
As for using Python, I assume that you're using the ArUco implementation in OpenCV and therefore OpenCV's Python bindings. Since my code is a fork of the original ArUco library, it has no Python bindings of its own. I'm not sure what the best approach is. The functions I ported to CUDA are not (I assume) part of the public ArUco API, so I don't know whether Python bindings even exist for them. I haven't used PyCUDA either, so I can't comment on that.
If it were up to me, I would use the code in my repository and write Python bindings for the functions you need. Then you could just call aruco as you do now and the CUDA part would happen automagically. I did that for another repository (akaze) and it seems to work OK. That requires some C++ knowledge, however, and uses Boost for the bindings.
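To make "automagically" concrete, the Python side would look something like this. It's a rough sketch only: pyaruco_cuda and its functions are hypothetical names for bindings you would have to write yourself (e.g. with Boost.Python), not something that exists today.

```python
import numpy as np
import pyaruco_cuda  # hypothetical module built with Boost.Python around my fork

frame = np.zeros((480, 640), dtype=np.uint8)   # stand-in for a grayscale camera frame
markers = pyaruco_cuda.detect(frame)           # thresholding/filtering would run on the GPU inside
for m in markers:
    print(m.id, m.corners)
```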
Sorry I couldn't be of more help
Hey Niklas,
Thanks so much for the super-prompt response! If you have more time, I have some questions I'd like to ask you.
Yes, I am indeed using the ArUco implementation in OpenCV. At this point, I have rewritten in Python some of the internal workings of detectMarkers that did not seem computationally expensive.
My knowledge of C++ is OK (school projects), though it has been a while since I last used it, and I have only heard of Boost but never had to use it directly myself. I'm hesitant to abandon the OpenCV ArUco implementation, and since I only need to parallelize certain parts of the marker detection algorithm, I may just write my own .cu files (using and acknowledging your work as a reference).
Do you think this is a feasible approach, or is there more infrastructure needed for these thresholding and filtering operations that I'm not aware of?
I am also considering more or less forking the OpenCV ArUco implementation and adding Python bindings for the private functions that detectMarkers uses, as a way of unit-testing the CUDA logic I write.
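For example, something like this, where my_cuda_bindings and cuda_threshold are hypothetical placeholders for whatever I end up exposing:

```python
import cv2
import numpy as np

from my_cuda_bindings import cuda_threshold  # hypothetical binding around my CUDA kernel

img = (np.random.rand(480, 640) * 255).astype(np.uint8)

# Reference result from the plain OpenCV CPU call.
_, expected = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)

# Result from the CUDA version; for a simple binary threshold they should match exactly.
actual = cuda_threshold(img, 128)
assert np.array_equal(expected, actual)
```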
Lastly (for now), in what further ways do you believe CUDA could be used in the marker detection or pose estimation parts of ArUco? Those are the two main features I will need at run time.
Hi again,
No matter which option you choose, it involves altering the ArUco code, and if I understand you correctly you're considering writing the higher-level functions that are in ArUco in Python instead. If you can access the necessary functions in ArUco through Python, I think that's a feasible option.
Adding CUDA means initializing some CUDA memory and then using the functions as I have done. I guess you can quite easily find where I call my functions in the original files. I don't know how transparent PyCUDA is, but the important parts besides the actual functions are the memory initialization and the copies between the CPU and the GPU, so it's not that much. Feel free to take whatever you need from my code.
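In PyCUDA terms (which, again, I haven't used, so treat this as a rough guess at the shape of it), the memory handling would look something like:

```python
import numpy as np
import pycuda.autoinit          # creates a CUDA context on the default device
import pycuda.driver as cuda

img = np.zeros((480, 640), dtype=np.uint8)   # stand-in for your camera frame
out = np.empty_like(img)

d_img = cuda.mem_alloc(img.nbytes)           # device buffers
d_out = cuda.mem_alloc(out.nbytes)

cuda.memcpy_htod(d_img, img)                 # CPU -> GPU
# ... launch whatever thresholding/filtering kernel you port, reading d_img and writing d_out ...
cuda.memcpy_dtoh(out, d_out)                 # GPU -> CPU
```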
Now, it's been some time since I wrote the code, but if I remember correctly the other main parts (time-wise) of the algorithm are finding the contours and approximating a polygon to those contours. The former isn't really worth trying to move to the GPU: connected-component algorithms are inherently hard to implement on parallel architectures. (Which platform are you using, by the way?) The latter could be moved to the GPU, but it requires some work.
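In OpenCV Python terms, the CPU-side steps I'm talking about are roughly these (a sketch, not the exact ArUco internals):

```python
import cv2
import numpy as np

# Synthetic binary image with one white square standing in for a marker.
img = np.zeros((480, 640), dtype=np.uint8)
cv2.rectangle(img, (200, 150), (320, 270), 255, -1)

# [-2] keeps this working across the different findContours return signatures.
contours = cv2.findContours(img, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)[-2]

candidates = []
for c in contours:
    approx = cv2.approxPolyDP(c, 0.05 * cv2.arcLength(c, True), True)
    if len(approx) == 4 and cv2.isContourConvex(approx):
        candidates.append(approx)   # convex quadrilaterals are the marker candidates
```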
Once the marker is detected, computing the pose is easy. What could be put on the GPU is the warp that is done on the image to compute the id of the marker. If I remember correctly, though, this accounts for only a small fraction of the computation anyway, so it might not be worth it. I suggest you profile it and see for yourself.
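If you do profile it, the warp I mean is roughly the following (the corner coordinates here are made up; in ArUco they come from the detected candidate):

```python
import time
import cv2
import numpy as np

img = (np.random.rand(480, 640) * 255).astype(np.uint8)

# Detected marker corners (made-up values) and the canonical square they map to.
corners = np.float32([[200, 150], [320, 150], [320, 270], [200, 270]])
canonical = np.float32([[0, 0], [63, 0], [63, 63], [0, 63]])

t0 = time.perf_counter()
M = cv2.getPerspectiveTransform(corners, canonical)
patch = cv2.warpPerspective(img, M, (64, 64))   # upright patch the id bits are read from
print("warp: %.3f ms" % ((time.perf_counter() - t0) * 1e3))
```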
Hey Niklas,
I am currently developing on Ubuntu 14.04 with an Nvidia GeForce GTX 970, with the Jetson TX1 as the target platform, in case you have any experience with that (or with developing on embedded systems in general).
I will take the directions you noted into consideration as I continue with this project, and I may contact you again later if I need help. Thank you so much for your input so far; it was very helpful for reaffirming or correcting my thoughts on how to proceed.
Hey,
I implemented the CUDA functions because I was running on a TX1 myself, so I hope it will work out for you. Good luck!
Hey,
I'm using OpenCV as a whole for my application, and I noticed there's a gpu module that has no Python bindings. Do you think it would be feasible to write a C++ function that wraps one of these functions (such as the GPU threshold) and call it from Python via ctypes or Boost? Or are there certain restrictions, especially on the Jetson TX1, that I'm not aware of?
Or would I already be calling the GPU-accelerated version of threshold if I have compiled OpenCV with the flag "WITH_CUDA=ON"?
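To be concrete about the first option, this is roughly what I have in mind on the Python side. The library name and the exported function are hypothetical; the C++ side would be a small extern "C" wrapper around the GPU threshold.

```python
import ctypes
import numpy as np

# Hypothetical shared library built from the C++ wrapper.
lib = ctypes.CDLL("./libgpu_threshold.so")
lib.gpu_threshold_u8.argtypes = [
    ctypes.POINTER(ctypes.c_ubyte),  # image data (modified in place)
    ctypes.c_int,                    # rows
    ctypes.c_int,                    # cols
    ctypes.c_ubyte,                  # threshold value
]
lib.gpu_threshold_u8.restype = None

img = (np.random.rand(480, 640) * 255).astype(np.uint8)
lib.gpu_threshold_u8(
    img.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte)),
    img.shape[0],
    img.shape[1],
    128,
)
```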
Hey,
I'm not intimately familiar with how OpenCV handles CUDA or Python. It's possible to write Python wrappers to call the CUDA functions I implemented; as I explained before, I used Boost to do that. The TX1 platform does not have any particular restrictions that I'm aware of (regarding what we're discussing here). I don't know, though, how you would use those functions from Python to replace the OpenCV functions inside the ArUco code.
I don't think the ArUco implementation in OpenCV is GPU accelerated, but you'd have to check. I'm not sure whether the function I implemented is available as a CUDA function in OpenCV either; you'd have to check that as well.
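One quick way to check what your build actually contains is to look at the configuration summary OpenCV exposes:

```python
import cv2

# Prints the CUDA-related lines of the build configuration OpenCV was compiled with.
info = cv2.getBuildInformation()
print("\n".join(line for line in info.splitlines() if "CUDA" in line))
```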
Hello,
I am currently working on a school senior project where we are hoping to use CUDA to accelerate our marker detection code. At the moment we are just calling the Python binding for ArUco's detectMarkers function, but we plan to deconstruct the method so that we can run parts of it on the GPU.
It seems that your CUBIN files mainly focus on parallelizing thresholding and filtering. Is that an accurate statement?
Would the proper approach for our project be to write .cu files, called from our Python code (e.g., through PyCUDA), that handle the thresholding and any filtering, with the rest handled in Python?
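Something along these lines is what we are picturing, where the CUBIN file and the kernel name are placeholders for whatever we end up writing:

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

# Load a precompiled CUBIN (e.g. built with `nvcc --cubin threshold.cu`) and grab a
# kernel from it; the file and kernel names here are placeholders, and the kernel is
# assumed to take (uchar* image, uchar threshold, int n).
mod = cuda.module_from_file("threshold.cubin")
threshold_u8 = mod.get_function("threshold_u8")

img = (np.random.rand(480, 640) * 255).astype(np.uint8)
d_img = cuda.mem_alloc(img.nbytes)
cuda.memcpy_htod(d_img, img)

block = 256
grid = (img.size + block - 1) // block
threshold_u8(d_img, np.uint8(128), np.int32(img.size),
             block=(block, 1, 1), grid=(grid, 1))

cuda.memcpy_dtoh(img, d_img)   # img now holds the thresholded result
```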
Thanks for your help!