Create CPU grid patterns that behave exactly like the currently implemented GPU grid patterns in terms of thread indexes.

The reasoning behind this is the following:

In some cases, we need the same functionality we have on the GPU to be executed on the CPU, to reduce PCIe transfers.
We are implementing some GPU functionality that is not available in OpenCV, so there is no OpenCV CPU equivalent.
Having exactly the same thread indexes on the CPU as on the GPU can ease debugging in some situations, since GPU debugging does not keep all the values for all the variables.

So, we simply want to create CPU-based emulators of the GPU GridPatterns.
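To make the idea concrete, here is a minimal sketch of how such an emulator could reproduce the GPU thread indexes with nested loops. All names (`Dim3`, `ThreadCtx`, `launchOnCpu`, `writeGlobalIndexes`) are hypothetical and are not part of the current library; the sketch only assumes the usual CUDA index formula `blockIdx * blockDim + threadIdx`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for CUDA's dim3 (not the library's type).
struct Dim3 {
    unsigned x, y, z;
    Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1) : x(x_), y(y_), z(z_) {}
};

// Hypothetical per-thread context mirroring the CUDA built-in variables.
struct ThreadCtx {
    Dim3 blockIdx, threadIdx, blockDim, gridDim;
};

// Same formulas a GPU kernel would use for its global coordinates.
inline unsigned globalX(const ThreadCtx& c) { return c.blockIdx.x * c.blockDim.x + c.threadIdx.x; }
inline unsigned globalY(const ThreadCtx& c) { return c.blockIdx.y * c.blockDim.y + c.threadIdx.y; }

// Sequentially visits every (block, thread) pair of the emulated launch.
// The visiting order is an arbitrary choice; a GPU gives no ordering guarantee.
template <typename Kernel>
void launchOnCpu(Dim3 grid, Dim3 block, Kernel kernel) {
    assert(grid.x > 0 && grid.y > 0 && grid.z > 0);
    assert(block.x > 0 && block.y > 0 && block.z > 0);
    for (unsigned bz = 0; bz < grid.z; ++bz)
    for (unsigned by = 0; by < grid.y; ++by)
    for (unsigned bx = 0; bx < grid.x; ++bx)
    for (unsigned tz = 0; tz < block.z; ++tz)
    for (unsigned ty = 0; ty < block.y; ++ty)
    for (unsigned tx = 0; tx < block.x; ++tx) {
        ThreadCtx ctx{Dim3(bx, by, bz), Dim3(tx, ty, tz), block, grid};
        kernel(ctx);
    }
}

// Example "kernel": each emulated thread writes its own linear pixel index,
// with the same bounds check a GPU kernel would need.
std::vector<unsigned> writeGlobalIndexes(unsigned width, unsigned height, Dim3 block) {
    std::vector<unsigned> out(static_cast<std::size_t>(width) * height, ~0u);
    Dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    launchOnCpu(grid, block, [&](const ThreadCtx& t) {
        const unsigned x = globalX(t);
        const unsigned y = globalY(t);
        if (x < width && y < height) out[y * width + x] = y * width + x;
    });
    return out;
}
```

Because the loop variables play the role of `blockIdx` and `threadIdx`, a breakpoint inside the lambda shows the same index values the corresponding GPU thread would have.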

Evaluate the use of std::async to parallelize the for loops and get better performance. Since we are doing kernel fusion, we will not need to keep threads alive all the time: the idea is to fuse all the operations the user needs into a single operation, so the thread creation overhead will be justified, much as with OpenMP. Additionally, some implementations of std::async take the number of CPUs in the system into account and manage a thread pool accordingly.
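As a starting point for that evaluation, the following sketch splits the emulated blocks across a handful of `std::async` tasks. The helper names (`parallelForBlocks`, `fusedMulAdd`) are hypothetical, and chunking by `std::thread::hardware_concurrency()` is just one possible policy, not a measured recommendation.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <thread>
#include <vector>

// Runs blockFn(blockIdx) for every block, distributing contiguous chunks of
// blocks over std::async tasks. blockFn must be safe to call concurrently.
template <typename BlockFn>
void parallelForBlocks(unsigned numBlocks, BlockFn blockFn) {
    if (numBlocks == 0) return;
    const unsigned hw = std::thread::hardware_concurrency();  // may return 0
    const unsigned numTasks = std::min(hw == 0 ? 1u : hw, numBlocks);
    const unsigned chunk = (numBlocks + numTasks - 1) / numTasks;

    std::vector<std::future<void>> tasks;
    for (unsigned begin = 0; begin < numBlocks; begin += chunk) {
        const unsigned end = std::min(begin + chunk, numBlocks);
        tasks.push_back(std::async(std::launch::async, [=, &blockFn] {
            for (unsigned b = begin; b < end; ++b) blockFn(b);
        }));
    }
    for (auto& t : tasks) t.get();  // wait for completion, rethrow exceptions
}

// Example of a "fused" operation: multiply and add in a single pass, so the
// task creation cost is paid once for both operations.
std::vector<float> fusedMulAdd(const std::vector<float>& in, float mul, float add,
                               unsigned blockDim) {
    assert(blockDim > 0);
    std::vector<float> out(in.size());
    const unsigned n = static_cast<unsigned>(in.size());
    const unsigned gridDim = (n + blockDim - 1) / blockDim;
    parallelForBlocks(gridDim, [&](unsigned blockIdx) {
        for (unsigned threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            const unsigned i = blockIdx * blockDim + threadIdx;  // GPU-style index
            if (i < n) out[i] = in[i] * mul + add;
        }
    });
    return out;
}
```

Each task writes to a disjoint range of `out`, so no synchronization is needed beyond the final `get()`. On some toolchains, linking code that uses `std::async` requires `-pthread`.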

This issue will define a rule on which functions will be __host__, which ones will be __device__, and which ones will be __host__ __device__.

__host__: It will not change much. Basically, any host code. There is no need to use it unless you want to explicitly differentiate it from closely related code that might be __host__ __device__ or __device__. For instance, it could be used to decorate CPU GridPatterns, but it is not necessary.

__device__: Mainly, only the GPU GridPatterns should be exclusively __device__.

__host__ __device__: This should be applied to every single Operation and every single Metaprogramming utility.
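The rule above could look roughly like the sketch below. Every name in it (`EX_HOST`, `EX_HOST_DEVICE`, `MulBy2`, `Add1`, `Compose`, `CpuGridPattern1D`, `GpuGridPattern1D`) is hypothetical and does not reflect the library's actual API. The macros are an assumption added so the same file also compiles with a plain C++ compiler, where the CUDA qualifiers do not exist.

```cpp
#include <cassert>

// Assumed helper macros: expand to the CUDA qualifiers only under nvcc.
#ifdef __CUDACC__
#define EX_HOST __host__
#define EX_HOST_DEVICE __host__ __device__
#else
#define EX_HOST
#define EX_HOST_DEVICE
#endif

// Operations: __host__ __device__, so CPU and GPU grid patterns can share them.
struct MulBy2 {
    EX_HOST_DEVICE static float exec(float v) { return v * 2.0f; }
};
struct Add1 {
    EX_HOST_DEVICE static float exec(float v) { return v + 1.0f; }
};

// Metaprogramming utility: also __host__ __device__.
template <typename First, typename Second>
struct Compose {
    EX_HOST_DEVICE static float exec(float v) { return Second::exec(First::exec(v)); }
};

// CPU grid pattern: host only. The explicit qualifier is optional.
struct CpuGridPattern1D {
    template <typename Op>
    EX_HOST static void exec(const float* in, float* out, unsigned n) {
        assert(n == 0 || (in != nullptr && out != nullptr));
        for (unsigned i = 0; i < n; ++i) out[i] = Op::exec(in[i]);
    }
};

#ifdef __CUDACC__
// GPU grid pattern: exclusively __device__, meant to be called from a kernel.
struct GpuGridPattern1D {
    template <typename Op>
    __device__ static void exec(const float* in, float* out, unsigned n) {
        const unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = Op::exec(in[i]);
    }
};
#endif
```

With this split, `CpuGridPattern1D::exec<Compose<MulBy2, Add1>>(in, out, n)` runs the same fused Operation on the CPU that `GpuGridPattern1D` would run inside a kernel.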

morousg / cvGPUSpeedup, issue #86