kunzmi / managedCuda

ManagedCUDA aims at an easy integration of NVIDIA's CUDA in .NET applications written in C#, Visual Basic or any other .NET language.

Slow Template Match #80

Open serjl opened 4 years ago

serjl commented 4 years ago

Hi @kunzmi, Thanks again for the great wrapper. I wrote a pretty standard function for the pattern matching:

public (double MinDistance, Point MinLocation) PatternMatchL2Normed(NPPImage_8uC1 deviceSourceBuffer, NPPImage_8uC1 devicePatternBuffer)
        {
            int distBufWidth = deviceSourceBuffer.Width - devicePatternBuffer.Width + 1;
            int distBufHeight = deviceSourceBuffer.Height - devicePatternBuffer.Height + 1;
            NPPImage_32fC1 deviceDistancesBuffer = new NPPImage_32fC1(distBufWidth, distBufHeight);
            deviceSourceBuffer.SqrDistanceValid_Norm(devicePatternBuffer, deviceDistancesBuffer);

            CudaDeviceVariable<float> deviceMinDistance = new CudaDeviceVariable<float>(1);
            CudaDeviceVariable<int> deviceMinLocX = new CudaDeviceVariable<int>(1);
            CudaDeviceVariable<int> deviceMinLocY = new CudaDeviceVariable<int>(1);

            int minBufferHostSize = deviceDistancesBuffer.MinIndexGetBufferHostSize();
            CudaDeviceVariable<byte> buffer = new CudaDeviceVariable<byte>(minBufferHostSize);
            deviceDistancesBuffer.MinIndex(deviceMinDistance, deviceMinLocX, deviceMinLocY, buffer);                                

            float[] hostMinDistance = new float[1];
            int[] hostMinLocX = new int[1];
            int[] hostMinLocY = new int[1];

            deviceMinLocX.CopyToHost(hostMinLocX); //!!!!!!!(PROBLEMATIC LINE )!!!!!!!!!!!
            deviceMinLocY.CopyToHost(hostMinLocY);
            deviceMinDistance.CopyToHost(hostMinDistance);

            buffer.Dispose();
            deviceMinLocY.Dispose();
            deviceMinLocX.Dispose();
            deviceMinDistance.Dispose();
            deviceDistancesBuffer.Dispose();            
            return (hostMinDistance[0], new Point(hostMinLocX[0], hostMinLocY[0]));
        }

It works fine, but the !!!!!!!(PROBLEMATIC LINE )!!!!!!!!!!! is extremely slow (about 1 second or even more) on my GeForce GTX 1070. If I don't pass a buffer to the MinIndex function, then MinIndex itself is very slow and the !!!!!!!(PROBLEMATIC LINE )!!!!!!!!!!! is fast. Either way, the total time is equally long. Do you have any idea what causes this behavior? Is it a GPU problem, or am I managing memory incorrectly?

Thank you in advance, Sergei

kunzmi commented 4 years ago

It is always the MinIndex method that takes that long. NPP is mostly asynchronous, which means that an NPP function returns to host code before the actual work is done. Memory deallocation (if no buffer is provided) and copying data from device to host are implicit synchronization points; that is where your application waits, and so that is where your time measurements show the delay. The question remains why it takes so long: Is your image huge? Is the NPP DLL already loaded, i.e. do you run the test several times? Is only the first execution slow, or is it always slow?
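To see where the time really goes, one option is to synchronize explicitly around each call before reading the stopwatch. The following is only a minimal sketch: the class name TimingSketch, the method TimeMinIndex and the ctx parameter are hypothetical and not part of the code posted above, and it assumes ctx is the CudaContext that owns the buffers. Draining the queue before starting the timer also keeps earlier asynchronous calls, such as SqrDistanceValid_Norm, from being charged to MinIndex.

```csharp
using System;
using System.Diagnostics;
using ManagedCuda;
using ManagedCuda.NPP;

static class TimingSketch
{
    // Hypothetical helper: ctx is assumed to be the CudaContext that owns the buffers.
    public static void TimeMinIndex(CudaContext ctx, NPPImage_32fC1 deviceDistancesBuffer)
    {
        int bufferSize = deviceDistancesBuffer.MinIndexGetBufferHostSize();

        using (var deviceMinDistance = new CudaDeviceVariable<float>(1))
        using (var deviceMinLocX = new CudaDeviceVariable<int>(1))
        using (var deviceMinLocY = new CudaDeviceVariable<int>(1))
        using (var buffer = new CudaDeviceVariable<byte>(bufferSize))
        {
            // Wait for all previously queued GPU work, so that it is not
            // attributed to the call measured below.
            ctx.Synchronize();

            var stopwatch = Stopwatch.StartNew();
            deviceDistancesBuffer.MinIndex(deviceMinDistance, deviceMinLocX, deviceMinLocY, buffer);
            // The call above may return before the GPU is done; block here
            // so the stopwatch covers the actual computation.
            ctx.Synchronize();
            stopwatch.Stop();
            Console.WriteLine("MinIndex:   " + stopwatch.Elapsed.TotalMilliseconds + " ms");

            // With the GPU idle, this measures only the device-to-host copy.
            int[] hostMinLocX = new int[1];
            stopwatch.Restart();
            deviceMinLocX.CopyToHost(hostMinLocX);
            stopwatch.Stop();
            Console.WriteLine("CopyToHost: " + stopwatch.Elapsed.TotalMilliseconds + " ms");
        }
    }
}
```

If the explanation above holds, the MinIndex line should now carry almost all of the time and the copy should be nearly free. Wrapping SqrDistanceValid_Norm in the same way would show how much of the total belongs to that call.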

serjl commented 4 years ago

Thanks for the quick reply! The image size is about 1000x30, and the function runs in a loop, so the GPU code runs many times (this is not about a slow first iteration). It is slow in every iteration.