performance issues on Nvidia

After testing, the time spent in the individual OpenCL functions appears to be the same as with AMD/Intel GPUs, except for a slight overhead for memory transfers (the AMD and Intel cards I tested on are integrated chips, where memory transfers are zero-cost).

Furthermore, I have found that the extra time is spent neither on the CPU nor on the GPU. The OpenCL implementation simply waits on a semaphore for no apparent reason. This happens when the blocking clEnqueueMapBuffer calls are made. Especially strange is that it also happens when there is nothing in the queue to block on.
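For anyone who wants to check the "neither CPU nor GPU" observation on their own machine, here is a minimal sketch of one way to tell idle waiting from real work: compare wall-clock time against process CPU time around a call. This is not the project's code. `split_time`, `time_call` and `fake_blocking_call` are hypothetical names, and the sleeping function only stands in for a blocking call such as clEnqueueMapBuffer. It assumes a POSIX system with `clock_gettime`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/* Wall-clock and CPU seconds consumed by one call (hypothetical type). */
typedef struct { double wall; double cpu; } split_time;

/* Read the given POSIX clock as seconds. */
static double seconds(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run fn(arg) and report how much wall time and process CPU time passed.
   CPU time is summed over all threads of the process, so work done in
   driver threads is counted as well. */
split_time time_call(void (*fn)(void*), void* arg)
{
    split_time t;
    double w0 = seconds(CLOCK_MONOTONIC);
    double c0 = seconds(CLOCK_PROCESS_CPUTIME_ID);
    fn(arg);
    t.wall = seconds(CLOCK_MONOTONIC) - w0;
    t.cpu  = seconds(CLOCK_PROCESS_CPUTIME_ID) - c0;
    return t;
}

/* Stand-in for a blocking OpenCL call: sleeps for 50 ms without
   using the CPU, like a thread parked on a semaphore. */
void fake_blocking_call(void* arg)
{
    struct timespec delay = { 0, 50 * 1000 * 1000 };
    (void)arg;
    nanosleep(&delay, NULL);
}
```

Calling `time_call(fake_blocking_call, NULL)` should give a wall time of about 50 ms and a CPU time close to zero. A real blocking call showing the same pattern, with the GPU idle according to the profiler, would match the semaphore wait described above.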

Finally, I tried rewriting the code to use pinned memory the way Nvidia does in their OpenCL SDK examples: a fixed mapped memory segment, with clEnqueue{Read|Write}Buffer transferring from/to it. The same hang occurs. This is mysterious.
