ufo-kit / ufo-core

GLib-based framework for GPU-based data processing
GNU Lesser General Public License v3.0

Scalability issues #75

Open csa7fff opened 9 years ago

csa7fff commented 9 years ago

This is a meta-issue to collect all information relevant to scalability problems in the UFO framework and its plugins.

  1. On the NVIDIA platform, the kernel launch penalty depends on the number of GPUs in the OpenCL context. While the normal penalty is about 16–20 us, it reaches ~100 us with 6 GPUs (ipepdvcompute2) and ~200 us with 9 (ipepdvcompute1). This affects filters with a large number of kernel launches. For instance, SART executes 3 kernels for each projection at each iteration and does not scale beyond the 2nd device on ipepdvcompute2. The AMD platform is not affected. If an individual OpenCL context is used for each device, execution time grows only marginally (see the first sketch after this list). A small test for kernel launch penalties (cl_launch.c) is available from bzr+ssh://ufo.kit.edu/opencl/tools
  2. Another problem affecting UfoIr filters is the way ufo-basic-ops operate. As I understand from Andrey Shkarin's explanations, originally the kernel was compiled on each execution, which introduced huge latency. Matthias Vogelgesang then implemented caching. However, even when multiple GPUs are used, the same kernel object is always returned. As a result, the current UfoIr implementation uses mutexes and an operation cannot run on multiple devices in parallel, which currently harms the performance of the SIRT implementation. I guess the caching should be done per cl_command_queue or per thread (see the second sketch after this list).
  3. For high-speed reconstruction filters like DFI, PCIe transfer becomes an issue, especially if external PCIe enclosures sharing a single x16 link between multiple GPUs are used. As far as I can see, UFO buffers currently only support a synchronous API. I think we should provide an alternative API for asynchronous I/O. Moreover, on the NVIDIA platform it is possible to overlap memory transfers with kernel execution. This requires pinned (page-locked) host memory, which on NVIDIA can be obtained by calling clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag and then mapping the buffer with clEnqueueMapBuffer (see the third sketch after this list). I guess this should also be supported in UFO buffers.
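
To illustrate point 1, here is a minimal sketch (not UFO code) of creating one OpenCL context and queue per device instead of a single context spanning all GPUs; on NVIDIA this is the setup where we see only marginal growth of the launch penalty. Device count and error handling are simplified:

```c
/* Sketch: one cl_context per device instead of a single shared context. */
#include <CL/cl.h>

int main (void)
{
    cl_platform_id platform;
    cl_device_id devices[16];
    cl_uint n_devices;

    clGetPlatformIDs (1, &platform, NULL);
    clGetDeviceIDs (platform, CL_DEVICE_TYPE_GPU, 16, devices, &n_devices);

    /* Shared context (launch penalty grows with n_devices on NVIDIA):
     * cl_context shared = clCreateContext (NULL, n_devices, devices, NULL, NULL, NULL); */

    /* Per-device contexts: one context and queue per GPU. */
    for (cl_uint i = 0; i < n_devices; i++) {
        cl_int err;
        cl_context ctx = clCreateContext (NULL, 1, &devices[i], NULL, NULL, &err);
        cl_command_queue queue = clCreateCommandQueue (ctx, devices[i], 0, &err);

        /* ... build programs and enqueue kernels against this context ... */

        clReleaseCommandQueue (queue);
        clReleaseContext (ctx);
    }

    return 0;
}
```

The trade-off is that buffers and programs can no longer be shared between contexts, so data exchanged between devices has to go through the host.

For point 2, a minimal sketch of per-queue kernel caching; `get_cached_kernel` is a hypothetical helper, not the actual ufo-basic-ops API, and a real implementation would key the cache on both the queue and the kernel name:

```c
/* Sketch: cache a separate cl_kernel object per command queue, so argument
 * setting and launches on different devices need not be serialized. */
#include <CL/cl.h>
#include <glib.h>

static GHashTable *kernel_cache = NULL;   /* cl_command_queue -> cl_kernel */
static GMutex cache_lock;

static cl_kernel
get_cached_kernel (cl_program program, const char *name, cl_command_queue queue)
{
    cl_kernel kernel;

    g_mutex_lock (&cache_lock);

    if (kernel_cache == NULL)
        kernel_cache = g_hash_table_new (g_direct_hash, g_direct_equal);

    kernel = g_hash_table_lookup (kernel_cache, queue);

    if (kernel == NULL) {
        /* Create a kernel object private to this queue; the program itself
         * is shared and compiled only once. */
        kernel = clCreateKernel (program, name, NULL);
        g_hash_table_insert (kernel_cache, queue, kernel);
    }

    g_mutex_unlock (&cache_lock);
    return kernel;
}
```

For point 3, a minimal sketch of a pinned staging buffer as described above; `upload_pinned` and its parameters are illustrative, not part of the UFO buffer API:

```c
/* Sketch: allocate a pinned (page-locked) staging buffer with
 * CL_MEM_ALLOC_HOST_PTR, map it, and issue a non-blocking transfer that can
 * overlap with kernel execution on NVIDIA. */
#include <CL/cl.h>
#include <string.h>

static void
upload_pinned (cl_context context, cl_command_queue queue,
               cl_mem device_buffer, const float *src, size_t size)
{
    cl_int err;

    /* The driver backs this buffer with page-locked host memory. */
    cl_mem pinned = clCreateBuffer (context,
                                    CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    size, NULL, &err);

    float *host_ptr = clEnqueueMapBuffer (queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                          0, size, 0, NULL, NULL, &err);

    memcpy (host_ptr, src, size);

    /* Non-blocking write from the pinned pointer; the DMA can overlap with
     * kernels that do not depend on device_buffer. */
    cl_event transfer_done;
    clEnqueueWriteBuffer (queue, device_buffer, CL_FALSE, 0, size,
                          host_ptr, 0, NULL, &transfer_done);

    /* ... enqueue independent kernels here ... */

    clWaitForEvents (1, &transfer_done);
    clEnqueueUnmapMemObject (queue, pinned, host_ptr, 0, NULL, NULL);
    clReleaseEvent (transfer_done);
    clReleaseMemObject (pinned);
}
```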
tfarago commented 9 years ago

The pinned memory would be nice. Also, disabling the double-buffered mode could be beneficial in some situations.