ufo-kit / ufo-core

GLib-based framework for GPU-based data processing
GNU Lesser General Public License v3.0

Scalability issues #75

Open csa7fff opened 9 years ago

csa7fff commented 9 years ago

This is a meta-issue to collect all information relevant to scalability problems in the UFO framework and its plugins.

  1. On the NVIDIA platform, the kernel launch penalty depends on the number of GPUs in the OpenCL context. While the normal penalty is about 16–20 us, it reaches ~100 us with 6 GPUs (ipepdvcompute2) and ~200 us with 9 (ipepdvcompute1). This affects filters with a large number of kernel launches. For instance, SART executes 3 kernels for each projection at each iteration and does not scale beyond the 2nd device on ipepdvcompute2. The AMD platform is not affected. If an individual OpenCL context is used for each device, execution time grows only marginally (see the first sketch after this list). A small test for kernel launch penalties (cl_launch.c) is available from bzr+ssh://ufo.kit.edu/opencl/tools
  2. Another problem affecting UfoIr filters is the way ufo-basic-ops operate. As I understand from Andrey Shkarin's explanations, originally the kernel was compiled on each execution, which introduced huge latency. Matthias Vogelgesang then implemented caching. However, even when multiple GPUs are used, the same kernel object is always returned. As a result, the current UfoIr implementation uses mutexes and an operation cannot run on multiple devices in parallel, which currently harms the performance of the SIRT implementation. I guess the caching should be done per cl_command_queue or per thread (see the second sketch after this list).
  3. For high-speed reconstruction filters like DFI, PCIe transfer becomes an issue, especially if external PCIe enclosures sharing a single x16 link between multiple GPUs are used. As far as I can see, UFO buffers currently only support a synchronous API. I think we should provide an alternative API for asynchronous I/O. Moreover, on the NVIDIA platform it is possible to overlap memory transfers with kernel execution. This requires pinned (page-locked) host memory, which on NVIDIA can be obtained by calling clCreateBuffer with the CL_MEM_ALLOC_HOST_PTR flag and then mapping the buffer with clEnqueueMapBuffer (see the third sketch after this list). I guess this should also be supported in UFO buffers.
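
To illustrate point 1, here is a minimal sketch (not UFO code) of creating one OpenCL context and queue per device instead of a single context spanning all GPUs; on NVIDIA this is the setup where we see only marginal growth of the launch penalty. Device count and error handling are simplified:

```c
/* Sketch: one cl_context per device instead of a single shared context. */
#include <CL/cl.h>

int main (void)
{
    cl_platform_id platform;
    cl_device_id devices[16];
    cl_uint n_devices;

    clGetPlatformIDs (1, &platform, NULL);
    clGetDeviceIDs (platform, CL_DEVICE_TYPE_GPU, 16, devices, &n_devices);

    /* Shared context (launch penalty grows with n_devices on NVIDIA):
     * cl_context shared = clCreateContext (NULL, n_devices, devices, NULL, NULL, NULL); */

    /* Per-device contexts: one context and queue per GPU. */
    for (cl_uint i = 0; i < n_devices; i++) {
        cl_int err;
        cl_context ctx = clCreateContext (NULL, 1, &devices[i], NULL, NULL, &err);
        cl_command_queue queue = clCreateCommandQueue (ctx, devices[i], 0, &err);

        /* ... build programs and enqueue kernels against this context ... */

        clReleaseCommandQueue (queue);
        clReleaseContext (ctx);
    }

    return 0;
}
```

The trade-off is that buffers and programs can no longer be shared between contexts, so data exchanged between devices has to go through the host.

For point 2, a minimal sketch of per-queue kernel caching; `get_cached_kernel` is a hypothetical helper, not the actual ufo-basic-ops API, and a real implementation would key the cache on both the queue and the kernel name:

```c
/* Sketch: cache a separate cl_kernel object per command queue, so argument
 * setting and launches on different devices need not be serialized. */
#include <CL/cl.h>
#include <glib.h>

static GHashTable *kernel_cache = NULL;   /* cl_command_queue -> cl_kernel */
static GMutex cache_lock;

static cl_kernel
get_cached_kernel (cl_program program, const char *name, cl_command_queue queue)
{
    cl_kernel kernel;

    g_mutex_lock (&cache_lock);

    if (kernel_cache == NULL)
        kernel_cache = g_hash_table_new (g_direct_hash, g_direct_equal);

    kernel = g_hash_table_lookup (kernel_cache, queue);

    if (kernel == NULL) {
        /* Create a kernel object private to this queue; the program itself
         * is shared and compiled only once. */
        kernel = clCreateKernel (program, name, NULL);
        g_hash_table_insert (kernel_cache, queue, kernel);
    }

    g_mutex_unlock (&cache_lock);
    return kernel;
}
```

For point 3, a minimal sketch of a pinned staging buffer as described above; `upload_pinned` and its parameters are illustrative, not part of the UFO buffer API:

```c
/* Sketch: allocate a pinned (page-locked) staging buffer with
 * CL_MEM_ALLOC_HOST_PTR, map it, and issue a non-blocking transfer that can
 * overlap with kernel execution on NVIDIA. */
#include <CL/cl.h>
#include <string.h>

static void
upload_pinned (cl_context context, cl_command_queue queue,
               cl_mem device_buffer, const float *src, size_t size)
{
    cl_int err;

    /* The driver backs this buffer with page-locked host memory. */
    cl_mem pinned = clCreateBuffer (context,
                                    CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    size, NULL, &err);

    float *host_ptr = clEnqueueMapBuffer (queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                          0, size, 0, NULL, NULL, &err);

    memcpy (host_ptr, src, size);

    /* Non-blocking write from the pinned pointer; the DMA can overlap with
     * kernels that do not depend on device_buffer. */
    cl_event transfer_done;
    clEnqueueWriteBuffer (queue, device_buffer, CL_FALSE, 0, size,
                          host_ptr, 0, NULL, &transfer_done);

    /* ... enqueue independent kernels here ... */

    clWaitForEvents (1, &transfer_done);
    clEnqueueUnmapMemObject (queue, pinned, host_ptr, 0, NULL, NULL);
    clReleaseEvent (transfer_done);
    clReleaseMemObject (pinned);
}
```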
tfarago commented 9 years ago

The pinned memory would be nice. Also, disabling the double-buffered mode could be beneficial in some situations.