naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

OpenCL Backend NUMA Issues #8

Open naibaf7 opened 9 years ago

naibaf7 commented 9 years ago

Excerpt from my current thesis:

An issue that came up while testing the OpenCL hybrid backend was that performance did not scale as expected on systems with more than one CPU. Such systems have non-uniform memory access (NUMA): the CPUs share one address space for memory, but every processor has its own cache and memory interface. Accessing data that is attached to the other CPU comes with a large performance penalty. Compute kernels, such as the matrix-matrix multiplication in the BLAS library or the custom OpenCL kernels, cause the threads to work on adjacent data. This means a write operation on one CPU is likely to invalidate cache lines on both CPUs. At this point, the synchronization overhead seems to become larger than any speedup gained from having additional cores working on the algorithms.

To get the expected speedup, the two (or more) processors need to be presented to the Caffe library as separate devices. Then the library can be used in two individual instances. As the OpenCL hybrid backend uses two separate parallelization mechanisms (OpenCL kernels and a parallelized BLAS), two solutions would need to be applied:
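To make the "separate devices" part concrete for the OpenCL side, here is a minimal sketch of NUMA-aware device fission, assuming an OpenCL 1.2 runtime whose CPU driver supports `CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN`; the names and (missing) error handling are illustrative and not taken from the Caffe code base:

```cpp
// Illustrative sketch only: split a multi-socket OpenCL CPU device into
// one sub-device per NUMA node, so each socket can be handed to its own
// Caffe instance as a separate device.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);

  cl_device_id cpu;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, nullptr);

  // OpenCL 1.2 device fission: partition by NUMA affinity domain.
  const cl_device_partition_property props[] = {
      CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
      CL_DEVICE_AFFINITY_DOMAIN_NUMA, 0};

  cl_uint num_sub = 0;
  clCreateSubDevices(cpu, props, 0, nullptr, &num_sub);

  std::vector<cl_device_id> sub(num_sub);
  clCreateSubDevices(cpu, props, num_sub, sub.data(), nullptr);

  // Each entry of 'sub' now maps to one NUMA node; building a separate
  // context and queue per sub-device keeps kernels on node-local memory.
  std::printf("CPU split into %u NUMA sub-devices\n", num_sub);
  return 0;
}
```

The BLAS side would need an analogous restriction (e.g. pinning the BLAS threads of each instance to the cores of its own node), otherwise the second parallelization mechanism still crosses the socket boundary.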

bhack commented 9 years ago

See also https://software.intel.com/en-us/node/540545

naibaf7 commented 9 years ago

@bhack Ok, maybe not directly related (this affects CPUs, while your article is about (multiple) Xeon Phi cards), but definitely an interesting read, thanks. Sadly, I currently don't have such a card available for testing anyway. However, the Xeon Phi should not behave much differently from regular GPUs when using OpenCL.

bhack commented 9 years ago

Yes, it is not directly related to NUMA, but it is something to consider on some Intel devices. Also interesting: https://software.intel.com/en-us/articles/opencl-device-fission-for-cpu-performance
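For reference, the fission that article discusses can also be done by compute-unit counts rather than by NUMA domain. A hedged sketch of that variant (the split sizes below are arbitrary example values, not a recommendation from the article):

```cpp
// Illustrative sketch only: carve a couple of compute units out of the CPU
// device, e.g. to keep them free for the host thread or other work.
#include <CL/cl.h>
#include <cstdio>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);

  cl_device_id cpu;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, nullptr);

  cl_uint max_cu = 0;
  clGetDeviceInfo(cpu, CL_DEVICE_MAX_COMPUTE_UNITS,
                  sizeof(max_cu), &max_cu, nullptr);

  // Two sub-devices: most of the cores for the OpenCL kernels, two cores
  // left over (sizes are arbitrary for the example).
  const cl_device_partition_property props[] = {
      CL_DEVICE_PARTITION_BY_COUNTS,
      static_cast<cl_device_partition_property>(max_cu - 2), 2,
      CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0};

  cl_device_id sub[2];
  cl_uint num_sub = 0;
  clCreateSubDevices(cpu, props, 2, sub, &num_sub);
  std::printf("%u compute units split into %u sub-devices\n", max_cu, num_sub);
  return 0;
}
```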

naibaf7 commented 9 years ago

@bhack Yup, this link is exactly what needs to be added to the device initialization in the OpenCL backend. :) You see, this is also a reason why I want to have multi-device training in OpenCL just like with CUDA, as training on a multi-processor NUMA system would benefit heavily.

bhack commented 9 years ago

Beyond Intel, see also https://www.dcl.hpi.uni-potsdam.de/teaching/numasem/slides/NUMASem_OpenCL.pdf

naibaf7 commented 9 years ago

Yup, and also APU systems such as AMD HSA (Kaveri) and Intel Broadwell, as this PDF points out.