naibaf7 / caffe

Caffe: a fast open framework for deep learning. With OpenCL and CUDA support.
http://caffe.berkeleyvision.org/

OpenCL Backend NUMA Issues #8

Open naibaf7 opened 9 years ago

naibaf7 commented 9 years ago

Excerpt from my current thesis:

An issue that came up while testing the OpenCL hybrid backend was that performance did not scale as expected on systems with more than one CPU. Such systems have non-uniform memory access (NUMA): the CPUs share one address space for memory, but every processor has its own cache and memory interface. Accessing data that is attached to the other CPU comes with a large performance penalty. Compute kernels, such as the matrix-matrix multiplication in the BLAS library or the custom OpenCL kernels, cause the threads to work on adjacent data. This means a write operation on one CPU is likely to invalidate cache lines on both CPUs. At this point, the synchronization overhead seems to become larger than any speedup gained from having additional cores working on the algorithms.

To get the expected speedup, the two (or more) processors need to be presented to the Caffe library as separate devices. Then the library can be used in two individual instances. As the OpenCL hybrid backend uses two separate parallelization mechanisms (OpenCL kernels and a parallelized BLAS), two solutions would need to be applied:
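To make the "separate devices" part concrete for the OpenCL side, here is a minimal sketch of NUMA-aware device fission, assuming an OpenCL 1.2 runtime whose CPU driver supports `CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN`; the names and (missing) error handling are illustrative and not taken from the Caffe code base:

```cpp
// Illustrative sketch only: split a multi-socket OpenCL CPU device into
// one sub-device per NUMA node, so each socket can be handed to its own
// Caffe instance as a separate device.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);

  cl_device_id cpu;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, nullptr);

  // OpenCL 1.2 device fission: partition by NUMA affinity domain.
  const cl_device_partition_property props[] = {
      CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
      CL_DEVICE_AFFINITY_DOMAIN_NUMA, 0};

  cl_uint num_sub = 0;
  clCreateSubDevices(cpu, props, 0, nullptr, &num_sub);

  std::vector<cl_device_id> sub(num_sub);
  clCreateSubDevices(cpu, props, num_sub, sub.data(), nullptr);

  // Each entry of 'sub' now maps to one NUMA node; building a separate
  // context and queue per sub-device keeps kernels on node-local memory.
  std::printf("CPU split into %u NUMA sub-devices\n", num_sub);
  return 0;
}
```

The BLAS side would need an analogous restriction (e.g. pinning the BLAS threads of each instance to the cores of its own node), otherwise the second parallelization mechanism still crosses the socket boundary.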

bhack commented 9 years ago

See also https://software.intel.com/en-us/node/540545

naibaf7 commented 9 years ago

@bhack Ok, maybe not directly related (this affects CPUs, while your article is about (multiple) Xeon Phi cards), but definitely an interesting read, thanks. Sadly, I currently don't have such a card available for testing anyway. However, the Xeon Phi should not behave much differently from regular GPUs when using OpenCL.

bhack commented 9 years ago

Yes, it is not directly related to NUMA, but it is something to consider on some Intel devices. Also interesting: https://software.intel.com/en-us/articles/opencl-device-fission-for-cpu-performance
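For reference, the fission that article discusses can also be done by compute-unit counts rather than by NUMA domain. A hedged sketch of that variant (the split sizes below are arbitrary example values, not a recommendation from the article):

```cpp
// Illustrative sketch only: carve a couple of compute units out of the CPU
// device, e.g. to keep them free for the host thread or other work.
#include <CL/cl.h>
#include <cstdio>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);

  cl_device_id cpu;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, nullptr);

  cl_uint max_cu = 0;
  clGetDeviceInfo(cpu, CL_DEVICE_MAX_COMPUTE_UNITS,
                  sizeof(max_cu), &max_cu, nullptr);

  // Two sub-devices: most of the cores for the OpenCL kernels, two cores
  // left over (sizes are arbitrary for the example).
  const cl_device_partition_property props[] = {
      CL_DEVICE_PARTITION_BY_COUNTS,
      static_cast<cl_device_partition_property>(max_cu - 2), 2,
      CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0};

  cl_device_id sub[2];
  cl_uint num_sub = 0;
  clCreateSubDevices(cpu, props, 2, sub, &num_sub);
  std::printf("%u compute units split into %u sub-devices\n", max_cu, num_sub);
  return 0;
}
```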

naibaf7 commented 9 years ago

@bhack Yup, this link is exactly what needs to be added to the device initialization in the OpenCL backend. :) You see, this is also a reason why I want to have multi-device training in OpenCL just like with CUDA, as training on a multi-processor NUMA system would benefit heavily.

bhack commented 9 years ago

Beyond Intel, see also https://www.dcl.hpi.uni-potsdam.de/teaching/numasem/slides/NUMASem_OpenCL.pdf

naibaf7 commented 9 years ago

Yup, and also APU systems such as AMD HSA (Kaveri) and Intel Broadwell, as this PDF points out.