Closed Airbala closed 2 years ago

I noticed that oneDNN targets Intel GPUs using `-DDNNL_GPU_RUNTIME=OCL`. Can oneDNN run on a device that supports OpenCL, such as an FPGA? Can anyone help me please?
Hi @Airbala , oneDNN doesn't support FPGAs. While it is possible to reuse OpenCL for FPGA programming, it would require changes on both the API and implementation sides to dispatch to an FPGA and run FPGA-specific kernels. As far as I know, the OpenVINO toolkit provides FPGA support: https://docs.openvinotoolkit.org/2020.3/_docs_install_guides_installing_openvino_linux_fpga.html
@igorsafo , Actually I have a DSP accelerator alongside my CPU, and it supports an OpenCL runtime. oneDNN is an excellent project and I want to let my DSP device handle matrix workloads. The OpenVINO toolkit is a good project too, but it isn't open source. The kernels for the DSP are easy to write, but I have no idea how to change the API and implementation sides. Could you please offer me some advice? Thanks a lot.
@Airbala I assume you are familiar with the OpenCL programming model. oneDNN abstractions are well aligned with OpenCL ones:

| oneDNN | OpenCL |
|---|---|
| engine | device + context |
| stream | queue |
| primitive | set of kernels |
| memory | buffer |
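For concreteness, here is a minimal sketch of this mapping using oneDNN's OpenCL interop API (the `dnnl::ocl_interop` namespace; the names below match recent oneDNN releases, so check the headers of your version):

```cpp
#include "dnnl.hpp"
#include "dnnl_ocl.hpp"

// Wrap existing OpenCL objects in oneDNN abstractions:
// engine <-> device + context, stream <-> queue.
dnnl::engine make_engine_for_device(cl_device_id dev, cl_context ctx,
        cl_command_queue queue, dnnl::stream &out_stream) {
    dnnl::engine eng = dnnl::ocl_interop::make_engine(dev, ctx);
    out_stream = dnnl::ocl_interop::make_stream(eng, queue);
    return eng;
}
```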
Currently only GPUs are supported through the OpenCL runtime in oneDNN, so the OpenCL-related code lives in the src/gpu directory. The easiest way to start the DSP integration would be to work with it as a GPU device, since in this case the infrastructure and dispatching can be reused as-is from the current OpenCL implementation. The following change should be done to enable the DSP device: relax the `CL_DEVICE_TYPE_GPU` check so that your device is accepted (file: src/gpu/ocl_engine.hpp). Once all these steps are done, the oneDNN GPU examples should work using OpenCL but internally dispatch onto the DSP.
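As a rough illustration only (the real check in src/gpu/ocl_engine.hpp is structured differently, and `device_supported` is an invented helper), relaxing the device-type test could look like this:

```cpp
#include <CL/cl.h>

// Hypothetical sketch: accept an OpenCL DSP/accelerator in addition to
// a GPU, so engine creation does not reject the device.
bool device_supported(cl_device_id dev) {
    cl_device_type type = 0;
    clGetDeviceInfo(dev, CL_DEVICE_TYPE, sizeof(type), &type, nullptr);
    // oneDNN's GPU runtime originally checks for CL_DEVICE_TYPE_GPU only.
    return (type & (CL_DEVICE_TYPE_GPU | CL_DEVICE_TYPE_ACCELERATOR)) != 0;
}
```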
I would highly recommend running a oneDNN example on a GPU under gdb and analyzing it step by step to understand how oneDNN works with OpenCL and what should be modified to replace the GPU with the DSP.
Just curious, what kind of DSP do you have?
@igorsafo Thanks for your reply. I've read the file src/gpu/ocl/ref_eltwise.cl. I don't understand why your kernel function handles only one float element at a time instead of using a for loop. Actually I have a CPU with 4 DSP clusters, and each cluster has many DSP cores; you can regard a core as a tiny CPU. I'm trying to make oneDNN fit this structure. I want to buy an Intel GPU to do some experiments, because I found that your kernels for the Gen9 GPU may have a similar architecture and programming style to my DSP device.
@igorsafo I read the source code of ocl_memory_storage. Could you please tell me what the difference between USM and a buffer is? I want to do some experiments on an Intel GPU; where can I purchase one?
@igorsafo Thanks a lot. The memory, stream, and engine preliminarily work on the DSP. But when it comes to the kernels, they involve the term 'GWS'. Could you please tell me what it is, so that I can write some efficient kernels?
@Airbala Good to hear,
GWS is the Global Work Size. Together with the Local Work Size (LWS), it defines the number of work-items to be executed on the target device. A work-item is a single instance of a kernel (written in OpenCL or another kernel language). LWS is the number of work-items executed within one workgroup, so the number of workgroups = GWS / LWS.
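A minimal host-side sketch (plain OpenCL, not oneDNN code) showing how GWS and LWS enter a kernel launch:

```cpp
#include <CL/cl.h>

// Enqueue a 1D kernel: with gws = 1024 and lws = 64 the runtime creates
// 1024 / 64 = 16 workgroups of 64 work-items each.
cl_int launch_1d(cl_command_queue queue, cl_kernel kernel) {
    size_t gws = 1024; // global work size: total number of work-items
    size_t lws = 64;   // local work size: work-items per workgroup
    return clEnqueueNDRangeKernel(queue, kernel, /*work_dim=*/1,
            /*global_work_offset=*/nullptr, &gws, &lws,
            /*num_events_in_wait_list=*/0, nullptr, /*event=*/nullptr);
}
```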
In oneDNN, GWS/LWS values can be generated by `dispatch_t`. `dispatch_t` splits the amount of work between execution units so that all of them stay busy while each kernel instance still has enough compute (the amount of work executed within a single kernel instance is called a block). Dispatch takes into account the number of HW threads on the target device and the layout of the memory that will be processed during kernel execution.
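To make the blocking idea concrete, here is a small sketch (my own illustration, not `dispatch_t`'s real interface) of how a GWS could be derived when each work-item processes a block of elements:

```cpp
#include <cstddef>

// Each kernel instance handles `block` elements, so fewer work-items are
// needed; the result is rounded up to a multiple of lws because OpenCL
// requires GWS to be divisible by LWS when LWS is given explicitly.
size_t compute_gws(size_t nelems, size_t block, size_t lws) {
    size_t work_items = (nelems + block - 1) / block;
    return (work_items + lws - 1) / lws * lws;
}
```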
@igorsafo Thanks. Your comments mean a lot to me. Some primitives, such as pooling and eltwise, are supported with OpenCL on my device, but some, such as reorder, are not. There is some code like this:
```cpp
ok = ok
        && compute_engine->mayiuse(
                compute::device_ext_t::intel_subgroups)
        && IMPLICATION(utils::one_of(data_type::f16, src_md()->data_type,
                               dst_md()->data_type),
                true
                        && compute_engine->mayiuse(
                                compute::device_ext_t::khr_fp16)
                        && compute_engine->mayiuse(
                                compute::device_ext_t::intel_subgroups_short));
```
Finally it returns `status::unimplemented`. My CL extensions don't include `intel_subgroups_short`. What can I do to fix this problem? I found that the convolution primitive is related to the reorder primitive, so it is important for me to solve this as soon as possible. Hope to hear from you soon. Thanks a lot.
These are my cl_extensions.
In general oneDNN is designed for Intel... there are very few projects that can run on a generic OpenCL device. Also, it is very hard to optimize for something generic.
You can take a look at dlprimitives: https://github.com/artyom-beilis/dlprimitives but I don't think the performance will be very good without device-specific optimizations.
@artyom-beilis Actually I just need to port oneDNN to my device; I don't need good performance yet. The first step is to make it work on my device, and I'll think about performance improvements after that. oneDNN has interfaces for PyTorch and many other DL frameworks, and it uses OpenCL for its GPU implementation. This is why I want to work with oneDNN. Thanks a lot; I've read about dlprimitives. It is an excellent project too, and I'll do some experiments on it later. At the same time I still want to fix my oneDNN problems. @igorsafo Could you please help me? If it can't be fixed in principle, please just tell me. I'd appreciate your help.
Hi @Airbala , Sure, let's continue the exploration of oneDNN and your DSP device.
I see that your device supports `cl_khr_fp16`, so it only lacks support for the `intel_subgroups*` family of extensions. These extensions define a subgroup: a group of work-items which work together and share a register file, so they can avoid using local memory for communication. In practice this means a kernel for 8 work-items can be compiled as a single instance of a SIMD8 kernel. This is possible if the target device is SIMD (Intel GPUs are SIMD: https://www.intel.com/content/dam/develop/external/us/en/documents/the-compute-architecture-of-intel-processor-graphics-gen9-v1d0.pdf, page 7, section 5.3).
If your DSP doesn't support this feature, the kernels can be rewritten without the subgroup extensions. I would recommend reading this article to understand sub-groups better: https://www.codeproject.com/Articles/994769/SGEMM-for-Intel-Processor-Graphics
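To illustrate what these extensions provide, here is a tiny self-contained example (my own, not a oneDNN kernel) of an OpenCL C kernel source relying on `cl_intel_subgroups`; a device without the extension will fail to compile it, which is exactly what the `mayiuse()` checks above guard against:

```cpp
// OpenCL C kernel source embedded as a C++ string literal.
static const char *subgroup_demo_src = R"CLC(
#pragma OPENCL EXTENSION cl_intel_subgroups : enable
__kernel void bcast_lane0(__global const float *src, __global float *dst) {
    int i = get_global_id(0);
    // Exchange data through the register file instead of local memory:
    // every work-item in the subgroup receives lane 0's value.
    dst[i] = intel_sub_group_shuffle(src[i], 0);
}
)CLC";
```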
There are multiple implementations of reorder targeting different format tags, data types, etc. So what you can do is start with the most generic implementation of reorder (removing the subgroup usage from it) and then add the other implementations that will be required for optimized convolutions and other primitives on the DSP. Generic reorder:
Links:
- `cl_intel_subgroups`: https://github.com/KhronosGroup/OpenCL-Docs/blob/master/extensions/cl_intel_subgroups.asciidoc
- `cl_khr_subgroups`: https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_Ext.html#cl_khr_subgroups