oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

some confusion about oneDNN #782

Closed ddummkopfer closed 4 years ago

ddummkopfer commented 4 years ago

Hi, to whom it may concern. I am a learner of oneDNN and have some questions about it. Looking forward to your answers.

  1. What is the relationship between oneDNN and Intel's DPC++? Is DPC++ designed for oneDNN's GPU backend?

  2. If I use PyTorch for deep learning work on my machine without CUDA, does that mean oneDNN will help my CPU accelerate that work?

  3. In the PyTorch source code, the interface 'torch.empty()' will call the function 'CPU: empty_mkldnn' from native_functions.yaml if there is no CUDA. Is oneDNN designed as the backend of PyTorch's API? Will 'empty_mkldnn' further call functions in oneDNN?

  4. Can oneDNN be deployed to accelerate deep learning inference on Arm processors, for example Qualcomm Snapdragon-series processors?

  5. I find that the folders 'cpu' and 'gpu' under oneDNN/src have different sizes: 'cpu' is 8.8 MB, while 'gpu' is 1.6 MB. Is GPU development still going on? Will it target Intel's GEN9 or GEN11 graphics?

Looking forward to your answers. Thank you for your help!

vpirogov commented 4 years ago

Hi @ddummkopfer,

Thank you for the questions.

What is the relationship between oneDNN and Intel's DPC++? Is DPC++ designed for oneDNN's GPU backend?

Data Parallel C++ (DPC++) is a programming language based on the Khronos Group SYCL standard. DPC++ is designed for data-parallel and heterogeneous computing and allows developers to write portable programs for CPUs and accelerators. oneDNN v2.0-beta interoperates with the DPC++ runtime and DPC++ application code.

If I use PyTorch for deep learning work on my machine without CUDA, does that mean oneDNN will help my CPU accelerate that work?

oneDNN is designed to improve the performance of deep learning applications. PyTorch default builds use oneDNN to improve performance on Intel 64-compatible processors. This article describes the steps to take full advantage of oneDNN in PyTorch.

+@Jianhui-Li, @mingfeima for additional comments.
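
As a quick sanity check (a minimal sketch, not from the article above), you can confirm that your PyTorch build ships with the oneDNN backend (still called MKL-DNN in the bindings):

```python
import torch

# True if this PyTorch build includes the oneDNN (MKL-DNN) backend.
print(torch.backends.mkldnn.is_available())

# The build configuration string also lists the MKL-DNN/oneDNN version used.
print(torch.__config__.show())
```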

In the PyTorch source code, the interface 'torch.empty()' will call the function 'CPU: empty_mkldnn' from native_functions.yaml if there is no CUDA. Is oneDNN designed as the backend of PyTorch's API? Will 'empty_mkldnn' further call functions in oneDNN?

PyTorch's CPU backend uses oneDNN to improve performance. I'm not familiar with the details of the implementation though. Summoning @Jianhui-Li, @mingfeima for help.

Can oneDNN be deployed to accelerate deep learning inference on Arm processors, for example Qualcomm Snapdragon-series processors?

oneDNN has initial support for processors based on AArch64, so it should run on Snapdragon 8-series. Note that performance optimizations for AArch64 are limited at this point. Summoning @nSircombe to further comment on Arm support.

I find that the folders 'cpu' and 'gpu' under oneDNN/src have different sizes: 'cpu' is 8.8 MB, while 'gpu' is 1.6 MB. Is GPU development still going on? Will it target Intel's GEN9 or GEN11 graphics?

oneDNN currently fully supports GEN9 and GEN11 graphics, and we are focusing on optimizations for future GPUs. The difference in size is likely due to the CPU implementation supporting more ISA generations (Intel SSE4.1, Intel AVX, Intel AVX2, Intel AVX-512, Intel DL Boost, Intel AMX) and to the verbose GEMM implementation, which takes 46% of the codebase.
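
As a side note (my own sketch, not part of the size comparison above): you can see which ISA-specific kernels the CPU implementation dispatches to by enabling oneDNN's verbose mode. Depending on the oneDNN version bundled with your framework, the environment variable is DNNL_VERBOSE (or MKLDNN_VERBOSE for older releases):

```python
import os
# Must be set before the library is loaded; "1" prints one line per primitive execution.
os.environ["DNNL_VERBOSE"] = "1"

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)
x = torch.randn(1, 3, 224, 224)
y = conv(x)  # the log line shows the selected kernel, e.g. jit:avx2 or jit:avx512_core
```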

mingfeima commented 4 years ago
In the PyTorch source code, the interface 'torch.empty()' will call the function 'CPU: empty_mkldnn' from native_functions.yaml if there is no CUDA. Is oneDNN designed as the backend of PyTorch's API? Will 'empty_mkldnn' further call functions in oneDNN?

Hi @ddummkopfer, empty will call empty_cpu for a CPU tensor (please refer here), and empty_mkldnn is used for a MkldnnCPU tensor. empty_mkldnn is just a factory function; it will eventually allocate a buffer for an mkldnn tensor, e.g. mkldnn::memory.

Also please be aware that mkldnn has been renamed to oneDNN; it's just that the PyTorch bindings still use the old name mkldnn here.
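
To make the distinction concrete, here is a small sketch (my own example, not taken from the PyTorch sources): a plain torch.empty() call goes through empty_cpu and yields a strided CPU tensor, while .to_mkldnn() yields a MkldnnCPU tensor backed by an mkldnn (oneDNN) buffer:

```python
import torch

x = torch.empty(1, 3, 224, 224)  # dispatched to empty_cpu -> plain strided CPU tensor
print(x.layout)                  # torch.strided

y = x.to_mkldnn()                # MkldnnCPU tensor backed by an mkldnn (oneDNN) buffer
print(y.layout)                  # torch._mkldnn
```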

ddummkopfer commented 4 years ago

@vpirogov @mingfeima Thank you for your meaningful answers; they are quite enlightening. With further understanding, I have the following questions about oneDNN (previously called MKL-DNN).

  1. What is the biggest difference between a CPU tensor and a MkldnnCPU tensor? Is the MkldnnCPU tensor's memory layout quite different from the CPU tensor's? In my current understanding, a CPU tensor uses the plain format 'NCHW', while a MkldnnCPU tensor uses the blocked formats 'NHWC' and 'nChw16c' for better performance. Is my understanding right?

  2. What does 'verbose GEMM implementation' mean?

  3. What do 'SparseCPU' and 'SparseCUDA' mean in this entry?

         use_c10_dispatcher: full
         dispatch:
           CPU: empty_cpu
           CUDA: empty_cuda
           MkldnnCPU: empty_mkldnn
           SparseCPU, SparseCUDA: empty_sparse

  4. What does the function 'x.to_dense()' do? How should I understand 'dense'?

  5. If I use a PyTorch op that is supported by oneDNN, the runtime will first reorder from the plain layout to the blocked layout that oneDNN accepts. After the computation finishes, the runtime will reorder back from the blocked layout to the plain layout and give the output to the user. Is my understanding right?

  6. If I use ten PyTorch ops one by one, all of which are supported by oneDNN, the runtime will only do the reordering work two times, as in the sketch below. Is that right?

    class mynet(nn.Module):
        ...
        def forward(self, x):
            x = op1(x)   # do the first reorder
            .........
            x = op10(x)  # do the second reorder
            return x

  7. What is the relationship between the ideep library Intel created and oneDNN? Is ideep a wrapper around oneDNN?

Thank you for your efforts. Really appreciate that.

mingfeima commented 4 years ago

I am not sure what the 'verbose gemm' Vadim referred to means :( For the rest of your questions:

The tags in native_functions.yaml are called TensorTypeId, which works like a guidepost when an op is being dispatched to a backend implementation. A TensorTypeId consists of 'Layout' + 'Device' + 'Sparsity': e.g. 'Mkldnn' is the layout (as opposed to 'strided'), 'CPU' is the device (as opposed to 'CUDA'), and 'Sparsity' refers to COO or 'dense' (none here means 'dense'). So 'MkldnnCPU' is treating 'Mkldnn' as the layout, which means the underlying memory format could be 'blocked', e.g. 'nChw16c'. But this is opaque to PyTorch, which means the user cannot get the exact dims of a 'MkldnnCPU' tensor.
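
To connect this with your 'SparseCPU'/'SparseCUDA' question, a small illustration (my own sketch): the 'Sparsity' part of the key distinguishes COO tensors from dense ones, and to_dense() also exists for them:

```python
import torch

i = torch.tensor([[0, 1], [0, 1]])            # COO indices, one column per nonzero
v = torch.tensor([1.0, 2.0])                  # values at those positions
sparse = torch.sparse_coo_tensor(i, v, (2, 2))

print(sparse.layout)      # torch.sparse_coo -> dispatched to the 'SparseCPU' kernels
print(sparse.to_dense())  # materializes a plain strided (dense) tensor
```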

'blocked' and 'plain' are mkldnn terms; both NCHW and NHWC are 'plain' formats (under the TensorTypeId 'CPU'), while nChw16c and IOhw16i16o are 'blocked' formats (under the TensorTypeId 'MkldnnCPU'). Actually, setting all these terminologies aside, the only thing that matters here is the memory format. You can take a look at this gist on PyTorch channels-last CPU optimization; it will give you some clues, and you will find most of your answers there. Please be aware this is ongoing work and not an official doc. One thing to point out is that my plan is to make 'NHWC' and 'blocked' orthogonal, which means it will be illegal to call 'to_mkldnn()' on a 'NHWC' tensor.

to_dense()/to_mkldnn() is the conversion between 'NCHW' and 'blocked'. to(memory_format=torch.channels_last)/to(memory_format=torch.contiguous_format) is the conversion between 'NCHW' and 'NHWC' ('channels_last' and 'contiguous' are the PyTorch terms for 'NHWC' and 'NCHW').
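
A minimal sketch of the two conversion families described above (my own example, assuming a recent PyTorch with channels-last support):

```python
import torch

x = torch.randn(1, 3, 224, 224)  # plain NCHW, torch.strided layout

# Plain ('dense') <-> blocked (mkldnn) layout:
y = x.to_mkldnn()   # NCHW -> blocked, opaque to PyTorch
z = y.to_dense()    # blocked -> NCHW

# NCHW <-> NHWC within the plain layout:
nhwc = x.to(memory_format=torch.channels_last)
nchw = nhwc.to(memory_format=torch.contiguous_format)
print(nhwc.is_contiguous(memory_format=torch.channels_last))  # True
```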

Conv2d in PyTorch by default will still use mkldnn, but with reorders; you can take a look at the 2nd picture in the gist above, it is self-explanatory.

'ideep' is simply a wrapper around mkldnn; it's header-only.

ddummkopfer commented 4 years ago

@mingfeima Thank you for your marvelous answers. A few more questions and points of confusion about some concepts.

  1. What is the IOhw16i16o format you mentioned? Is it a 6-D layout of the network's weights?

  2. According to the second picture of your gist (https://gist.github.com/mingfeima/595f63e5dd2ac6f87fdb47df4ffe4772), it seems that oneDNN can only process NHWC and nChw16c input data. If the input data is in NCHW layout, a reorder must be done for oneDNN. Is that right?

  3. The plain layout and the blocked layout share the same StorageImpl but different TensorImpls. Is my understanding right?

  4. Why do we need a memory copy when calling to_mkldnn() and to_dense()? Where is the memory copied from and to?

  5. You mentioned 'Transfering the model to the mkl-dnn version, which prepares for the weight cache during the inference'. What is the weight cache you mentioned?

Really appreciate your help! Thank you so much!

mingfeima commented 4 years ago
  1. For oneDNN memory formats (or layouts), you can refer to understanding_memory_formats.
  2. Yes, for the NCHW layout, the input/output and weight need to be reordered for every execution.
  3. The blocked-layout tensor inherits from OpaqueTensorImpl; if you refer to the C++ code, yes, they have the same base class. But keep in mind that a blocked-layout tensor has a separate buffer (memory) from a plain-layout tensor.
  4. to_mkldnn() and to_dense() are out-of-place operations: to_mkldnn() will create a new blocked tensor from a plain tensor, and to_dense() does the reverse.
  5. Weight caching, sometimes referred to as weight prepacking, is designed for inference with a small batch size. The thing is, for inference with a small batch size (e.g. bs=1), the weight reorder takes about 20-30% of total runtime. Since the weight stays constant all the time, we can do the reorder once (OIHW -> OIhw16i16o) and cache/prepack the blocked weight, so the following runs skip the weight reorder (see the sketch after this comment).

Additionally, I suggest you take a look at the oneDNN convolution primitive description, here. It explains how oneDNN manipulates memory layouts and algorithms in convolution.
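
To tie this back to your earlier questions about reorders and the weight cache, here is a hedged sketch of the inference pattern PyTorch exposes (my own example): torch.utils.mkldnn.to_mkldnn() converts a module's weights to the blocked layout once, so only the input and output of the whole network are reordered:

```python
import torch
import torch.nn as nn
from torch.utils import mkldnn as mkldnn_utils

model = nn.Sequential(
    nn.Conv2d(3, 16, 3),
    nn.ReLU(),
    nn.Conv2d(16, 16, 3),
).eval()

# Prepack the weights into the blocked layout once (e.g. OIHW -> OIhw16i16o),
# so per-run weight reorders are skipped.
mkldnn_model = mkldnn_utils.to_mkldnn(model)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224).to_mkldnn()  # one reorder on the way in
    y = mkldnn_model(x)                          # ops stay in the blocked layout
    out = y.to_dense()                           # one reorder on the way out
```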

vpirogov commented 4 years ago

I am not sure what the 'verbose gemm' Vadim referred to means :(

I was referring to the autogenerated GEMM kernels that oneDNN uses in some cases, like this one. This code is almost literal assembly, including all the loops, which somewhat bloats the codebase. Sorry for not being clear enough.

ddummkopfer commented 4 years ago

Thank you both for your significant answers. Now I have a clearer picture of your wonderful work (oneDNN and oneAPI). Thank you so much.

nSircombe commented 4 years ago

Hi @ddummkopfer, sorry for the belated response.

At present there is a basic level of support for Arm in oneDNN: AArch64 builds pass the CI test suite, but we're relying on the fallback C++ reference implementations. The option to build against Arm Performance Libraries for BLAS calls has recently been exposed in master; there are some notes on this in the build docs and 'cmake/options.cmake'. We are planning an RFC soon to outline plans for implementing AArch64-specific kernels. It's worth noting that this (and the ArmPL BLAS option) will not make use of JITed kernels.

My focus for oneDNN on Arm has been 'server'-scale SoCs rather than mobile parts. That said, at present oneDNN should build on any Armv8-A hardware, which includes most varieties of Snapdragon (although I have not tested this). This is likely to remain the case, although some features may only be implemented for SoCs that support later iterations of the architecture.

ddummkopfer commented 4 years ago

Thank you for your significant answers. One day the Arm platform may work together with oneDNN to get better performance.