xuhancn opened 5 months ago
cc @mgouicem
Hi @xuhancn and thanks for the proposal. Some time ago, we decided to rely on pointers pre-allocated by users instead of malloc/free callbacks. There were two main reasons for this:
In general, the memory allocation in oneDNN happens in four places:
Could you clarify if you are already using the mechanisms above and still see allocation overheads?
Hi @mgouicem, thanks for your comment, and sorry for the delayed reply. I actually took some time to write a POC and collected some performance data.
My proposal is indeed aimed at optimizing the item you mentioned:

> for small temporary buffers not covered by scratchpad. This mostly affects gemm functionality as it is not a primitive. We encourage users to rely on the matmul primitive as it has more features. In particular, it is compatible with user-managed scratchpad.
The POC PR is here: https://github.com/pytorch/pytorch/pull/126049 which contains:
The performance comparison is as follows:
After mimalloc is registered, mkldnn_convolution performance improves by about 0.3 s. Could you please help design a memory allocation callback mechanism? It would help PyTorch get better performance on Windows, much appreciated. CC: @jgong5
Summary
During our PyTorch development, we found that the Windows system memory allocator performs poorly and slows down overall PyTorch performance. After adding a third-party memory allocator, PyTorch improved its tensor allocation performance. For details, please see: https://github.com/pytorch/pytorch/issues/102534
As a PyTorch submodule, oneDNN still uses the system memory allocator to allocate buffers for reorder/reshape operations. The related code is here: https://github.com/oneapi-src/oneDNN/blob/11f55587a6ef7ac07bac5e81fdac72a8233bb469/src/common/utils.cpp#L146-L170
I also added some debug logging to confirm this.
On Windows, I tested resnet18 and it performed more than 360k malloc/free calls via the system malloc/free, as shown below:
Problem statement
To investigate slow memory allocation on Windows, I also wrote a malloc benchmark: https://github.com/xuhancn/bench_malloc. Third-party memory allocator libraries can improve performance. This also works well in PyTorch: https://github.com/pytorch/pytorch/issues/102534#issuecomment-1627903049
So, we need a way to let oneDNN use a third-party memory allocator for a performance improvement.
Option 1: Add a memory allocation library as a submodule.
Actually, this is not a good option:
Option 2: Add CPU alloc/free callbacks to support custom memory allocator APIs.
This is a lightweight way to change the memory allocation implementation.
Preferred solution
For option 2 above, first we can define the callback functions:
The registration API is as follows:
Reference implementation:
Additional question: oneDNN has two separate malloc/free implementations:
CC: @jgong5, @chunyuan-w, @Guobing-Chen