oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0
3.6k stars 995 forks source link

[BRGEMM Ukernel] Call of execute with empty post op parameter #2179

Open Devjiu opened 1 day ago

Devjiu commented 1 day ago

Summary

During modification of initial example for brgemm ukernel usage I removed post-ops from configuration but kept execution call with default value of post-ops parameters. In this configuration accumulation buffer is ignored and don't passed into output buffer. In general, issue is behavior of this execute with default attributes don't match with behavior of raw brgemm kernel's execute.

Version

onednn_verbose,v1,info,oneDNN v3.7.0 (commit 2f8619b0b9bb64e75a8da8f49f70ae226f841bbc) onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:128 onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost onednn_verbose,v1,info,gpu,runtime:none onednn_verbose,v1,info,graph,backend,0:dnnl_backend

Environment

oneDNN includes hardware-specific optimizations and may behave differently on depending on the compiler and build environment. Include the following information to help reproduce the issue:

-- CMAKE_BUILD_TYPE is unset, defaulting to Release -- The C compiler identification is GNU 11.4.0 -- The CXX compiler identification is GNU 11.4.0 -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C compiler: /usr/bin/cc - skipped -- Detecting C compile features -- Detecting C compile features - done -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- DNNL_TARGET_ARCH: X64 -- DNNL_LIBRARY_NAME: dnnl -- Performing Test CMAKE_HAVE_LIBC_PTHREAD -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success -- Found Threads: TRUE -- Found OpenMP_C: -fopenmp (found version "4.5") -- Found OpenMP_CXX: -fopenmp (found version "4.5") -- Found OpenMP: TRUE (found version "4.5") -- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE) -- Could NOT find Doxyrest (missing: DOXYREST_EXECUTABLE) CMake Warning (dev) at cmake/Sphinx.cmake:25 (find_package): Policy CMP0148 is not set: The FindPythonInterp and FindPythonLibs modules are removed. Run "cmake --help-policy CMP0148" for policy details. Use the cmake_policy command to set the policy and suppress this warning.

Call Stack (most recent call first): cmake/doc.cmake:28 (include) CMakeLists.txt:135 (include) This warning is for project developers. Use -Wno-dev to suppress it.

-- Found PythonInterp: .venv/bin/python (found suitable version "3.10.12", minimum required is "2.7") -- Could NOT find Sphinx (missing: SPHINX_EXECUTABLE) -- Found Git: /usr/bin/git (found version "2.34.1") -- Enabled testing coverage: CI -- Enabled workload: TRAINING -- Enabled primitives: ALL -- Enabled primitive CPU ISA: ALL -- Enabled primitive GPU ISA: ALL -- Enabled GeMM kernels ISA: ALL -- Primitive cache is enabled -- Experimental functionality for ukernels is enabled -- The ASM compiler identification is GNU -- Found assembler: /usr/bin/cc -- Graph component is enabled -- Graph compiler backend is disabled. -- Configuring done (0.8s) -- Generating done (0.4s) -- Build files have been written to: oneDNN/build

* git hash 

based on 2f8619b0b9bb64e75a8da8f49f70ae226f841bbc


# Steps to reproduce
PR to reproduce: https://github.com/Devjiu/oneDNN/pull/1 . It modifies [examples/ukernels/cpu_brgemm.cpp](https://github.com/Devjiu/oneDNN/pull/1/files#diff-212195ddf2d10e7c53b5b26281b7c3bfe233bdd82c9efade5c8f51cb9395ceb3)
Just run this test with comparison with reference output.   With execute, used [here](https://github.com/Devjiu/oneDNN/pull/1/files#diff-212195ddf2d10e7c53b5b26281b7c3bfe233bdd82c9efade5c8f51cb9395ceb3L318) output result from D_mem don't have data from accumulator buffer C_ptr. 

# Observed behavior
Dump with reduced size and enabled verbose in this [comment](https://github.com/Devjiu/oneDNN/pull/1#issuecomment-2429019344). 

# Expected behavior
I expect that those calls: 

brg_po.execute(A_ptr, B_base_ptr, A_B_po_offsets, C_ptr, scratchpad.data());

and 

brg_po.execute(A_ptr, B_base_ptr, A_B_po_offsets, C_ptr, D_mem.get_data_handle(), scratchpad.data());



will have the same result. (`C_ptr` in first case, `D_mem` in second one)
mgouicem commented 1 day ago

Hey @Devjiu thanks for the report. Indeed, when calling the extended execute function, without post-op nor conversion, the C matrix is not read (it seems that internally, it reads D and accumulate in it in that situation).

You can workaround the issue by using the minimal version of execute (without D matrix) when no post-op nor data conversion is needed.

On oneDNN side, we will clarify this behavior in documentation. We can likely add internal checks to avoid this issue for future users.