Closed dagdoron closed 2 months ago
Hi @dzarukin, I understand the problem, but the way we use the library requires the results to be bit-exact with our HW. This means we can't use it on machines without VNNI. We typically run on Amazon AWS and have no control over the machine spec, so this poses a problem. We have our own version of the AVX2 code, so if there is a less optimized AVX512 version that could be enabled, it would be very helpful.
Hi @dzarukin, have you checked the possibility of exposing the less optimized / more correct version in AVX512?
An implementation that is not prone to overflow for the Intel AVX512/Intel AVX2 instruction sets is expected to be 2x slower than the current one and requires non-trivial investment. This is not a priority for the core team, but we would welcome contributions of an alternative implementation that suits your needs. An alternative instruction sequence is described in this document.
Closing as stale.
Summary
Running a convolution with different ISAs produces different outputs. The smallest example I have is a 1x1 convolution on 1x4x1x1 input/output tensors, with uint8 src and int32 dst, running on a single OMP thread. AVX2 VNNI gives the correct results; AVX2 gives wrong results.
Version
v2.5.1 (commit 3f3016c424fe0e6280898b30124fbc5037c03586) but reproduces also with master
Environment
oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:
CPU (lscpu): Intel(R) Xeon(R) Platinum 8260M CPU @ 2.40GHz
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
OS (uname -a): Ubuntu 18.04.5 LTS
Compiler (gcc --version): gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake (cmake --version): 3.10.2
Git hash (git log -1 --format=%H): 3f3016c424fe0e6280898b30124fbc5037c03586
Steps to reproduce
Control the ISA with DNNL_MAX_CPU_ISA: run once with it set to ALL and once with it set to AVX2. On some inputs the AVX2 output is wrong. An example set of inputs and weights is hard-coded into the attached code.
include
include
include "oneapi/dnnl/dnnl.hpp"
include
using namespace dnnl;
int main(int argc, char **argv) {
}
Observed behavior
The second output element is wrong (33038 instead of 47455):
onednn_verbose,info,oneDNN v2.5.1 (commit 3f3016c424fe0e6280898b30124fbc5037c03586)
onednn_verbose,info,cpu,runtime:OpenMP
onednn_verbose,info,cpu,isa:Intel AVX2
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,create:cache_miss,cpu,reorder,jit:uni,undef,src_u8::blocked:acdb:f0 dst_u8::blocked:acdb:f0,,,1x4x1x1,0.207031
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_u8::blocked:acdb:f0 dst_u8::blocked:acdb:f0,,,1x4x1x1,0.000976562
onednn_verbose,create:cache_miss,cpu,reorder,jit:uni,undef,src_s8::blocked:abcd:f0 dst_s8:p:blocked:ABcd2b8a4b:f0,,,4x4x1x1,0.0319824
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_s8::blocked:abcd:f0 dst_s8:p:blocked:ABcd2b8a4b:f0,,,4x4x1x1,0.000976562
onednn_verbose,create:cache_miss,cpu,convolution,jit_uni_int8_1x1:avx2,forward_training,src_u8::blocked:acdb:f0 wei_s8:p:blocked:ABcd2b8a4b:f0 bia_undef::undef::f0 dst_s32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic4oc4_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0,0.10791
onednn_verbose,exec,cpu,convolution,jit_uni_int8_1x1:avx2,forward_training,src_u8::blocked:acdb:f0 wei_s8:p:blocked:ABcd2b8a4b:f0 bia_undef::undef::f0 dst_s32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic4oc4_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0,0.00805664
result 8095 33038 -29131 18316
Expected behavior
The output should be identical to this run, which was dispatched to AVX-512 with VNNI:
onednn_verbose,info,oneDNN v2.5.1 (commit 3f3016c424fe0e6280898b30124fbc5037c03586)
onednn_verbose,info,cpu,runtime:OpenMP
onednn_verbose,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,create:cache_miss,cpu,reorder,jit:uni,undef,src_u8::blocked:acdb:f0 dst_u8::blocked:acdb:f0,,,1x4x1x1,0.189941
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_u8::blocked:acdb:f0 dst_u8::blocked:acdb:f0,,,1x4x1x1,0.000976562
onednn_verbose,create:cache_miss,cpu,reorder,jit:uni,undef,src_s8::blocked:abcd:f0 dst_s8:p:blocked:ABcd16a4b:f0,,,4x4x1x1,0.0258789
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_s8::blocked:abcd:f0 dst_s8:p:blocked:ABcd16a4b:f0,,,4x4x1x1,0.000976562
onednn_verbose,create:cache_miss,cpu,convolution,brgconv_1x1:avx512_core_vnni,forward_training,src_u8::blocked:acdb:f0 wei_s8:p:blocked:ABcd16a4b:f0 bia_undef::undef::f0 dst_s32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic4oc4_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0,0.0717773
onednn_verbose,exec,cpu,convolution,brgconv_1x1:avx512_core_vnni,forward_training,src_u8::blocked:acdb:f0 wei_s8:p:blocked:ABcd16a4b:f0 bia_undef::undef::f0 dst_s32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic4oc4_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0,0.00610352
result 8095 47455 -29131 18316