oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

convolution forward results are different in AVX512 VNNI and AVX2 #1354

Closed: dagdoron closed this issue 2 months ago

dagdoron commented 2 years ago

Summary

Running the same convolution with different ISAs produces different outputs. The smallest example I have is a 1x1 convolution on 1x4x1x1 input/output tensors, with u8 src, s32 dst, and a single OMP thread. AVX-512 VNNI gives the correct result; AVX2 gives a wrong result.

Version

v2.5.1 (commit 3f3016c424fe0e6280898b30124fbc5037c03586) but reproduces also with master

Environment

oneDNN includes hardware-specific optimizations and may behave differently depending on the compiler and build environment. Include the following information to help reproduce the issue:

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

Steps to reproduce

Control the ISA with the DNNL_MAX_CPU_ISA environment variable, running once with ALL and once with AVX2. On some inputs the AVX2 output is wrong; one such input/weights example is hard-coded into the code attached below.
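The same cap can also be applied programmatically; a minimal sketch (this uses the dnnl::set_max_cpu_isa call that is commented out in the full listing below, and it only takes effect if called before the first primitive is created):

#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // equivalent to running with DNNL_MAX_CPU_ISA=AVX2 in the environment:
    // cap dispatching to AVX2 kernels
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx2);
    // ... build and execute the convolution as in the listing below, then
    // rerun with the default ISA (or DNNL_MAX_CPU_ISA=ALL) and compare
    return 0;
}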

#include <cstdint>
#include <vector>

#include "oneapi/dnnl/dnnl.hpp"

#include <iostream>

using namespace dnnl;

int main(int argc, char **argv) {

// enable verbose output (same effect as setting DNNL_VERBOSE=2)
dnnl_set_verbose(2);
//dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx2);

int N = 1;
int Ci = 4;
int Co = 4;
int H = 1, W = 1, filter = 1;
int padding = 0;

using tag = memory::format_tag;
using dt = memory::data_type;
#define srcTag  tag::nhwc
#define dstTag  tag::nhwc

auto eng = dnnl::engine(dnnl::engine::kind::cpu, 0);
stream s(eng);

memory::dims conv_src_tz = {N, Ci, H, W};
memory::dims conv_weights_tz = {Co, Ci, filter, filter};
memory::dims conv_bias_tz = {Co};
memory::dims conv_dst_tz = {N, Co, H, W};
memory::dims conv_strides = {1, 1};
memory::dims conv_padding = {padding, padding};

//[Configure tensor shapes]
std::vector<uint8_t> src = { 0xed, 0xf2, 0x43, 0x0d };
void* user_src = (void*)src.data();

std::vector<uint8_t> weights = {0xcc, 0x5f, 0xc7, 0x60, 0x62, 0x63, 0xff, 0x1a, 0x85, 0xfe, 0xf8, 0x50,  0xfc, 0x3e, 0x3f, 0x03};

void* conv_weights = (void*)weights.data();
auto user_src_memory = memory({{conv_src_tz}, dt::u8, srcTag}, eng, user_src);
auto user_weights_memory = memory({{conv_weights_tz}, dt::s8, tag::oihw}, eng, conv_weights);
auto conv_src_md = memory::desc({conv_src_tz}, dt::u8, tag::any);
auto conv_weights_md = memory::desc({conv_weights_tz}, dt::s8, tag::any);
auto conv_dst_md = memory::desc({conv_dst_tz}, dt::s32, tag::any);

auto conv_desc = convolution_forward::desc(
        prop_kind::forward,
        algorithm::convolution_direct,
        conv_src_md,
        conv_weights_md,
        conv_dst_md,
        conv_strides,
        conv_padding,
        conv_padding
);
// create default attributes
dnnl::primitive_attr attr;
auto conv_prim_desc = convolution_forward::primitive_desc(conv_desc, attr, eng);
// reorder the user src buffer into the layout chosen by the primitive
auto conv_src_memory = memory(conv_prim_desc.src_desc(), eng);
auto src_reorder_pd = reorder::primitive_desc(eng, user_src_memory.get_desc(), eng, conv_src_memory.get_desc());
auto src_reorder = reorder(src_reorder_pd);
src_reorder.execute(s, user_src_memory, conv_src_memory);

auto conv_weights_memory = memory(conv_prim_desc.weights_desc(), eng);
auto weights_reorder_pd = reorder::primitive_desc(eng, user_weights_memory.get_desc(), eng, conv_weights_memory.get_desc());
auto weights_reorder = reorder(weights_reorder_pd);
weights_reorder.execute(s, user_weights_memory, conv_weights_memory);

auto conv_dst_memory = memory(conv_prim_desc.dst_desc(), eng);

auto conv = convolution_forward(conv_prim_desc);
conv.execute(s,{
    {DNNL_ARG_SRC, conv_src_memory},
    {DNNL_ARG_WEIGHTS, conv_weights_memory},
    {DNNL_ARG_DST, conv_dst_memory}
});
s.wait();

size_t size = conv_dst_memory.get_desc().get_size();
int32_t *val = static_cast<int32_t *>(conv_dst_memory.get_data_handle());
std::cout << " result ";
for (size_t i = 0; i < size / sizeof(int32_t); ++i) {
    std::cout << val[i] << " "; 
}

return 0;

}

Observed behavior

The second output element (33038) is wrong:

onednn_verbose,info,oneDNN v2.5.1 (commit 3f3016c424fe0e6280898b30124fbc5037c03586)
onednn_verbose,info,cpu,runtime:OpenMP
onednn_verbose,info,cpu,isa:Intel AVX2
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,create:cache_miss,cpu,reorder,jit:uni,undef,src_u8::blocked:acdb:f0 dst_u8::blocked:acdb:f0,,,1x4x1x1,0.207031
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_u8::blocked:acdb:f0 dst_u8::blocked:acdb:f0,,,1x4x1x1,0.000976562
onednn_verbose,create:cache_miss,cpu,reorder,jit:uni,undef,src_s8::blocked:abcd:f0 dst_s8:p:blocked:ABcd2b8a4b:f0,,,4x4x1x1,0.0319824
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_s8::blocked:abcd:f0 dst_s8:p:blocked:ABcd2b8a4b:f0,,,4x4x1x1,0.000976562
onednn_verbose,create:cache_miss,cpu,convolution,jit_uni_int8_1x1:avx2,forward_training,src_u8::blocked:acdb:f0 wei_s8:p:blocked:ABcd2b8a4b:f0 bia_undef::undef::f0 dst_s32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic4oc4_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0,0.10791
onednn_verbose,exec,cpu,convolution,jit_uni_int8_1x1:avx2,forward_training,src_u8::blocked:acdb:f0 wei_s8:p:blocked:ABcd2b8a4b:f0 bia_undef::undef::f0 dst_s32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic4oc4_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0,0.00805664

result 8095 33038 -29131 18316

Expected behavior

The output should be identical to that of this run:

onednn_verbose,info,oneDNN v2.5.1 (commit 3f3016c424fe0e6280898b30124fbc5037c03586)
onednn_verbose,info,cpu,runtime:OpenMP
onednn_verbose,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,create:cache_miss,cpu,reorder,jit:uni,undef,src_u8::blocked:acdb:f0 dst_u8::blocked:acdb:f0,,,1x4x1x1,0.189941
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_u8::blocked:acdb:f0 dst_u8::blocked:acdb:f0,,,1x4x1x1,0.000976562
onednn_verbose,create:cache_miss,cpu,reorder,jit:uni,undef,src_s8::blocked:abcd:f0 dst_s8:p:blocked:ABcd16a4b:f0,,,4x4x1x1,0.0258789
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_s8::blocked:abcd:f0 dst_s8:p:blocked:ABcd16a4b:f0,,,4x4x1x1,0.000976562
onednn_verbose,create:cache_miss,cpu,convolution,brgconv_1x1:avx512_core_vnni,forward_training,src_u8::blocked:acdb:f0 wei_s8:p:blocked:ABcd16a4b:f0 bia_undef::undef::f0 dst_s32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic4oc4_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0,0.0717773
onednn_verbose,exec,cpu,convolution,brgconv_1x1:avx512_core_vnni,forward_training,src_u8::blocked:acdb:f0 wei_s8:p:blocked:ABcd16a4b:f0 bia_undef::undef::f0 dst_s32::blocked:acdb:f0,,alg:convolution_direct,mb1_ic4oc4_ih1oh1kh1sh1dh0ph0_iw1ow1kw1sw1dw0pw0,0.00610352

result 8095 47455 -29131 18316
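For reference, the expected value of the second output element can be checked by hand from the hard-coded data: with Co = 4, the second printed value is output channel 1, whose oihw weights are bytes {0x62, 0x63, 0xff, 0x1a} = {98, 99, -1, 26} read as s8, and the u8 source is {237, 242, 67, 13}. The exact dot product is 237*98 + 242*99 + 67*(-1) + 13*26 = 23226 + 23958 - 67 + 338 = 47455, which matches the AVX-512 VNNI run above.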

dzarukin commented 2 years ago

Hi @dagdoron, thank you for letting us know. This is expected behavior and is documented in several places, like this, this, this, and this. Could you please take a look? Thank you.

dagdoron commented 2 years ago

Hi @dzarukin, I understand the problem, but the way we use the library requires the results to be bit-exact with our HW. This means we can't use it on machines without VNNI; we typically run on Amazon AWS and have no control over the instance spec, so this poses a problem. We have our own version of the AVX2 code, so if there is a less optimized AVX-512 version that could be enabled, it would be very helpful.

dagdoron commented 2 years ago

Hi @dzarukin, have you checked the possibility of exposing the less optimized / more correct version on AVX-512?

vpirogov commented 2 years ago

An implementation that is not prone to overflow on the Intel AVX-512/Intel AVX2 instruction sets is expected to be about 2x slower than the current one and requires a non-trivial investment. This is not a priority for the core team, but we would welcome contributions with an alternative implementation that suits your needs. The alternative instruction sequence is described in this document.
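For anyone hitting the same discrepancy: a minimal sketch of where the 33038 vs. 47455 difference in this particular example plausibly comes from, assuming the pre-VNNI int8 path accumulates u8*s8 products pairwise through a saturating 16-bit intermediate (the vpmaddubsw behavior described in the documents linked above):

#include <algorithm>
#include <iostream>

int main() {
    // u8 source and s8 weights for output channel 1, taken from the
    // hard-coded data in the reproducer above (oihw weight order)
    const int src[4] = {237, 242, 67, 13};
    const int wei[4] = {98, 99, -1, 26};

    // exact s32 accumulation, as done by the VNNI vpdpbusd path
    int exact = 0;
    for (int i = 0; i < 4; ++i) exact += src[i] * wei[i];

    // pairwise accumulation through a saturating int16 intermediate,
    // as done by the vpmaddubsw-based sequence on ISAs without VNNI
    auto sat16 = [](int v) { return std::min(std::max(v, -32768), 32767); };
    int pair0 = sat16(src[0] * wei[0] + src[1] * wei[1]); // 47184 -> 32767
    int pair1 = sat16(src[2] * wei[2] + src[3] * wei[3]); // 271
    int saturated = pair0 + pair1;

    std::cout << "exact:     " << exact << "\n"      // 47455 (AVX-512 VNNI result)
              << "saturated: " << saturated << "\n"; // 33038 (AVX2 result)
    return 0;
}

The other three output channels do not saturate the 16-bit intermediate, which is consistent with only the second element differing between the two verbose logs above.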

vpirogov commented 2 months ago

Closing as stale.