oneapi-src / oneDNN

oneAPI Deep Neural Network Library (oneDNN)
https://uxlfoundation.org
Apache License 2.0

AMX and VNNI perform nearly the same in IP primitives #1969

Closed ld0108 closed 1 day ago

ld0108 commented 2 weeks ago

Summary

We tested the IP (inner product) primitive on SPR-SP with the AMX flag enabled. From the execution times recorded in the verbose output, we observed very similar performance between the AMX and VNNI implementations.

Version

oneDNN tag v3.3.1

Environment

CPU: SPR-SP, OS: Ubuntu 22.04, Compiler: clang 14

Observed behavior

VNNI:
onednn_verbose,info,oneDNN v3.3.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic64oc256,0.0449219
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0100098
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0090332
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0109863
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc128,0.00512695
onednn_verbose,primitive,exec,cpu,eltwise,jit_int8:avx512_core,forward_inference,data_u8::blocked:ab::f0 diff_undef::undef:::,,alg:eltwise_relu alpha:0 beta:0,32x128,0.00195312
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic64oc256,0.00390625
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0090332
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.00878906
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.00805664
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc128,0.00488281
onednn_verbose,primitive,exec,cpu,eltwise,jit_int8:avx512_core,forward_inference,data_u8::blocked:ab::f0 diff_undef::undef:::,,alg:eltwise_relu alpha:0 beta:0,32x128,0.000976562

AMX:
onednn_verbose,info,oneDNN v3.3.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic64oc256,0.078125
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0200195
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.013916
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0170898
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc128,0.0100098
onednn_verbose,primitive,exec,cpu,eltwise,jit_int8:avx512_core,forward_inference,data_u8::blocked:ab::f0 diff_undef::undef:::,,alg:eltwise_relu alpha:0 beta:0,32x128,0.00195312
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic64oc256,0.00610352
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0090332
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0090332
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0078125
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc128,0.00390625
onednn_verbose,primitive,exec,cpu,eltwise,jit_int8:avx512_core,forward_inference,data_u8::blocked:ab::f0 diff_undef::undef:::,,alg:eltwise_relu alpha:0 beta:0,32x128,0.000976562

I summarized the IP execution times in a table (attached as an image).
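For reference, here is a minimal sketch of how an equivalent int8 inner product can be created and timed with the dnnl.hpp C++ API. The shapes are taken from the mb32ic256oc256 entries in the log above; this is an illustration, not the exact code we ran.

```cpp
#include <chrono>
#include <cstdio>
#include <unordered_map>

#include "oneapi/dnnl/dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Shapes from the verbose log above: mb32 ic256 oc256, int8 inference.
    const memory::dim MB = 32, IC = 256, OC = 256;

    // Plain layouts for src/dst/bias; let the library pick the weights layout.
    memory::desc src_md({MB, IC}, memory::data_type::u8, memory::format_tag::ab);
    memory::desc wei_md({OC, IC}, memory::data_type::s8, memory::format_tag::any);
    memory::desc bia_md({OC}, memory::data_type::s8, memory::format_tag::a);
    memory::desc dst_md({MB, OC}, memory::data_type::u8, memory::format_tag::ab);

    inner_product_forward::primitive_desc ip_pd(
            eng, prop_kind::forward_inference, src_md, wei_md, bia_md, dst_md);

    // Allocate memory with the layouts the primitive actually expects.
    memory src_mem(ip_pd.src_desc(), eng), wei_mem(ip_pd.weights_desc(), eng);
    memory bia_mem(ip_pd.bias_desc(), eng), dst_mem(ip_pd.dst_desc(), eng);

    inner_product_forward ip(ip_pd);
    std::unordered_map<int, memory> args
            = {{DNNL_ARG_SRC, src_mem}, {DNNL_ARG_WEIGHTS, wei_mem},
                    {DNNL_ARG_BIAS, bia_mem}, {DNNL_ARG_DST, dst_mem}};

    // Warm up once so kernel generation is not counted, then time a batch.
    ip.execute(strm, args);
    strm.wait();

    const int iters = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) ip.execute(strm, args);
    strm.wait();
    auto t1 = std::chrono::steady_clock::now();
    std::printf("avg exec time: %.3f us\n",
            std::chrono::duration<double, std::micro>(t1 - t0).count() / iters);
    return 0;
}
```

Depending on the machine and the ISA cap, the same code dispatches to either the avx512_core_vnni or the avx512_core_amx brgemm implementation shown in the log.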

Expected behavior

AMX should perform better than VNNI

dzarukin commented 2 weeks ago

Hi @ld0108, the problems you report barely take 80 KB of data. Where does the expectation that AMX will perform better than VNNI come from?
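Spelling out the arithmetic for the largest shape in the log, mb32ic256oc256 with int8 tensors: 32×256 B (src) + 256×256 B (weights) + 32×256 B (dst) ≈ 8 KB + 64 KB + 8 KB = 80 KB. At that size the measured times are likely dominated by per-call overheads and memory traffic rather than raw compute throughput.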

ld0108 commented 20 hours ago

@vpirogov @dzarukin Sorry, I did not get a notification for your reply. Can I continue the discussion on this topic? Do you have any benchmark results comparing AMX and VNNI for the IP layer? It seems that data size is a key parameter for choosing the ISA in oneDNN.
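For what it is worth, the ISA oneDNN dispatches to can also be capped explicitly, which makes it possible to compare both implementations on the same machine: either via the ONEDNN_MAX_CPU_ISA environment variable (e.g. ONEDNN_MAX_CPU_ISA=AVX512_CORE_VNNI) or programmatically. A minimal sketch of the programmatic route, assuming the dnnl.hpp API:

```cpp
#include <cstdio>

#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Cap the dispatched ISA *before* any primitive is created; the int8
    // inner product then falls back from the AMX kernels to the VNNI ones.
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx512_core_vnni);

    // ... build and time the same inner product as in the sketch above,
    // then repeat without the cap to get the AMX numbers ...

    std::printf("effective ISA enum value: %d\n",
            static_cast<int>(dnnl::get_effective_cpu_isa()));
    return 0;
}
```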