Status: Closed (closed by ld0108, 1 day ago)
Hi @ld0108, the problems you report barely take 80 KB of data. Where is the expectation coming from that AMX will perform better than VNNI?
@vpirogov @dzarukin Sorry, I didn't get a notification for your reply. Can I continue the discussion in this topic? Do you have any benchmark results comparing AMX and VNNI on the inner product (IP) layer? It seems the data size is one key parameter oneDNN uses when choosing the ISA.
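For an apples-to-apples comparison on the same machine, one option is to cap the ISA oneDNN is allowed to dispatch to via the `ONEDNN_MAX_CPU_ISA` environment variable and rerun the identical workload under both caps. A minimal sketch (`./your_app` is a placeholder for whatever binary runs the IP layers):

```shell
# Run the same workload twice, capping which ISA oneDNN may dispatch to.
# ONEDNN_VERBOSE=1 prints which implementation was actually chosen.
ONEDNN_VERBOSE=1 ONEDNN_MAX_CPU_ISA=AVX512_CORE_VNNI ./your_app  # VNNI-only path
ONEDNN_VERBOSE=1 ONEDNN_MAX_CPU_ISA=AVX512_CORE_AMX  ./your_app  # AMX allowed
```

Note that `ONEDNN_MAX_CPU_ISA` must be set before the first primitive is created; it has no effect once dispatching has happened.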
Summary
We tested the inner product (IP) primitive on SPR-SP with the AMX flag enabled. From the execution times recorded in the ONEDNN_VERBOSE output, we observed very similar performance for the AMX and VNNI implementations.
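For context on the problem sizes involved, a quick back-of-the-envelope calculation of the working set of the largest layer in the logs below (a sketch, assuming 1 byte per u8/s8 element, as the verbose output indicates):

```python
# Approximate working-set size of the mb32ic256oc256 int8 inner product
# (1 byte per u8/s8 element).
mb, ic, oc = 32, 256, 256
src = mb * ic   # u8 source activations
wei = ic * oc   # s8 weights
bia = oc        # s8 bias
dst = mb * oc   # u8 destination
total = src + wei + bia + dst
print(total, round(total / 1024, 2))  # → 82176 80.25
```

At roughly 80 KB the whole problem fits comfortably in a single core's L2 cache, so fixed per-call costs (including AMX tile configuration) can dominate over raw compute throughput.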
Version
oneDNN tag v3.3.1
Environment
CPU: SPR-SP, OS: Ubuntu 22.04, Compiler: clang 14
Observed behavior
VNNI:

```
onednn_verbose,info,oneDNN v3.3.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic64oc256,0.0449219
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0100098
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0090332
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0109863
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc128,0.00512695
onednn_verbose,primitive,exec,cpu,eltwise,jit_int8:avx512_core,forward_inference,data_u8::blocked:ab::f0 diff_undef::undef:::,,alg:eltwise_relu alpha:0 beta:0,32x128,0.00195312
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic64oc256,0.00390625
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0090332
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.00878906
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.00805664
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_vnni,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB4b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc128,0.00488281
onednn_verbose,primitive,exec,cpu,eltwise,jit_int8:avx512_core,forward_inference,data_u8::blocked:ab::f0 diff_undef::undef:::,,alg:eltwise_relu alpha:0 beta:0,32x128,0.000976562
```
AMX:

```
onednn_verbose,info,oneDNN v3.3.0 (commit N/A)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic64oc256,0.078125
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0200195
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.013916
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0170898
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc128,0.0100098
onednn_verbose,primitive,exec,cpu,eltwise,jit_int8:avx512_core,forward_inference,data_u8::blocked:ab::f0 diff_undef::undef:::,,alg:eltwise_relu alpha:0 beta:0,32x128,0.00195312
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic64oc256,0.00610352
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0090332
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0090332
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc256,0.0078125
onednn_verbose,primitive,exec,cpu,inner_product,brgemm:avx512_core_amx,forward_inference,src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0 bia_s8::blocked:a::f0 dst_u8:a:blocked:ab::f0,,,mb32ic256oc128,0.00390625
onednn_verbose,primitive,exec,cpu,eltwise,jit_int8:avx512_core,forward_inference,data_u8::blocked:ab::f0 diff_undef::undef:::,,alg:eltwise_relu alpha:0 beta:0,32x128,0.000976562
```
I summarized the IP execution times (ms, values taken from the logs above) in one table:

| Problem | VNNI, 1st run | VNNI, 2nd run | AMX, 1st run | AMX, 2nd run |
|---|---|---|---|---|
| mb32ic64oc256 | 0.0449 | 0.0039 | 0.0781 | 0.0061 |
| mb32ic256oc256 | 0.0100 | 0.0090 | 0.0200 | 0.0090 |
| mb32ic256oc256 | 0.0090 | 0.0088 | 0.0139 | 0.0090 |
| mb32ic256oc256 | 0.0110 | 0.0081 | 0.0171 | 0.0078 |
| mb32ic256oc128 | 0.0051 | 0.0049 | 0.0100 | 0.0039 |
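Such a table can be built automatically by splitting each `primitive,exec` verbose record on commas; per the `primitive,info,template` line in the logs, the implementation name, problem descriptor, and exec time sit at fixed positions. A minimal sketch against the v3.x verbose format shown above:

```python
def parse_exec_line(line):
    """Extract (implementation, problem, exec_time_ms) from a
    oneDNN 'primitive,exec' verbose record."""
    fields = line.strip().split(",")
    # Template: onednn_verbose,primitive,exec,engine,primitive,impl,
    #           prop_kind,mem_descs,attrs,aux,problem_desc,exec_time
    impl = fields[5]
    problem = fields[-2]
    time_ms = float(fields[-1])
    return impl, problem, time_ms

line = ("onednn_verbose,primitive,exec,cpu,inner_product,"
        "brgemm:avx512_core_amx,forward_inference,"
        "src_u8::blocked:ab::f0 wei_s8:a:blocked:AB16b64a4b::f0,"
        ",,mb32ic256oc256,0.0090332")
print(parse_exec_line(line))
# → ('brgemm:avx512_core_amx', 'mb32ic256oc256', 0.0090332)
```

Splitting on commas is safe here because the memory-descriptor field uses spaces, not commas, as its internal separator.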
Expected behavior
AMX should perform better than VNNI