Summary

We tried to reproduce the results given in "Anatomy of High-Performance Deep Learning Convolutions on SIMD Architecture", Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst and Alexander Heinecke. ArXiv:1808.05567v2 (20 August 2018) And did not get the expected results.

Version

Report oneDNN version and githash. Version information is printed to stdout in verbose mode. version 2.3 oneDNN commit a4c31c17ce4eb846bd1e80c0ff7a9699d37c3813

Environment

cmake 3.13.4-1 gcc 8.3.0-6 make 4.2.1-1.2

CMake log : see attachment

uname : Linux dahu-29.grenoble.grid5000.fr 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

lscpu : Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 64 On-line CPU(s) list: 0-63 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz Stepping: 4 CPU MHz: 2053.353 CPU max MHz: 3700.0000 CPU min MHz: 1000.0000 BogoMIPS: 4200.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 22528K NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d

Observed behavior

Hello,

We are currently working on benchmarking oneDNN on a set of convolutions and we ran into some problems. Basically, we struggle to reproduce the results given in [Georganas18]. Our architecture is Intel(R) Xeon(R) Gold 6130 (Skylake).

We tried two different methodologies :

We used our own benchmarking code and timing function to profile onednn code. The exact methodology is described below.
We used directly the benchDNN utilities provided with oneDNN.

We were surprised by the results for two reasons : 1) First the results are inconsistent between the two methods. Our benchmarking tool gets nearly always significantly better results than benchDNN.

2) Second, even with our method (the fastest one) we found the performance of oneDNN to be much lower than we would have expected, especially for a sequential implementation. Thus, we are currently at loss to explain the discrepancies in performance and we suspect that we are missing a crucial detail.

[Georganas18] "Anatomy of High-Performance Deep Learning Convolutions on SIMD Architecture", Evangelos Georganas, Sasikanth Avancha, Kunal Banerjee, Dhiraj Kalamkar, Greg Henry, Hans Pabst and Alexander Heinecke. ArXiv:1808.05567v2 (20 August 2018). In particular, we looked at Figure 4 in order to get an idea on where the performance of oneDNN should be.

Description of the evaluation methodology

We evaluate oneDNN on a set of convolution taken from classical deep neural network benchmarks, namely ResNet18, Yolo9000 and MobileNet. We only evaluate our benchmarks in inference mode, that is, with batches of size 1, and in the sequential case.
For benchDNN, we use the following command (build option: DNNL_CPU_RUNTIME=SEQ ) :
```
benchdnn --conv --dir=FWD_I --mode=p --cfg=f32 <problem description>
```
Example of problem description : ic3ih544iw544_oc32oh544ow544_kh3kw3_sh1sw1_n"yolo9000_0"

In our custom framework, we do not measure the time taken by initialization calls (to build the 'dnnl_primitive', plus the time taken by a 'write_to_dnnl_memory') but the time taken by the following portion of code :


CHECK(dnnl_primitive_execute(
    reorder_Input_p,        // Primitive to execute
    stream_cpu,             //
    2,                      // Number of elements in args
    reorder_Input_p_args    // Arguments (data I/O)
));
CHECK(dnnl_primitive_execute(
    reorder_K_p,            // Primitive to execute
    stream_cpu,             //
    2,                      // Number of elements in args
    reorder_K_p_args        // Arguments (data I/O)
));

CHECK(dnnl_primitive_execute( convolution_p, // Primitive to execute stream_cpu, // 3, // Number of elements in args conv_p_args // Arguments (data I/O) ));

CHECK(dnnl_primitive_execute( reorder_Output_p, // Primitive to execute stream_cpu, // 2, // Number of elements in args reorder_Output_p_args // Arguments (data I/O) ));


* We used two different way of measuring time.
 - The first was clock_gettime(CLOCK_MONOTONIC...). 
 - The second was to use PAPI to get a cycle count.
We report both possibilities.

* We also to see if some cache effect could impact our measurement.
For that we use a piece of code supposed to flush entirely the cache before
taking measurement.
```c
#define BIG_SIZE 5000000
#define NUM_ITER 2
void flush_cache() {
    float tmp[8] = {0.};
    float res = 0;
    for (int i = 0; i< NUM_ITER; i++){
        float *dirty = (float *)malloc(BIG_SIZE * sizeof(float));
#pragma omp parallel for
        for (int dirt = 0; dirt < BIG_SIZE; dirt++){
            dirty[dirt] = dirt%100;
            tmp[dirt%8] += dirty[dirt];
        }
        for(int ii =0; ii<8;ii++){res+= tmp[ii];}
        free(dirty);

    }
    FILE* fd = fopen("/dev/null", "w");
    fprintf(fd, "%f\n", res);
    fclose(fd);
}

We ran both a code with this flush (cold cache) and without, and report both situations.

To address frequency scaling issues, we also disable turbo boost with the following command :
```
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
```

Measured performances

The measured results are reported in the chart (joined to this email), in percentage of the peak performance: -> The blue bar is the time reported by our framework by using "clock_gettime" and with cache flush (cold cache) -> The red bar is the time reported by our framework by using "clock_gettime" and without cache flush

-> The green bar is the time reported by our framework by using PAPI and with cache flush (cold cache) -> The violet bar is the time reported by our framework by using PAPI and without cache flush

-> The black bar is the time reported by benchDNN.

As you can see on the chart, the results are quite unexpected : -> The results of benchDNN is at most at 55% of the machine peak, which is much lower than the performances measured by our framework. -> In any case, all these measurements (both benchDNN and our framework) seems to be below their expected value (from [Georganas18]).

Therefore: -> Could you precise the evaluation methodology from which you obtained the performance measurement from [Georganas18] ? In particular, are you assuming hot or cold cache?

-> Do you have an idea on why benchDNN measured performance are that low in sequential?

-> In general, if you see a detail that we might have been missing, could you report it? Because, we are currently at loss to explain these discrepancy of performance.

We will be available, if you wish to have more details on our evaluation methodology, or if you wish for us to run additional experiments to refine our observations.

Thanks dahu_final_onednn CMakeOutput.log

Hello,

In order to facilitate the investigation of this issue, I have isolated and simplified our benchmarking code for oneDNN from the rest of our tool and I have pushed it in the following repository: https://github.com/guillaumeiooss/onednn_meas_issue_July21

By default, this code corresponds to "Papi with flush", but you should be able to easily obtain the other measures (ex: by commenting the "flush_cache()" in "oneDNN_conv.c", or by switching the timing function used)

If you have any additional questions about how we measure our benchmarks, feel free to contact us ( nicoTolly or myself ). Thanks.

Adding the paper authors, @alheinecke, @egeor, @hfp.

Three things:

we used a different CPU (28 vs. 16 cores) and different OS
we used a much older version of oneDNN and benchDNN
the NUMA bindings of the your CPU looks weird it's alternating.... never saw such a binding. That might point to a BIOS/OS problem. Expectation is that cores 0-15 are NUMA0, 16-31 NUMA1 and then the HT should be 32-47 (NUMA0) and 48-64 (NUMA1)

The alternating numbering of CPU between NUMA isn't uncommon. It was actually very common in the past. Vendors used this so that dumb applications and operating systems will scatter tasks between NUMA nodes and thus get a better memory bandwidth when only few tasks are running. Most modern x86 platforms indeed don't alternate and put HTs at the end as you said, but it's not strictly required (and that's actually why tools such as hwloc are used everywhere in HPC now, it hides these non-portable issues).

@alheinecke confirmed that the data was collected with benchdnn using recommended benchmarking settings and Intel C/C++ Compiler. Here're the commands to build and run the benchmark with oneDNN v2.3:

mkdir build && cd build
CC=icc CXX=icpc cmake .. && make -j
KMP_AFFINITY=compact,granularity=fine OMP_NUM_THREADS=28 numactl -l ./tests/benchdnn/benchdnn --conv --mode=p --cfg=f32 --dir=fwd_d --mb=28 --batch=tests/benchdnn/inputs/conv/shapes_resnet_50

This command line assumes that system has 28 cores, hyperthreading is off, and cores 0-27 are located on the same socket.

First, I would like to clarify that we only consider sequential performance in this issue. Thus, the NUMA binding should not have an impact on the performance of the code Thank you anyway for your comments, their insights will be useful for the parallel case (that we also considered separately).

Next, let me reformulate our questions: 1) The first question is about the measurement methodology. From our understanding, the "official" way to measure the performance of your application is benchdnn (cf black bar in our figure). We tried to replicate these measurement in our framework, but we were not sure of the hypothesis you took. In particular, we tried a cold/hot cache hypothesis and we tried different counters to measure the performance (cf rest of the bars in the Figure). However, we frequently obtain better performance with our framework compared to benchdnn. Therefore, there is probably either a mismatch or an issue here, and in either case, we would appreciate a bit more clarification on your measurement methodology.

2) The second question concerns the average performance obtained for "conv2d". Assuming that, for each convolution layer sizes, we consider the best result obtained across all measurement methods, the average is around 62% (60% to 65% if we consider each network separately). These performance seems to be too low compared to what we were expecting. For example, in the paper we have cited ([Georganas18]) the performance reported are around 75% in average, with some layers being above 90% of the peak performance. Of course, the architecture is not the same, and we consider the sequential case, so a direct comparison with the numbers of this paper does not make sense. However, we are expecting that this provide at least an intuition on the interval where your performance should be. Therefore: are our intuition correct (and we might be missing something when measuring your performance), or do you think our numbers are coherent with your experience with oneDNN ?

Once again, thank you for your feedback.

Thank you for clarifications.

We use benchdnn as the main tool to benchmark oneDNN performance. Benchdnn uses the following approach:

Input data and weights are populated with data that does not contain NaNs or denormals.
A warmup run is executed. This run is not timed.
Problem under test is executed in the loop for at least 5 iterations until 10000 iterations or 1 s of runtime, whichever happens first. You can find relevant piece of code here. Caches are not flushed between runs, so we are in the 'hot cache' situation.
Timestamps are collected with std::chrono::high_resolution_clock. You can find the timer implementation here.
Minimum and average execution time for the problem is reported, and GFlops/s metrics derived from these times. By default we use average time and average GFlops/s values.

The paper you keep referring to contains performance data for ResNet-50 forward propagation with batch 28 on a system with Intel Xeon Platinum 8180 processor running with 28 threads on a single socket. Using different workload, or different execution conditions, or different batch size will produce different performance results. The biggest factor impacting compute efficiency numbers in your case is likely a batch size.

I ran an experiment with ResNet-50 batch 1 inference in sequential mode on Intel Xeon Platinum 8180 and see compute efficiency in 64-71% range (based on observed frequency). One thing to keep in mind when benchmarking on a single core and computing efficiency is that Intel Xeon Scalable can significantly increase frequency of loaded core when the system is otherwise idle. I'm seeing the core going up to 3.36 GHz in the benchmark, while the nominal frequency is 2.5 GHz. To get stable reproducible results you might want to disable power management features and run the system at fixed frequency for the purposes of your study.

I just noticed that I made a mistake in efficiency compuations. For single core of Intel Xeon Platinum 8180 the compute peak is 160 GFlops/s at nominal frequency (2.5 GHz) and 215 GFlops/s at maximal turbo frequency I observed in the benchmark (3.36 GFlops/s). The brings compute efficiency to 73-96% range for nominal frequency and 68-97% for the turbo case.

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          56
On-line CPU(s) list:             0-55
Thread(s) per core:              1
Core(s) per socket:              28
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
Stepping:                        4
CPU MHz:                         1000.112
CPU max MHz:                     3800.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        5000.00
Virtualization:                  VT-x
L1d cache:                       1.8 MiB
L1i cache:                       1.8 MiB
L2 cache:                        56 MiB
L3 cache:                        77 MiB
NUMA node0 CPU(s):               0-27
NUMA node1 CPU(s):               28-55
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled
                                  via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __us
                                 er pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB cond
                                 itional, IBRS_FW, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep m
                                 trr pge mca cmov pat pse36 clflush dts acpi m
                                 mx fxsr sse sse2 ss ht tm pbe syscall nx pdpe
                                 1gb rdtscp lm constant_tsc art arch_perfmon p
                                 ebs bts rep_good nopl xtopology nonstop_tsc c
                                 puid aperfmperf pni pclmulqdq dtes64 monitor 
                                 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xt
                                 pr pdcm pcid dca sse4_1 sse4_2 x2apic movbe p
                                 opcnt tsc_deadline_timer aes xsave avx f16c r
                                 drand lahf_lm abm 3dnowprefetch cpuid_fault e
                                 pb cat_l3 cdp_l3 invpcid_single pti intel_ppi
                                 n ssbd mba ibrs ibpb stibp tpr_shadow vnmi fl
                                 expriority ept vpid ept_ad fsgsbase tsc_adjus
                                 t bmi1 hle avx2 smep bmi2 erms invpcid rtm cq
                                 m mpx rdt_a avx512f avx512dq rdseed adx smap 
                                 clflushopt clwb intel_pt avx512cd avx512bw av
                                 x512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc
                                  cqm_occup_llc cqm_mbm_total cqm_mbm_local dt
                                 herm ida arat pln pts pku ospke md_clear flus
                                 h_l1d

$ KMP_AFFINITY=compact,granularity=fine OMP_NUM_THREADS=1 numactl -l ./tests/benchdnn/benchdnn --conv --mode=p --cfg=f32 --dir=fwd_i --mb=1 --batch=tests/benchdnn/inputs/conv/shapes_resnet_50
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%Gfreq%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brgconv:avx512_core,"resnet_50:conv1",--conv --mode=P --dir=FWD_I g1mb1ic3ih224oc64oh112kh7sh2ph3n"resnet_50:conv1",0.232429,0,1.65649,140.314,1.66623,139.494
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res2a_branch1*4",--conv --mode=P --dir=FWD_I mb1ic64ih56oc256oh56kh1ph0n"resnet_50:res2a_branch1*4",0.10276,0,0.787598,130.473,0.791423,129.843
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res2a_branch2a",--conv --mode=P --dir=FWD_I mb1ic64ih56oc64oh56kh1ph0n"resnet_50:res2a_branch2a",0.0256901,0,0.17749,144.741,0.179558,143.074
perf,cpu,brgconv:avx512_core,"resnet_50:res2a_branch2b*3",--conv --mode=P --dir=FWD_I mb1ic64ih56oc64oh56kh3ph1n"resnet_50:res2a_branch2b*3",0.225739,0,1.57227,143.575,1.57818,143.037
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res2b_branch2a*2",--conv --mode=P --dir=FWD_I mb1ic256ih56oc64oh56kh1ph0n"resnet_50:res2b_branch2a*2",0.10276,0,0.702637,146.25,0.707757,145.192
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res3a_branch1",--conv --mode=P --dir=FWD_I g1mb1ic256ih56oc512oh28kh1sh2ph0n"resnet_50:res3a_branch1",0.205521,0,1.39819,146.99,1.40263,146.525
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res3a_branch2a",--conv --mode=P --dir=FWD_I g1mb1ic256ih56oc128oh28kh1sh2ph0n"resnet_50:res3a_branch2a",0.0513802,0,0.374268,137.282,0.377579,136.078
perf,cpu,brgconv:avx512_core,"resnet_50:res3a_branch2b*4",--conv --mode=P --dir=FWD_I mb1ic128ih28oc128oh28kh3ph1n"resnet_50:res3a_branch2b*4",0.220332,0,1.48828,148.045,1.49279,147.597
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res3a_branch2c*4",--conv --mode=P --dir=FWD_I mb1ic128ih28oc512oh28kh1ph0n"resnet_50:res3a_branch2c*4",0.10276,0,0.708984,144.94,0.714111,143.9
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res3b_branch2a*3",--conv --mode=P --dir=FWD_I mb1ic512ih28oc128oh28kh1ph0n"resnet_50:res3b_branch2a*3",0.10276,0,0.693115,148.259,0.696134,147.616
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res4a_branch1",--conv --mode=P --dir=FWD_I g1mb1ic512ih28oc1024oh14kh1sh2ph0n"resnet_50:res4a_branch1",0.205521,0,1.52905,134.411,1.53804,133.625
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res4a_branch2a",--conv --mode=P --dir=FWD_I g1mb1ic512ih28oc256oh14kh1sh2ph0n"resnet_50:res4a_branch2a",0.0513802,0,0.437744,117.375,0.440833,116.553
perf,cpu,brgconv:avx512_core,"resnet_50:res4a_branch2b*6",--conv --mode=P --dir=FWD_I mb1ic256ih14oc256oh14kh3ph1n"resnet_50:res4a_branch2b*6",0.209715,0,1.36279,153.886,1.36691,153.423
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res4a_branch2c*6",--conv --mode=P --dir=FWD_I mb1ic256ih14oc1024oh14kh1ph0n"resnet_50:res4a_branch2c*6",0.10276,0,0.695801,147.687,0.699016,147.007
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res4b_branch2a*5",--conv --mode=P --dir=FWD_I mb1ic1024ih14oc256oh14kh1ph0n"resnet_50:res4b_branch2a*5",0.10276,0,0.698975,147.016,0.701806,146.423
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res5a_branch1",--conv --mode=P --dir=FWD_I g1mb1ic1024ih14oc2048oh7kh1sh2ph0n"resnet_50:res5a_branch1",0.205521,0,1.50342,136.702,1.50703,136.374
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res5a_branch2a",--conv --mode=P --dir=FWD_I g1mb1ic1024ih14oc512oh7kh1sh2ph0n"resnet_50:res5a_branch2a",0.0513802,0,0.376465,136.481,0.378724,135.666
perf,cpu,brgconv:avx512_core,"resnet_50:res5a_branch2b*3",--conv --mode=P --dir=FWD_I mb1ic512ih7oc512oh7kh3ph1n"resnet_50:res5a_branch2b*3",0.189268,0,1.54028,122.879,1.54427,122.562
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res5a_branch2c*3",--conv --mode=P --dir=FWD_I mb1ic512ih7oc2048oh7kh1ph0n"resnet_50:res5a_branch2c*3",0.10276,0,0.738525,139.143,0.742989,138.307
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res5b_branch2a*2",--conv --mode=P --dir=FWD_I mb1ic2048ih7oc512oh7kh1ph0n"resnet_50:res5b_branch2a*2",0.10276,0,0.781494,131.492,0.784267,131.027
tests:20 passed:20 skipped:0 mistrusted:0 unimplemented:0 failed:0 listed:0
total perf: min(ms):19.2239 avg(ms):19.3103 

$ sudo echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo
$ KMP_AFFINITY=compact,granularity=fine OMP_NUM_THREADS=1 numactl -l ./tests/benchdnn/benchdnn --conv --mode=p --cfg=f32 --dir=fwd_i --mb=1 --batch=tests/benchdnn/inputs/conv/shapes_resnet_50
Output template: perf,%engine%,%impl%,%name%,%prb%,%Gops%,%Gfreq%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,brgconv:avx512_core,"resnet_50:conv1",--conv --mode=P --dir=FWD_I g1mb1ic3ih224oc64oh112kh7sh2ph3n"resnet_50:conv1",0.232429,0,1.1958,194.371,1.21861,190.733
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res2a_branch1*4",--conv --mode=P --dir=FWD_I mb1ic64ih56oc256oh56kh1ph0n"resnet_50:res2a_branch1*4",0.10276,0,0.569092,180.569,0.572359,179.539
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res2a_branch2a",--conv --mode=P --dir=FWD_I mb1ic64ih56oc64oh56kh1ph0n"resnet_50:res2a_branch2a",0.0256901,0,0.173584,147.998,0.175669,146.242
perf,cpu,brgconv:avx512_core,"resnet_50:res2a_branch2b*3",--conv --mode=P --dir=FWD_I mb1ic64ih56oc64oh56kh3ph1n"resnet_50:res2a_branch2b*3",0.225739,0,1.13135,199.531,1.1368,198.575
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res2b_branch2a*2",--conv --mode=P --dir=FWD_I mb1ic256ih56oc64oh56kh1ph0n"resnet_50:res2b_branch2a*2",0.10276,0,0.508057,202.262,0.511268,200.991
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res3a_branch1",--conv --mode=P --dir=FWD_I g1mb1ic256ih56oc512oh28kh1sh2ph0n"resnet_50:res3a_branch1",0.205521,0,1.00244,205.02,1.00638,204.219
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res3a_branch2a",--conv --mode=P --dir=FWD_I g1mb1ic256ih56oc128oh28kh1sh2ph0n"resnet_50:res3a_branch2a",0.0513802,0,0.267578,192.02,0.270058,190.256
perf,cpu,brgconv:avx512_core,"resnet_50:res3a_branch2b*4",--conv --mode=P --dir=FWD_I mb1ic128ih28oc128oh28kh3ph1n"resnet_50:res3a_branch2b*4",0.220332,0,1.07056,205.811,1.07488,204.983
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res3a_branch2c*4",--conv --mode=P --dir=FWD_I mb1ic128ih28oc512oh28kh1ph0n"resnet_50:res3a_branch2c*4",0.10276,0,0.522949,196.502,0.526402,195.213
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res3b_branch2a*3",--conv --mode=P --dir=FWD_I mb1ic512ih28oc128oh28kh1ph0n"resnet_50:res3b_branch2a*3",0.10276,0,0.501709,204.821,0.504241,203.792
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res4a_branch1",--conv --mode=P --dir=FWD_I g1mb1ic512ih28oc1024oh14kh1sh2ph0n"resnet_50:res4a_branch1",0.205521,0,1.07397,191.365,1.07843,190.574
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res4a_branch2a",--conv --mode=P --dir=FWD_I g1mb1ic512ih28oc256oh14kh1sh2ph0n"resnet_50:res4a_branch2a",0.0513802,0,0.314697,163.269,0.317179,161.991
perf,cpu,brgconv:avx512_core,"resnet_50:res4a_branch2b*6",--conv --mode=P --dir=FWD_I mb1ic256ih14oc256oh14kh3ph1n"resnet_50:res4a_branch2b*6",0.209715,0,0.996826,210.383,1.00009,209.696
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res4a_branch2c*6",--conv --mode=P --dir=FWD_I mb1ic256ih14oc1024oh14kh1ph0n"resnet_50:res4a_branch2c*6",0.10276,0,0.512695,200.432,0.515161,199.473
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res4b_branch2a*5",--conv --mode=P --dir=FWD_I mb1ic1024ih14oc256oh14kh1ph0n"resnet_50:res4b_branch2a*5",0.10276,0,0.520508,197.423,0.524121,196.062
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res5a_branch1",--conv --mode=P --dir=FWD_I g1mb1ic1024ih14oc2048oh7kh1sh2ph0n"resnet_50:res5a_branch1",0.205521,0,1.14502,179.491,1.14868,178.92
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res5a_branch2a",--conv --mode=P --dir=FWD_I g1mb1ic1024ih14oc512oh7kh1sh2ph0n"resnet_50:res5a_branch2a",0.0513802,0,0.287598,178.653,0.28956,177.442
perf,cpu,brgconv:avx512_core,"resnet_50:res5a_branch2b*3",--conv --mode=P --dir=FWD_I mb1ic512ih7oc512oh7kh3ph1n"resnet_50:res5a_branch2b*3",0.189268,0,1.21021,156.393,1.21424,155.873
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res5a_branch2c*3",--conv --mode=P --dir=FWD_I mb1ic512ih7oc2048oh7kh1ph0n"resnet_50:res5a_branch2c*3",0.10276,0,0.564209,182.132,0.567464,181.087
perf,cpu,brgconv_1x1:avx512_core,"resnet_50:res5b_branch2a*2",--conv --mode=P --dir=FWD_I mb1ic2048ih7oc512oh7kh1ph0n"resnet_50:res5b_branch2a*2",0.10276,0,0.606934,169.311,0.609898,168.488
tests:20 passed:20 skipped:0 mistrusted:0 unimplemented:0 failed:0 listed:0
total perf: min(ms):14.1758 avg(ms):14.2615

@guillaumeiooss, I finally had a chance to look at the benchmark you use. The main difference with benchdnn is that it includes execution of reorders in convolution timing. The only thing benchdnn includes into timing is convolution call:

    CHECK(dnnl_primitive_execute(
            convolution_p,          // Primitive to execute
            stream_cpu,             //
            3,                      // Number of elements in args
            conv_p_args             // Arguments (data I/O)
        ));

Working assumption in benchdnn is that activations and weights are kept in blocked format without converting them to plain format after each convolution call.

Closing as the question is addressed. Feel free to reopen with additional data.

oneapi-src / oneDNN

Problem in reproducing performance given a previous paper #1115