ukri-excalibur / excalibur-tests

Performance benchmarks and regression tests for the ExCALIBUR project
https://ukri-excalibur.github.io/excalibur-tests/
Apache License 2.0

Test manually profiling STREAM with `likwid-perfctr` #242

Open tkoskela opened 9 months ago

tkoskela commented 9 months ago

Test profiling with `likwid-perfctr` on a system where we don't have root access (e.g. Kathleen, Young). Use a simple application, such as STREAM, and investigate which LIKWID metric groups we can access with user permissions. Build STREAM manually using the Makefile in its GitHub repository. Run in serial as a first step. It's probably fine to run on a login node for debugging, because it only takes a few seconds. In the end we need to run on a compute node through the scheduler.

Investigate which metrics are available from LIKWID, e.g.

+----------------------------+--------------+
|           Metric           |    Core 1    |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    | 3.522605e-03 |
|    Runtime unhalted [s]    | 1.107221e-04 |
|         Clock [MHz]        | 7.982933e+02 |
|             CPI            | 1.867334e+00 |
|         Branch rate        | 2.191491e-01 |
|  Branch misprediction rate | 1.979745e-02 |
| Branch misprediction ratio | 9.033780e-02 |
|   Instructions per branch  | 4.563103e+00 |
+----------------------------+--------------+
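
To compare metrics across machines or runs, it can help to turn a table like the one above into numbers. The following is a minimal sketch, not part of excalibur-tests: `parse_likwid_table` is a hypothetical helper name, and it assumes the two-column ASCII layout shown in the example.

```python
def parse_likwid_table(text):
    """Parse a two-column LIKWID ASCII table into {metric name: float}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Data rows start and end with '|'; border rows start with '+'.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2:
            continue
        name, value = cells
        try:
            metrics[name] = float(value)
        except ValueError:
            # The header row ("Metric | Core 1") has no numeric value; skip it.
            continue
    return metrics


# A few rows copied from the example table in this issue.
sample = """\
+----------------------------+--------------+
|           Metric           |    Core 1    |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    | 3.522605e-03 |
|             CPI            | 1.867334e+00 |
| Branch misprediction ratio | 9.033780e-02 |
+----------------------------+--------------+
"""

metrics = parse_likwid_table(sample)
print(metrics)
```

The parser skips anything that is not a `| name | value |` row, so it can be fed the full `likwid-perfctr` output; three-column tables such as the raw event table are ignored.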

themkots commented 8 months ago

(EXCALIBUR) ✔ ~/Excalibur/dloads/STREAM [master|✔]
[ucaseko@login02.ib.young:7 STREAM]$ hostname
login02
(EXCALIBUR) ✔ ~/Excalibur/dloads/STREAM [master|✔]
[ucaseko@login02.ib.young:7 STREAM]$ pwd
/home/ucaseko/Excalibur/dloads/STREAM
(EXCALIBUR) ✔ ~/Excalibur/dloads/STREAM [master|✔]
[ucaseko@login02.ib.young:7 STREAM]$ ../../likwid-5.3.0/bin/likwid-perfctr -C S0:1  -g BRANCH ./stream_c.exe
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
CPU type:       Intel Cascadelake SP processor
CPU clock:      2.49 GHz
--------------------------------------------------------------------------------
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 8129 microseconds.
   (= 8129 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10042.2     0.016097     0.015933     0.016232
Scale:          13951.9     0.011647     0.011468     0.011744
Add:            15880.2     0.015263     0.015113     0.015490
Triad:          15587.4     0.015524     0.015397     0.015697
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+------------+
|             Event            | Counter | HWThread 1 |
+------------------------------+---------+------------+
|       INSTR_RETIRED_ANY      |  FIXC0  | 2343230311 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  | 2288835488 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1550699600 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  367655406 |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |       8310 |
+------------------------------+---------+------------+

+----------------------------+--------------+
|           Metric           |  HWThread 1  |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    |       0.6810 |
|    Runtime unhalted [s]    |       0.9177 |
|         Clock [MHz]        |    3681.2784 |
|             CPI            |       0.9768 |
|         Branch rate        |       0.1569 |
|  Branch misprediction rate | 3.546386e-06 |
| Branch misprediction ratio | 2.260269e-05 |
|   Instructions per branch  |       6.3734 |
+----------------------------+--------------+

+----------------------------+--------------+
|           Metric           |  HWThread 1  |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    |       0.7105 |
|    Runtime unhalted [s]    |       0.8912 |
|         Clock [MHz]        |    3659.6906 |
|             CPI            |       0.9483 |
|         Branch rate        |       0.1569 |
|  Branch misprediction rate | 3.707283e-06 |
| Branch misprediction ratio | 2.362815e-05 |
|   Instructions per branch  |       6.3734 |
+----------------------------+--------------+
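
As a sanity check, the derived BRANCH metrics can be recomputed from the raw event counts of the first Young run (the `Group 1: BRANCH` event table). The sketch below assumes the ratios are defined as in my reading of LIKWID's BRANCH group (CPI = core cycles / instructions, and so on). The function name `branch_group_metrics` is hypothetical.

```python
def branch_group_metrics(instr, clk_core, branches, mispredicted):
    """Recompute BRANCH-group ratios from raw counter values.

    The formulas are an assumption based on the LIKWID BRANCH group
    definition; they are checked here only against the output in this thread.
    """
    return {
        "CPI": clk_core / instr,
        "Branch rate": branches / instr,
        "Branch misprediction rate": mispredicted / instr,
        "Branch misprediction ratio": mispredicted / branches,
        "Instructions per branch": instr / branches,
    }


# Raw counts from the first Young run (HWThread 1).
young = branch_group_metrics(
    instr=2343230311,        # INSTR_RETIRED_ANY (FIXC0)
    clk_core=2288835488,     # CPU_CLK_UNHALTED_CORE (FIXC1)
    branches=367655406,      # BR_INST_RETIRED_ALL_BRANCHES (PMC0)
    mispredicted=8310,       # BR_MISP_RETIRED_ALL_BRANCHES (PMC1)
)

for name, value in young.items():
    print(f"{name:28s} {value:.6g}")
```

The recomputed values agree with the reported metric table to the printed precision, which suggests the counters read with user permissions are consistent with the derived metrics.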

- [X] Run the same on one of `Kathleen`'s nodes:

(EXCALIBUR) ✔ /lustre/home/ucaseko/Excalibur/dloads/STREAM [master|✔]
[ucaseko@node-c11a-069.kathleen:3 STREAM]$ hostname
node-c11a-069
(EXCALIBUR) ✔ /lustre/home/ucaseko/Excalibur/dloads/STREAM [master|✔]
[ucaseko@node-c11a-069.kathleen:3 STREAM]$ pwd
/lustre/home/ucaseko/Excalibur/dloads/STREAM
(EXCALIBUR) ✔ /lustre/home/ucaseko/Excalibur/dloads/STREAM [master|✔]
[ucaseko@node-c11a-069.kathleen:3 STREAM]$ ../../likwid-5.3.0/bin/likwid-perfctr -C S0:1 -g BRANCH ./stream_c.exe

CPU name:       Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
CPU type:       Intel Cascadelake SP processor
CPU clock:      2.49 GHz

STREAM version $Revision: 5.10 $

This system uses 8 bytes per array element.

Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.

Number of Threads requested = 1
Number of Threads counted = 1

Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 8126 microseconds.
   (= 8126 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.

WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10291.3     0.015586     0.015547     0.015614
Scale:          13636.7     0.011751     0.011733     0.011781
Add:            15642.4     0.015392     0.015343     0.015443
Triad:          15381.8     0.015624     0.015603     0.015647

Solution Validates: avg error less than 1.000000e-13 on all three arrays

Group 1: BRANCH
+------------------------------+---------+------------+
|             Event            | Counter | HWThread 1 |
+------------------------------+---------+------------+
|       INSTR_RETIRED_ANY      |  FIXC0  | 2343250440 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  | 2245719147 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1559575000 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  367660744 |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |       8244 |
+------------------------------+---------+------------+

+----------------------------+--------------+
|           Metric           |  HWThread 1  |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    |       0.6769 |
|    Runtime unhalted [s]    |       0.9004 |
|         Clock [MHz]        |    3591.3189 |
|             CPI            |       0.9584 |
|         Branch rate        |       0.1569 |
|  Branch misprediction rate | 3.518190e-06 |
| Branch misprediction ratio | 2.242285e-05 |
|   Instructions per branch  |       6.3734 |
+----------------------------+--------------+

themkots commented 8 months ago

- For comparison, a similar excerpt from the `Kathleen` node used above:

processor       : 79
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
stepping        : 7
microcode       : 0x5003003
cpu MHz         : 2500.000
cache size      : 28160 KB
physical id     : 1
siblings        : 40
core id         : 28
cpu cores       : 20
apicid          : 121
initial apicid  : 121
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
bogomips        : 5004.96
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:


- Not sure whether my CPU's ID tag,
`model name      : 13th Gen Intel(R) Core(TM) i7-1365U` vs `Kathleen`'s
`model name      : Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz`,
is what makes it snag, i.e. it starts with digits rather than letters ... 🤷 ... I will look into it. Maybe we need to add or request some new patterns in the `likwid` code for correct CPU identification.
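
One way to test this hypothesis is to compare the numeric identifiers of the two machines rather than the name strings. As far as I know, LIKWID selects the architecture from the CPUID family and model numbers, not from the `model name` text. That is an assumption to verify against the LIKWID source. The helper below is a hypothetical sketch that pulls those fields out of `/proc/cpuinfo`-style text, using values from the `Kathleen` excerpt as sample input.

```python
ID_KEYS = ("vendor_id", "cpu family", "model", "model name", "stepping")


def cpu_identity(cpuinfo_text):
    """Return the identifying fields of the first processor entry."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a "key : value" pair
        key = key.strip()
        # Keep only the first occurrence, i.e. the first logical CPU.
        if key in ID_KEYS and key not in fields:
            fields[key] = value.strip()
    return fields


# Sample input taken from the Kathleen excerpt in this thread.
sample = """\
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
stepping        : 7
"""

ident = cpu_identity(sample)
print(ident["cpu family"], ident["model"], ident["model name"])
```

On a Linux machine the same function can be applied to `open("/proc/cpuinfo").read()`. Comparing the `cpu family` and `model` pair of the laptop with the architectures LIKWID lists as supported should show whether the name string matters at all.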

tkoskela commented 7 months ago

The results on Young and Kathleen look promising! Could you check which other performance groups are available on those machines?
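
`likwid-perfctr -a` lists the performance groups LIKWID knows for the detected CPU. To see which of them actually work with user permissions, one option is to run STREAM once per group and record whether the command succeeds. This is only a sketch: `perfctr_command` and `probe_groups` are hypothetical names, the candidate groups other than `BRANCH` are assumptions, and treating a non-zero exit status as "not usable" is an assumption as well.

```python
import shutil
import subprocess


def perfctr_command(group, binary="./stream_c.exe", cpus="S0:1",
                    perfctr="likwid-perfctr"):
    """Build the same command line used in the runs above, for one group."""
    return [perfctr, "-C", cpus, "-g", group, binary]


def probe_groups(groups, run=subprocess.run):
    """Run the benchmark once per group; map group name -> success flag."""
    results = {}
    for group in groups:
        proc = run(perfctr_command(group), capture_output=True, text=True)
        # Assumption: a non-zero exit status means the group is not usable.
        results[group] = proc.returncode == 0
    return results


# Only attempt real runs when likwid-perfctr is on PATH.
if shutil.which("likwid-perfctr"):
    # Candidate names are assumptions; take the real list from `likwid-perfctr -a`.
    print(probe_groups(["BRANCH", "MEM", "FLOPS_DP"]))
```

The `run` parameter exists so the loop can be exercised without LIKWID installed, for example by passing a stub that returns a fixed exit status.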

tkoskela commented 7 months ago

LIKWID has https://github.com/RRZE-HPC/likwid/issues/468 open on support for newer Intel architectures. There does not seem to be much progress lately, so your laptop CPU might still be unsupported.