intel / perfmon

https://perfmon-events.intel.com
BSD 3-Clause "New" or "Revised" License

GLC: Mixed up ports in UOPS_DISPATCHED.PORT_X event #149

Open JanLJL opened 9 months ago

JanLJL commented 9 months ago

I believe there is a mistake in the documentation of the in-core events of SPR: UOPS_DISPATCHED.PORT_2_3_10 and UOPS_DISPATCHED.PORT_5_11 appear to be mixed up. The first should count uops dispatched on ports 2, 3, and 11, while the second should count uops dispatched on ports 5 and 10.

Based on the Intel Architectures Optimization Reference, we can see on page 62/63 that port 10 (p10) adds a simple integer ALU while port 11 (p11) is used for loading data and address generation. The documentation lists an event that counts the load uops on p2 and p3 together with the uops of a port used for integer arithmetic (p10), rather than with the load uops on p11, which made me doubt the event names.

So I added hardware performance counters (using likwid) to a simple benchmark code measuring an ADD on 32-bit general purpose registers, such as add r9d, r10d, where I am sure it should run on all ALU ports, i.e., p0, p1, p5, p6, and p10.

I counted the dispatched uops per port and, as a metric, print each port's share of all dispatched uops as a percentage, so a value of 100 for port 0 would mean that all dispatched uops were dispatched on p0.

Instructions per loop iteration: (32 add + 1 inc + 1 cmp + 1 jl) = 35 instructions (apparently there is no macro fusion happening because of the jl). We do 1,000,000 iterations --> 35,000,000 uops

+----------------------------------+---------+------------+
|               Event              | Counter | HWThread 0 |
+----------------------------------+---------+------------+
|         INSTR_RETIRED_ANY        |  FIXC0  |   35006880 |
|       CPU_CLK_UNHALTED_CORE      |  FIXC1  |    7049300 |
|       CPU_CLK_UNHALTED_REF       |  FIXC2  |    7048320 |
|    UOPS_DISPATCHED_PORT_PORT_0   |   PMC0  |    6675973 |
|    UOPS_DISPATCHED_PORT_PORT_1   |   PMC1  |    6719374 |
| UOPS_DISPATCHED_PORT_PORT_2_3_10 |   PMC2  |       3076 |
|   UOPS_DISPATCHED_PORT_PORT_4_9  |   PMC3  |       1405 |
|  UOPS_DISPATCHED_PORT_PORT_5_11  |   PMC4  |   13607280 |
|    UOPS_DISPATCHED_PORT_PORT_6   |   PMC5  |    7005328 |
|   UOPS_DISPATCHED_PORT_PORT_7_8  |   PMC6  |       1345 |
+----------------------------------+---------+------------+
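The retired-instruction count in the table above can be checked against the loop arithmetic from the issue text. This is a minimal sketch; the variable names are illustrative, and the numbers are copied from the post and the likwid output.

```python
# Loop body as described in the issue: 32 add + 1 inc + 1 cmp + 1 jl.
INSTRUCTIONS_PER_ITERATION = 32 + 1 + 1 + 1
ITERATIONS = 1_000_000

# INSTR_RETIRED_ANY (FIXC0) as reported by likwid in the table above.
MEASURED_INSTR_RETIRED = 35_006_880

expected = INSTRUCTIONS_PER_ITERATION * ITERATIONS
overhead = MEASURED_INSTR_RETIRED - expected

print(f"expected instructions: {expected}")
print(f"measured instructions: {MEASURED_INSTR_RETIRED}")
print(f"extra instructions outside the loop body: {overhead}")
```

The small surplus is presumably setup and measurement overhead around the loop, although the thread does not say so explicitly.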

+------------------------+------------+
|         Metric         | HWThread 0 |
+------------------------+------------+
|   Runtime (RDTSC) [s]  |     0.0035 |
|  Runtime unhalted [s]  |     0.0035 |
|       Clock [MHz]      |  2000.2744 |
|           CPI          |     0.2014 |
|          Port0         |        100 |
|    Port 0 occupation   |    19.6273 |
|    Port 1 occupation   |    19.7549 |
| Port 2/3/10 occupation |     0.0090 |
|   Port 4/9 occupation  |     0.0041 |
|  Port 5/11 occupation  |    40.0052 |
|    Port 6 occupation   |    20.5956 |
|   Port 7/8 occupation  |     0.0040 |
+------------------------+------------+
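The port occupation values above can be reproduced from the raw counters. The sketch below assumes the metric is each port counter divided by the sum of the seven port counters, times 100; the thread does not show the likwid formula, so this definition is an assumption, but it matches the reported values.

```python
# Raw UOPS_DISPATCHED_PORT_* counts from the likwid event table.
counts = {
    "0": 6_675_973,
    "1": 6_719_374,
    "2_3_10": 3_076,
    "4_9": 1_405,
    "5_11": 13_607_280,
    "6": 7_005_328,
    "7_8": 1_345,
}

# Assumed metric: share of all dispatched uops, in percent.
total = sum(counts.values())
shares = {port: 100.0 * n / total for port, n in counts.items()}

print(f"total dispatched uops: {total}")
for port, pct in shares.items():
    print(f"Port {port:<7} occupation: {pct:8.4f}")
```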

We can see that p0, p1, and p6 each receive about 20% of the dispatched uops, while p5/11 shows 40%. Since p11 is used for loads and we are not loading any data in the benchmark, this means either a) p10 is not used at all, even though it has an ALU, and the instruction is scheduled twice as often on p5, or b) p10 and p11 are actually swapped and each of the five ALU ports gets 20% of the dispatched uops. I think b) is the case, as it makes more sense.
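The reasoning above can be sketched as a comparison of predicted counter shares. The scenario names, the idealized 20%/40% splits, and the 1-point tolerance are illustrative assumptions, not part of the original measurement.

```python
# Measured shares (percent) from the likwid metric table.
measured = {"0": 19.6273, "1": 19.7549, "2_3_10": 0.0090, "4_9": 0.0041,
            "5_11": 40.0052, "6": 20.5956, "7_8": 0.0040}

# Counter -> ports, reading the event names literally.
groups_as_named = {"0": {0}, "1": {1}, "2_3_10": {2, 3, 10}, "4_9": {4, 9},
                   "5_11": {5, 11}, "6": {6}, "7_8": {7, 8}}
# Same counters with ports 10 and 11 exchanged (hypothesis b).
groups_swapped = {**groups_as_named, "2_3_10": {2, 3, 11}, "5_11": {5, 10}}

# Idealized uop share per port for the ADD benchmark.
even_alu = {0: 20.0, 1: 20.0, 5: 20.0, 6: 20.0, 10: 20.0}  # all five ALU ports used
p5_double = {0: 20.0, 1: 20.0, 5: 40.0, 6: 20.0}           # p10 idle, p5 takes its share

def predict(port_share, groups):
    """Sum the per-port shares into the counter groups."""
    return {name: sum(port_share.get(p, 0.0) for p in ports)
            for name, ports in groups.items()}

def fits(prediction, tolerance=1.0):
    """True if every counter is within `tolerance` points of the measurement."""
    return all(abs(prediction[name] - measured[name]) <= tolerance
               for name in measured)

scenarios = {
    "names literal, even ALU spread": predict(even_alu, groups_as_named),
    "a) p10 unused, p5 doubled": predict(p5_double, groups_as_named),
    "b) p10/p11 swapped, even ALU spread": predict(even_alu, groups_swapped),
}
for label, prediction in scenarios.items():
    print(f"{label}: fits measurement = {fits(prediction)}")
```

In this sketch the literal reading with an even spread is ruled out, while a) and b) predict the same counter values, so the choice between them rests on plausibility rather than on this measurement alone.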

Could you please confirm this and, if verified, change the documentation accordingly?

Thanks and best, Jan

edwarddavidbaker commented 9 months ago

@JanLJL Thank you for filing a very detailed issue! @vdaneti Please review the above notes and compare to SPR checkout data.

edwarddavidbaker commented 7 months ago

@vdaneti Did you receive documentation feedback on GLC ports 10 and 11?

https://cdrdv2.intel.com/v1/dl/getContent/671488 - Figure 2-2 and Table 2-3

JanLJL commented 3 weeks ago

Any updates on this one?

vdaneti commented 3 weeks ago

@edwarddavidbaker please reference the updated arch doc here

edwarddavidbaker commented 3 weeks ago

Re-assigning to myself as a reminder to link v51 of the Optimization Reference Manual when it is posted.

edwarddavidbaker commented 1 week ago

@boomanaiden154 Thanks for opening a ticket and linking the LLVM issue. We are determining the best method to implement documentation updates. I apologize for the delays.

boomanaiden154 commented 1 week ago

> We are determining the best method to implement documentation updates. I apologize for the delays.

All good on the timing. Everything is stable on our end, if a bit inconsistent. Given the plan is to update the documentation, it seems like the resolution was that perfmon was correct and the diagrams in the optimization manual need to have ports 10 and 11 swapped?

edwarddavidbaker commented 1 week ago

> We are determining the best method to implement documentation updates. I apologize for the delays.

> All good on the timing. Everything is stable on our end, if a bit inconsistent. Given the plan is to update the documentation, it seems like the resolution was that perfmon was correct and the diagrams in the optimization manual need to have ports 10 and 11 swapped?

Correct. Ports 10 and 11 need to be swapped in documentation for Golden Cove.

HaohaiWen commented 1 week ago

Another mistake is in the Intel Golden Cove instruction throughput and latency data: https://www.intel.com/content/www/us/en/content-details/723498/intel-processors-and-processor-cores-based-on-golden-cove-microarchitecture-instruction-throughput-and-latency.html Ports 10 and 11 also need to be swapped there.