Luke Gorrie's blog
Execution units and performance counters #3

Open lukego opened 9 years ago

lukego commented 9 years ago

Each Haswell CPU core has eight execution ports (numbered 0 to 7), each feeding special-purpose execution units that can work in parallel on different parts of an instruction: for example, calculating an address, loading an operand from memory, or performing arithmetic.

I realized today that pmu-tools offers some visibility into CPU performance counters that track how much work each execution unit is doing:

$ ocperf.py stat -e cycles,uops_executed_port.port_0,uops_executed_port.port_1,uops_executed_port.port_2,uops_executed_port.port_3,uops_executed_port.port_4,uops_executed_port.port_5,uops_executed_port.port_6,uops_executed_port.port_7 head -c 10000000 /dev/urandom > /dev/null
 Performance counter stats for 'head -c 10000000 /dev/urandom':

     2,065,534,404      cycles                    [44.69%]
       705,149,766      uops_executed_port_port_0                                    [44.93%]
       728,047,007      uops_executed_port_port_1                                    [44.94%]
       405,801,626      uops_executed_port_port_2                                    [44.94%]
       441,800,214      uops_executed_port_port_3                                    [44.50%]
       289,902,540      uops_executed_port_port_4                                    [44.06%]
       733,201,801      uops_executed_port_port_5                                    [44.05%]
       786,927,002      uops_executed_port_port_6                                    [44.64%]
       174,929,604      uops_executed_port_port_7                                    [44.44%]

       0.908605822 seconds time elapsed
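
To make the raw counts easier to compare, they can be divided by the cycle count to get an approximate "uops per cycle" figure for each port. The sketch below just copies the numbers from the output above; the helper name `uops_per_cycle` is made up for illustration. It assumes each port accepts at most one uop per cycle, so the ratio reads roughly as a utilization fraction. The bracketed percentages suggest the counters were multiplexed, so treat the results as scaled estimates rather than exact values.

```python
# Counter values copied from the ocperf.py output above.
cycles = 2_065_534_404
uops_by_port = {
    0: 705_149_766,
    1: 728_047_007,
    2: 405_801_626,
    3: 441_800_214,
    4: 289_902_540,
    5: 733_201_801,
    6: 786_927_002,
    7: 174_929_604,
}


def uops_per_cycle(uops, total_cycles):
    """Return {port: uops / cycles} (hypothetical helper, not part of pmu-tools)."""
    return {port: count / total_cycles for port, count in uops.items()}


utilization = uops_per_cycle(uops_by_port, cycles)

# Print one line per port, lowest port number first.
for port, ratio in sorted(utilization.items()):
    print(f"port {port}: {ratio:.2f} uops/cycle")
```

No port comes close to one uop per cycle in this run, which hints that the workload is not limited by any single execution port.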

This seems rather nifty. Lately I have needed more visibility into the CPU for debugging difficult performance problems, like collisions due to cache associativity.

I would love to get better at auditing performance counters. Tips welcome! ("Ten CPU Performance Counters You Won't Believe You Ever Lived Without"?)

lukego commented 9 years ago

The output above makes sense. The workload is generating pseudo-random numbers from /dev/urandom, and the busiest ports are 0, 1, 5, and 6, which are exactly the ones that can perform integer arithmetic. That is gratifying :-).
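
The same observation can be checked with a little arithmetic by grouping the ports by role and summing their counts. This is a rough sketch: the grouping (ports 0, 1, 5, 6 for arithmetic; 2 and 3 for load/store addresses; 4 for store data; 7 for store addresses) is my understanding of the Haswell port layout rather than something the counters report, and the names `PORT_ROLES` and `uops_by_role` are invented for the example.

```python
from collections import defaultdict

# Counter values copied from the ocperf.py output earlier in this post.
uops_by_port = {
    0: 705_149_766,
    1: 728_047_007,
    2: 405_801_626,
    3: 441_800_214,
    4: 289_902_540,
    5: 733_201_801,
    6: 786_927_002,
    7: 174_929_604,
}

# Assumed role of each Haswell port (illustrative grouping, see lead-in).
PORT_ROLES = {
    0: "arithmetic",
    1: "arithmetic",
    5: "arithmetic",
    6: "arithmetic",
    2: "memory",
    3: "memory",
    4: "memory",
    7: "memory",
}


def uops_by_role(uops, roles):
    """Sum uop counts per role (hypothetical helper)."""
    totals = defaultdict(int)
    for port, count in uops.items():
        totals[roles[port]] += count
    return dict(totals)


totals = uops_by_role(uops_by_port, PORT_ROLES)
total_uops = sum(totals.values())
arithmetic_share = totals["arithmetic"] / total_uops

for role, count in sorted(totals.items()):
    print(f"{role}: {count:,} uops ({count / total_uops:.0%})")
```

Under that grouping, a bit more than two thirds of the executed uops in this run went to the arithmetic ports.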