umd-memsys / DRAMSim2

DRAMSim2: A cycle accurate DRAM simulator
http://www.ece.umd.edu/~blj/papers/cal10-1.pdf

AddressMapping: precompute log2 values to increase performance #52

Closed: cota closed this 9 years ago

cota commented 9 years ago

The log2 function is slow, but that by itself is not the problem; calling it in the hot path is what actually hurts performance.
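
A minimal C++ sketch of the pattern described above. This is not DRAMSim2's actual AddressMapping code: the geometry variables, the field layout (channel, column, bank, rank, row from low to high bits) and the `slowLog2`/`takeBits` helpers are hypothetical. It shows an address decoder that recomputes every bit width with a loop-based log2 on each call, even though the inputs never change during a run.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical geometry; in the simulator these would come from the .ini
// files and are only known at run time. All values are assumed to be powers of two.
static unsigned NUM_CHANS = 1, NUM_RANKS = 2, NUM_BANKS = 8;
static unsigned NUM_ROWS = 16384, NUM_COLS = 1024;
static unsigned TRANSACTION_BYTES = 64;

// floor(log2(v)) by repeated shifting: a data-dependent loop on every call.
static unsigned slowLog2(unsigned v) {
    assert(v != 0);
    unsigned r = 0;
    while (v >>= 1) ++r;
    return r;
}

struct Decoded { unsigned chan, rank, bank, row, col; };

// Removes the low `bits` bits of `a` and returns them.
static unsigned takeBits(uint64_t &a, unsigned bits) {
    unsigned v = static_cast<unsigned>(a & ((uint64_t{1} << bits) - 1));
    a >>= bits;
    return v;
}

// Hot path: called once per transaction, and every call recomputes
// six logarithms whose inputs are constant for the whole simulation.
Decoded decodeNaive(uint64_t addr) {
    addr >>= slowLog2(TRANSACTION_BYTES);          // drop byte offset
    Decoded d;
    d.chan = takeBits(addr, slowLog2(NUM_CHANS));  // hypothetical layout,
    d.col  = takeBits(addr, slowLog2(NUM_COLS));   // low to high bits:
    d.bank = takeBits(addr, slowLog2(NUM_BANKS));  // chan, col, bank,
    d.rank = takeBits(addr, slowLog2(NUM_RANKS));  // rank, row
    d.row  = takeBits(addr, slowLog2(NUM_ROWS));
    return d;
}
```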

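The appended patch precomputes the frequently used log2 values so that they are no longer computed in the hot path. In the measurements below, this reduces the instructions executed by about 31% (from 120.3 billion to 82.7 billion) and the elapsed time by about 24% (from 20.51 s to 15.57 s), a speedup of roughly 1.32x.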
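
A sketch of the precomputation idea, again hypothetical rather than the patch itself: the `AddressDecoder` class, its constructor parameters and the field layout are illustrative assumptions. The bit widths are computed once, when the geometry becomes known, and the per-transaction `decode` does only shifts and masks.

```cpp
#include <cassert>
#include <cstdint>

// floor(log2(v)); with precomputation this only runs at construction time.
static unsigned log2Floor(unsigned v) {
    assert(v != 0);
    unsigned r = 0;
    while (v >>= 1) ++r;
    return r;
}

static bool isPow2(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }

// Hypothetical decoder: bit widths are computed once and reused for every address.
class AddressDecoder {
public:
    struct Fields { unsigned chan, rank, bank, row, col; };

    AddressDecoder(unsigned chans, unsigned ranks, unsigned banks,
                   unsigned rows, unsigned cols, unsigned txBytes)
        : offsetBits_(log2Floor(txBytes)),
          chanBits_(log2Floor(chans)), rankBits_(log2Floor(ranks)),
          bankBits_(log2Floor(banks)), rowBits_(log2Floor(rows)),
          colBits_(log2Floor(cols)) {
        // Shift-and-mask decoding is only exact for power-of-two sizes.
        assert(isPow2(chans) && isPow2(ranks) && isPow2(banks) &&
               isPow2(rows) && isPow2(cols) && isPow2(txBytes));
    }

    // Hot path: no logarithms, just shifts and masks with cached widths.
    Fields decode(uint64_t addr) const {
        addr >>= offsetBits_;
        Fields f;
        f.chan = take(addr, chanBits_);  // hypothetical layout, low to high:
        f.col  = take(addr, colBits_);   // chan, col, bank, rank, row
        f.bank = take(addr, bankBits_);
        f.rank = take(addr, rankBits_);
        f.row  = take(addr, rowBits_);
        return f;
    }

private:
    // Removes the low `bits` bits of `a` and returns them.
    static unsigned take(uint64_t &a, unsigned bits) {
        unsigned v = static_cast<unsigned>(a & ((uint64_t{1} << bits) - 1));
        a >>= bits;
        return v;
    }

    unsigned offsetBits_, chanBits_, rankBits_, bankBits_, rowBits_, colBits_;
};
```

Usage: construct the decoder once, e.g. `AddressDecoder dec(1, 2, 8, 16384, 1024, 64);`, then call `dec.decode(addr)` for each transaction. The log2 cost moves from once per transaction to once per run.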

This closes issue #46: simulation performance.

Before the patch:

Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

  20507.071226 task-clock                #    1.000 CPUs utilized            ( +-  1.09% )
            61 context-switches          #    0.000 M/sec                    ( +-  0.95% )
             0 CPU-migrations            #    0.000 M/sec
           468 page-faults               #    0.000 M/sec
58,683,786,689 cycles                    #    2.862 GHz                      ( +-  0.69% ) [83.33%]
13,434,240,170 stalled-cycles-frontend   #   22.89% frontend cycles idle     ( +-  2.80% ) [83.34%]
 5,915,970,070 stalled-cycles-backend    #   10.08% backend  cycles idle     ( +-  4.87% ) [66.67%]
120,280,797,002 instructions             #    2.05  insns per cycle
                                         #    0.11  stalled cycles per insn  ( +-  0.00% ) [83.33%]
23,425,385,282 branches                  # 1142.308 M/sec                    ( +-  0.01% ) [83.34%]
   226,637,631 branch-misses             #    0.97% of all branches          ( +-  1.03% ) [83.33%]
  20.514895432 seconds time elapsed                                          ( +-  1.09% )

After:

Performance counter stats for './DRAMSim -t traces/k6_aoe_02_short.trc -d ini/DDR3_micron_16M_8B_x8_sg15.ini -s system.ini.example -c 10000000' (3 runs):

  15562.506598 task-clock                #    1.000 CPUs utilized            ( +-  0.72% )
            55 context-switches          #    0.000 M/sec
             0 CPU-migrations            #    0.000 M/sec                    ( +-100.00% )
           469 page-faults               #    0.000 M/sec                    ( +-  0.07% )
43,650,612,082 cycles                    #    2.805 GHz                      ( +-  0.58% ) [83.33%]
11,878,548,969 stalled-cycles-frontend   #   27.21% frontend cycles idle     ( +-  1.46% ) [83.33%]
 6,125,126,936 stalled-cycles-backend    #   14.03% backend  cycles idle     ( +-  3.74% ) [66.67%]
82,655,485,444 instructions              #    1.89  insns per cycle
                                         #    0.14  stalled cycles per insn  ( +-  0.01% ) [83.33%]
14,515,927,254 branches                  #  932.750 M/sec                    ( +-  0.02% ) [83.34%]
   235,566,078 branch-misses             #    1.62% of all branches          ( +-  1.87% ) [83.34%]

  15.568698124 seconds time elapsed                                          ( +-  0.72% )

Signed-off-by: Emilio G. Cota <cota@braap.org>