Compute cache & register performance. Investigate gcc vectorization logs

bennahugo commented 10 years ago

In single-correlation-term gridding mode the perf logs show that the L1 cache hit rate is quite good (>99.8%), but at the other cache levels the hit rate drops significantly, to 46.32%. I'm refactoring the code to unroll some of the inner loops (specifically the channel loop) to improve overall cache performance. This should give a significant speedup.
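For reference, a minimal sketch of what unrolling a channel loop by hand can look like. This is not the bullseye gridding kernel: the function names and the weighted-sum computation are hypothetical stand-ins, and the sketch assumes `w` has at least as many entries as `vis`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference kernel: weighted sum over the channel axis.
// Assumes w.size() >= vis.size().
float weighted_sum(const std::vector<float>& vis, const std::vector<float>& w) {
    float acc = 0.0f;
    for (std::size_t c = 0; c < vis.size(); ++c) {
        acc += vis[c] * w[c];
    }
    return acc;
}

// Same computation with the channel loop unrolled by 4.
// Four independent accumulators remove the loop-carried dependency
// on a single running sum; the tail loop handles leftover channels.
float weighted_sum_unrolled4(const std::vector<float>& vis, const std::vector<float>& w) {
    const std::size_t n = vis.size();
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t c = 0;
    for (; c + 4 <= n; c += 4) {
        a0 += vis[c]     * w[c];
        a1 += vis[c + 1] * w[c + 1];
        a2 += vis[c + 2] * w[c + 2];
        a3 += vis[c + 3] * w[c + 3];
    }
    float acc = (a0 + a1) + (a2 + a3);
    for (; c < n; ++c) {
        acc += vis[c] * w[c];
    }
    return acc;
}
```

The unrolled version sums in a different order, so results can differ in the last bits for general floating-point input. Building with GCC's `-O3 -fopt-info-vec-missed` should report which loops the vectorizer gave up on, which may show whether manual unrolling is needed at all.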

bennahugo commented 10 years ago

Refactored code a bit (without unrolling), cache hits seem to be on the rise:

    perf stat -r 5 -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses ./benchmark 1000000 1024 1024 8 4 1 7
    ....
    Performance counter stats for './benchmark 1000000 1024 1024 8 4 1 7' (5 runs):

             5,637,659 cache-references                                        ( +- 13.63% ) [66.66%]
             2,310,096 cache-misses            # 40.976 % of all cache refs    ( +- 15.66% ) [66.67%]
        14,420,034,592 L1-dcache-loads                                         ( +-  0.05% ) [66.68%]
            18,428,127 L1-dcache-load-misses   # 0.13% of all L1-dcache hits   ( +-  4.23% ) [66.69%]
         3,067,134,265 L1-dcache-stores                                        ( +-  0.21% ) [66.66%]
             8,084,804 L1-dcache-store-misses                                  ( +-  4.11% ) [66.65%]

           9.107082031 seconds time elapsed                                    ( +-  0.07% )
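The percentages perf prints are just misses divided by references. A small sketch to reproduce them from the raw counters (the helper name `miss_rate_percent` is made up for illustration):

```cpp
#include <cstdint>

// Miss rate in percent, given raw counter values as reported by perf stat.
// Returns 0 when there were no references, to avoid dividing by zero.
double miss_rate_percent(std::uint64_t misses, std::uint64_t references) {
    if (references == 0) {
        return 0.0;
    }
    return 100.0 * static_cast<double>(misses) / static_cast<double>(references);
}
```

With the counters above, `miss_rate_percent(2310096, 5637659)` comes out at about 40.98%, and `miss_rate_percent(18428127, 14420034592)` at about 0.13%, i.e. an L1 load hit rate of roughly 99.87%.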

bennahugo commented 10 years ago

Could not significantly improve either the L1 or the L2/L3 cache performance. Ultimately we're at the mercy of reading from and writing to a 2D area in memory, so cache performance will never be great.
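To illustrate why a 2D footprint is hard on the cache, here is a rough sketch that counts the distinct cache lines touched by a square patch in a row-major grid. Everything in it is an assumption made for illustration: the row-major layout, the 64-byte line size, the element size, and the function name. It also assumes elements are aligned so that none straddles a line boundary.

```cpp
#include <cstddef>
#include <set>

// Count distinct cache lines touched when accessing a support x support
// patch whose top-left element is (x0, y0) in a row-major grid.
// Assumes each element lies entirely inside one cache line.
std::size_t cache_lines_touched(std::size_t grid_width,
                                std::size_t x0, std::size_t y0,
                                std::size_t support,
                                std::size_t elem_bytes,
                                std::size_t line_bytes = 64) {
    std::set<std::size_t> lines;
    for (std::size_t dy = 0; dy < support; ++dy) {
        for (std::size_t dx = 0; dx < support; ++dx) {
            const std::size_t byte_offset =
                ((y0 + dy) * grid_width + (x0 + dx)) * elem_bytes;
            lines.insert(byte_offset / line_bytes);
        }
    }
    return lines.size();
}
```

For a hypothetical 1024-wide grid of 8-byte elements, a 7x7 patch starting at column 0 touches 7 lines, but the same patch starting at column 5 touches 14, because every row then spills into a second line. Consecutive rows are also 8 KiB apart, so the lines of one patch are scattered rather than adjacent.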

ratt-ru / bullseye

Compute cache & register performance. Investigate gcc vectorization logs #21