It is based on the average memory cycle defined in the computer architecture book (see page 75).
PAPI does not expose every counter you will need, and the available set varies by architecture. We used something like the following:
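def compute_memory_cycle_for_one_group(row, arch_params):
    # Raw PAPI counters for this group (row is presumably a pandas group, given .values).
    dl1_miss = row['PAPI_L1_DCM'].values    # L1 data cache misses
    dl2_miss = row['PAPI_L2_DCM'].values    # L2 data cache misses
    dl3_miss = row['PAPI_L3_TCM'].values    # L3 total cache misses
    dl1_access = row['PAPI_LST_INS'].values # load/store instructions, used as L1 accesses

    # Local miss rate of each level: misses divided by accesses reaching that level.
    l1_mr = dl1_miss / dl1_access
    l2_mr = dl2_miss / dl1_miss
    l3_mr = dl3_miss / dl2_miss

    # Access cost of each level, supplied by the caller per architecture.
    l1_access_cost = arch_params['L1_ACCESS_TIME']
    l2_access_cost = arch_params['L2_ACCESS_TIME']
    l3_access_cost = arch_params['L3_ACCESS_TIME']
    mm_access_cost = arch_params['MAIN_MEMORY_ACCESS_TIME']

    # Average cost per access: each level's cost is weighted by the miss rates above it.
    avg_mem_cycle = l1_access_cost + l1_mr*(l2_access_cost + l2_mr*(l3_access_cost + l3_mr*mm_access_cost))

    # Total memory cycles: average cost times the number of accesses.
    exec_cycle = avg_mem_cycle * dl1_access
    return exec_cycle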
You will need some architecture parameters. You can improve the code by finding more accurate counters.
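To make the arithmetic concrete, here is a minimal scalar sketch of the same nested formula. The counter readings and the latencies are made-up numbers chosen for easy arithmetic, not measurements from any real machine, and the helper name `average_memory_access_cycles` is hypothetical.

```python
def average_memory_access_cycles(l1_mr, l2_mr, l3_mr, latency):
    """Average cost per memory access for a three-level cache hierarchy.

    Each miss rate is local: misses at a level divided by accesses reaching it.
    """
    return latency["L1"] + l1_mr * (
        latency["L2"] + l2_mr * (latency["L3"] + l3_mr * latency["MEM"])
    )


# Hypothetical access latencies in cycles (assumed values, not from a datasheet).
latency = {"L1": 4, "L2": 12, "L3": 40, "MEM": 200}

# Hypothetical PAPI counter readings for one run.
lst_ins = 1_000_000  # PAPI_LST_INS: load/store instructions
l1_dcm = 100_000     # PAPI_L1_DCM: L1 data cache misses
l2_dcm = 20_000      # PAPI_L2_DCM: L2 data cache misses
l3_tcm = 5_000       # PAPI_L3_TCM: L3 total cache misses

l1_mr = l1_dcm / lst_ins  # 0.10 of accesses miss L1
l2_mr = l2_dcm / l1_dcm   # 0.20 of L1 misses also miss L2
l3_mr = l3_tcm / l2_dcm   # 0.25 of L2 misses also miss L3

amat = average_memory_access_cycles(l1_mr, l2_mr, l3_mr, latency)
total_cycles = amat * lst_ins

# By hand: 4 + 0.1 * (12 + 0.2 * (40 + 0.25 * 200)) = 7 cycles per access.
print(f"average cycles per access: {amat:.2f}")
print(f"estimated memory cycles:   {total_cycles:.0f}")
```

With real data, a level that records zero misses makes the next level's miss rate a division by zero, so you may want to guard those ratios.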
Thanks, that makes sense! I found that VTune also provides an average latency metric, but it's good to calculate and verify it from scratch.
The "Executor Evaluation" section of the HDagg paper says that "The average memory access latency is used as a metric to measure locality." and "PAPI’s performance counters are used to measure architecture information needed in computations related to the locality and load balance metrics."
In the sptrsv_profiler.cpp example, the PAPI event list is: https://github.com/sympiler/aggregation/blob/da293bb1d1060bc390ad785978f8452943a8909c/profiler/sptrsv_profiler.cpp#L106-L111
How is the memory latency obtained from the above metrics?