How is the average memory access latency measured or calculated?

learning-chip commented 1 year ago

In the HDagg paper Section "Executor Evaluation" it is said that "The average memory access latency is used as a metric to measure locality." and "PAPI’s performance counters are used to measure architecture information needed in computations related to the locality and load balance metrics."

In the sptrsv_profiler.cpp example, the PAPI event list is: https://github.com/sympiler/aggregation/blob/da293bb1d1060bc390ad785978f8452943a8909c/profiler/sptrsv_profiler.cpp#L106-L111

I wonder how the memory latency is obtained from the above events?

cheshmi commented 1 year ago

It is based on the average memory cycle defined in the computer architecture book (see page 75).
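For reference, the expression computed by the snippet in this comment can be written as the nested form below. The symbols are my own notation rather than the book's: t is the access cost of a level and m is its local miss rate (misses at that level divided by accesses that reached it).

```latex
\text{avg\_mem\_cycle}
  = t_{L1} + m_{L1}\bigl(t_{L2} + m_{L2}\bigl(t_{L3} + m_{L3}\, t_{\text{mem}}\bigr)\bigr),
\qquad
m_{L1} = \frac{\text{L1 misses}}{\text{L1 accesses}},\quad
m_{L2} = \frac{\text{L2 misses}}{\text{L1 misses}},\quad
m_{L3} = \frac{\text{L3 misses}}{\text{L2 misses}}
```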

PAPI does not provide every counter you will need, and the available set varies per architecture. We used something like the following:

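def compute_memory_cycle_for_one_group(row, arch_params):
    # `row` holds one column per PAPI event (`.values` implies a pandas object).
    dl1_miss = row['PAPI_L1_DCM'].values    # L1 data cache misses
    dl2_miss = row['PAPI_L2_DCM'].values    # L2 data cache misses
    dl3_miss = row['PAPI_L3_TCM'].values    # L3 total cache misses
    dl1_access = row['PAPI_LST_INS'].values # load/store instructions, used as L1 accesses
    # Local miss rate per level: misses divided by the accesses that reached that level.
    # Note: these divisions are undefined when a denominator counter is zero.
    l1_mr = dl1_miss / dl1_access
    l2_mr = dl2_miss / dl1_miss
    l3_mr = dl3_miss / dl2_miss
    # Access costs come from the architecture parameters, not from PAPI.
    l1_access_cost = arch_params['L1_ACCESS_TIME']
    l2_access_cost = arch_params['L2_ACCESS_TIME']
    l3_access_cost = arch_params['L3_ACCESS_TIME']
    mm_access_cost = arch_params['MAIN_MEMORY_ACCESS_TIME']
    # Average cost of one memory access, expanded level by level.
    avg_mem_cycle = l1_access_cost + l1_mr*(l2_access_cost + l2_mr*(l3_access_cost + l3_mr*mm_access_cost))
    # Total memory cycles: average cost times the number of accesses.
    exec_cycle = avg_mem_cycle * dl1_access
    return exec_cycle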

You will need some architecture parameters. You can improve the code by finding more accurate counters.
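As a quick sanity check of the formula above, here is a minimal standalone sketch in plain Python. The counter values and the latencies in `EXAMPLE_ARCH` are made-up placeholders (not measurements from any machine), and `safe_ratio` and `average_memory_cycles` are hypothetical helper names. Unlike the snippet above, it guards against zero denominators.

```python
# Hypothetical access costs in cycles; replace with values for your own CPU.
EXAMPLE_ARCH = {
    "L1_ACCESS_TIME": 4,
    "L2_ACCESS_TIME": 12,
    "L3_ACCESS_TIME": 40,
    "MAIN_MEMORY_ACCESS_TIME": 200,
}


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def average_memory_cycles(counters, arch):
    """Average cycles per memory access from PAPI-style counter totals."""
    l1_mr = safe_ratio(counters["PAPI_L1_DCM"], counters["PAPI_LST_INS"])
    l2_mr = safe_ratio(counters["PAPI_L2_DCM"], counters["PAPI_L1_DCM"])
    l3_mr = safe_ratio(counters["PAPI_L3_TCM"], counters["PAPI_L2_DCM"])
    return arch["L1_ACCESS_TIME"] + l1_mr * (
        arch["L2_ACCESS_TIME"]
        + l2_mr * (arch["L3_ACCESS_TIME"] + l3_mr * arch["MAIN_MEMORY_ACCESS_TIME"])
    )


# Made-up counter totals for illustration only.
example_counters = {
    "PAPI_LST_INS": 1000,  # load/store instructions, used as L1 accesses
    "PAPI_L1_DCM": 100,    # miss rate 0.1
    "PAPI_L2_DCM": 20,     # local miss rate 0.2
    "PAPI_L3_TCM": 5,      # local miss rate 0.25
}

avg = average_memory_cycles(example_counters, EXAMPLE_ARCH)
total = avg * example_counters["PAPI_LST_INS"]
print(f"average cycles per access: {avg:.3f}")
print(f"total memory cycles: {total:.1f}")
```

With these placeholder numbers the average works out to about 7 cycles per access: 4 + 0.1 * (12 + 0.2 * (40 + 0.25 * 200)).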

learning-chip commented 1 year ago

Thanks, that makes sense! I found that VTune also provides an average latency metric, but it's good to calculate and verify it from scratch.

sympiler / aggregation

How is the average memory access latency measured or calculated? #11