icl-utk-edu / papi

PAPI ROCm: Missed Reads Intercept Mode #69

Open jrodgers-github opened 1 year ago

jrodgers-github commented 1 year ago

I'm finding evidence that PAPI ROCm PAPI_read operations miss results when executed in intercept mode. Sample workflows highlighting what I'm seeing:

Consulting with @gcongiu, this may be expected behavior as:

In intercept mode, PAPI_read(s) that happen before a kernel has finished running and/or before rocprofiler has fetched the kernel counters return whatever value was present until that point in the eventset counters (the component does not synchronize the GPU stream internally like old cuda component used to do). Otherwise, it reads the new counters (get_context_counters).

In your example above the behavior looks consistent with the ROCm component's code. If you wish to read counters for a kernel, in intercept mode, you should synchronize the stream first to make sure the kernel has finished running and the counters are collected.

However, synchronizing the streams alone does not appear to be enough to prevent the undesirable behavior: I’m still detecting the issue after calling hipDeviceSynchronize before and after each read (after is overkill, but I wanted to be sure). Additionally, I'm finding that pairing the device/stream synchronization with any of the following is also unfruitful:

If possible, it would be ideal if we could find a means of enforcing a synchronization such that the counters are resolved with each PAPI_read.

jrodgers-github commented 1 year ago

Attached you will find vector_add.zip, which shows the following behavior on select platforms:

*****ROCm DRIVER VERSION*****
======================= ROCm System Management Interface =======================
========================= Version of System Component ==========================
Driver version: 6.0.5
================================================================================
============================= End of ROCm SMI Log ==============================
*****COMPILE*****
/opt/rocm-5.5.1/hip/bin/hipcc -g --offload-arch=gfx90a -o vector_add.o -c -I/opt/rocm-5.5.1/include -I/opt/rocm-5.5.1/include/hsa -I/<PAPI_PATH>/include vector_add.cpp
/opt/rocm-5.5.1/hip/bin/hipcc -o vector_add vector_add.o -L/opt/rocm-5.5.1/lib -lhsa-runtime64 -L/<PAPI_PATH>/lib64 -lpapi -I/opt/rocm-5.5.1/include -I/opt/rocm-5.5.1/include/hsa -I/<PAPI_PATH>/include 
*****RUN*****
PAPI_read Before Kernel Launch
[JR-DEBUG] intercept_ctx_read dispatch_count=0
              rocm:::GPUBusy:device=0 = 0
              rocm:::SQ_WAVES:device=0 = 0
HIP Kernel Launch
hipDeviceSynchronize
PAPI_read After Kernel Launch
[JR-DEBUG] intercept_ctx_read dispatch_count=0
              rocm:::GPUBusy:device=0 = 0
              rocm:::SQ_WAVES:device=0 = 0
PAPI_read from PAPI_stop
[JR-DEBUG] intercept_ctx_read dispatch_count=1
              rocm:::GPUBusy:device=0 = 100
              rocm:::SQ_WAVES:device=0 = 16384
PASSED!

Note: in the above, the “[JR-DEBUG] intercept_ctx_read dispatch_count={0,1}” lines are a result of adding the following patch to PAPI:

@@ intercept_ctx_read(rocp_ctx_t rocp_ctx, long long **counts)

     unsigned long tid = (*thread_id_fn)();
     int dispatch_count = fetch_dispatch_counter(tid);
+// BEGIN JR TESTING
+    fprintf(stderr, "[JR-DEBUG] intercept_ctx_read dispatch_count=%d\n", dispatch_count);
+// END JR TESTING
     if (dispatch_count == 0) {
         *counts = rocp_ctx->u.intercept.counters;
         goto fn_exit;
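
To make the early-exit path in the patched function easier to reason about, here is a toy model in plain C. It is only a sketch and not the PAPI source: `toy_ctx_t`, `toy_ctx_init`, `toy_dispatch_complete`, and `toy_ctx_read` are hypothetical names, and the way values are accumulated and the dispatch count is reset after a read are assumptions made for illustration. The only part taken from the component is the rule that a read with `dispatch_count == 0` hands back the previously stored counters.

```c
#include <assert.h>
#include <string.h>

#define TOY_NUM_EVENTS 2  /* e.g. GPUBusy and SQ_WAVES in the log above */

/* Hypothetical stand-in for the per-eventset state; not a PAPI type. */
typedef struct {
    long long counters[TOY_NUM_EVENTS]; /* last values handed back to the reader */
    long long pending[TOY_NUM_EVENTS];  /* values from completed dispatches, not yet read */
    int dispatch_count;                 /* completed dispatches since the last read */
} toy_ctx_t;

static void toy_ctx_init(toy_ctx_t *ctx)
{
    memset(ctx, 0, sizeof(*ctx));
}

/* Models the profiler reporting that one kernel's counters were collected. */
static void toy_dispatch_complete(toy_ctx_t *ctx, const long long *values)
{
    for (int i = 0; i < TOY_NUM_EVENTS; ++i) {
        ctx->pending[i] += values[i];
    }
    ctx->dispatch_count++;
}

/* Models the read path: with no completed dispatch, return the cached values. */
static const long long *toy_ctx_read(toy_ctx_t *ctx)
{
    if (ctx->dispatch_count == 0) {
        return ctx->counters;           /* early exit, nothing new to report */
    }
    /* Assumption: fold the pending values in and reset the count. */
    for (int i = 0; i < TOY_NUM_EVENTS; ++i) {
        ctx->counters[i] += ctx->pending[i];
        ctx->pending[i] = 0;
    }
    ctx->dispatch_count = 0;
    return ctx->counters;
}
```

In this model, any `toy_ctx_read` issued before `toy_dispatch_complete` has run returns the old values, regardless of what the caller did in between. If the real dispatch counter is only updated at some point after `hipDeviceSynchronize` returns (which is the open question here), that would match the `dispatch_count=0` line printed for the read after the kernel launch.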

Environment configuration prior to launching reproducer:

# Setup PAPI
export PAPI_ROCM_ROOT=${ROCM_PATH}
export ROCP_METRICS=${PAPI_ROCM_ROOT}/rocprofiler/lib/metrics.xml
export HSA_TOOLS_LIB=${PAPI_ROCM_ROOT}/rocprofiler/lib/librocprofiler64.so
# Set PAPI to use intercept instead of default sampling
export ROCP_HSA_INTERCEPT=1

Let me know if there are any issues getting the reproducer going.