Open jrodgers-github opened 1 year ago
Attached you will find vector_add.zip, which shows the following behavior on select platforms:
*****ROCm DRIVER VERSION*****
======================= ROCm System Management Interface =======================
========================= Version of System Component ==========================
Driver version: 6.0.5
================================================================================
============================= End of ROCm SMI Log ==============================
*****COMPILE*****
/opt/rocm-5.5.1/hip/bin/hipcc -g --offload-arch=gfx90a -o vector_add.o -c -I/opt/rocm-5.5.1/include -I/opt/rocm-5.5.1/include/hsa -I/<PAPI_PATH>/include vector_add.cpp
/opt/rocm-5.5.1/hip/bin/hipcc -o vector_add vector_add.o -L/opt/rocm-5.5.1/lib -lhsa-runtime64 -L/<PAPI_PATH>/lib64 -lpapi -I/opt/rocm-5.5.1/include -I/opt/rocm-5.5.1/include/hsa -I/<PAPI_PATH>/include
*****RUN*****
PAPI_read Before Kernel Launch
[JR-DEBUG] intercept_ctx_read dispatch_count=0
rocm:::GPUBusy:device=0 = 0
rocm:::SQ_WAVES:device=0 = 0
HIP Kernel Launch
hipDeviceSynchronize
PAPI_read After Kernel Launch
[JR-DEBUG] intercept_ctx_read dispatch_count=0
rocm:::GPUBusy:device=0 = 0
rocm:::SQ_WAVES:device=0 = 0
PAPI_read from PAPI_stop
[JR-DEBUG] intercept_ctx_read dispatch_count=1
rocm:::GPUBusy:device=0 = 100
rocm:::SQ_WAVES:device=0 = 16384
PASSED!
Note: in the above, the “[JR-DEBUG] intercept_ctx_read dispatch_count={0,1}” lines are a result of adding the following patch to PAPI:
@@ intercept_ctx_read(rocp_ctx_t rocp_ctx, long long **counts)
unsigned long tid = (*thread_id_fn)();
int dispatch_count = fetch_dispatch_counter(tid);
+// BEGIN JR TESTING
+ fprintf(stderr, "[JR-DEBUG] intercept_ctx_read dispatch_count=%d\n", dispatch_count);
+// END JR TESTING
if (dispatch_count == 0) {
*counts = rocp_ctx->u.intercept.counters;
goto fn_exit;
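The early exit shown in the patch excerpt can be illustrated with a small standalone model. This is a toy sketch, not PAPI's code: ToyContext, toy_record_dispatch, and toy_read are hypothetical names, and the behavior of the non-zero dispatch_count path (folding collected values into the returned counters) is an assumption, since the excerpt only shows the zero case. It only demonstrates why a read keeps returning the previously cached values for as long as the per-thread dispatch count is still 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <map>
#include <vector>

// Toy stand-in for the intercept-mode context. NOT PAPI's real rocp_ctx_t.
struct ToyContext {
    std::vector<long long> counters;          // values last handed back to the caller
    std::vector<long long> pending;           // values collected from finished dispatches (hypothetical)
    std::map<unsigned long, int> dispatches;  // per-thread dispatch count

    explicit ToyContext(std::size_t n) : counters(n, 0), pending(n, 0) {}
};

// Hypothetical hook: records that one dispatch issued by thread `tid`
// has had its counter values collected.
void toy_record_dispatch(ToyContext &ctx, unsigned long tid,
                         const std::vector<long long> &vals) {
    assert(vals.size() == ctx.pending.size());
    for (std::size_t i = 0; i < vals.size(); ++i) {
        ctx.pending[i] += vals[i];
    }
    ++ctx.dispatches[tid];
}

// Mirrors the early exit in the patch excerpt: with zero recorded
// dispatches, the read returns the cached counters unchanged.
const std::vector<long long> &toy_read(ToyContext &ctx, unsigned long tid) {
    int dispatch_count = ctx.dispatches[tid];
    std::fprintf(stderr, "[TOY] read dispatch_count=%d\n", dispatch_count);
    if (dispatch_count == 0) {
        return ctx.counters;
    }
    // ASSUMPTION: on the non-zero path, collected values are folded into
    // the cached counters. The patch excerpt does not show this part.
    for (std::size_t i = 0; i < ctx.counters.size(); ++i) {
        ctx.counters[i] += ctx.pending[i];
        ctx.pending[i] = 0;
    }
    return ctx.counters;
}
```

In this model, any read issued before toy_record_dispatch runs returns the cached zeros, which matches the shape of the first two reads in the log above. Once a dispatch has been recorded, the next read returns the collected values.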
Environment configuration prior to launching reproducer:
# Setup PAPI
export PAPI_ROCM_ROOT=${ROCM_PATH}
export ROCP_METRICS=${PAPI_ROCM_ROOT}/rocprofiler/lib/metrics.xml
export HSA_TOOLS_LIB=${PAPI_ROCM_ROOT}/rocprofiler/lib/librocprofiler64.so
# Set PAPI to use intercept instead of default sampling
export ROCP_HSA_INTERCEPT=1
Let me know if there are any issues getting the reproducer going.
Finding evidence that PAPI ROCm PAPI_read operations are missing results when executed in intercept mode. Sample workflows highlighting what I'm seeing:
Consulting with @gcongiu, this may be expected behavior as:
However, it does not look like synchronizing the streams alone is enough to prevent the undesirable behavior, as I’m still detecting the issue after calling hipDeviceSynchronize before & after each read (after is overkill, but I wanted to be sure). Additionally, finding that pairing the device/stream synchronization with any of the following is also unfruitful:
If possible, it would be ideal if we could find a means of enforcing a synchronization such that the counters could be resolved with each PAPI_read.