icl-utk-edu / papi


PAPI ROCm: Confusion with `HIP_VISIBLE_DEVICES` #73

Open bertwesarg opened 1 year ago

bertwesarg commented 1 year ago

When HIP_VISIBLE_DEVICES is set, the ID in the :device=%d event name suffix is still the hardware device index, not the HIP device index.

The ./sample_multi_kernel_monitoring test always uses :device=0, so running it with HIP_VISIBLE_DEVICES pointing at a different device yields all-zero counts:

$ HIP_VISIBLE_DEVICES=0 ./sample_multi_kernel_monitoring
rocm:::SQ_INSTS_VALU:device=0 : 191459309210
rocm:::SQ_INSTS_SALU:device=0 : 73288502349
rocm:::SQ_WAVES:device=0 : 526324
rocm:::SQ_WAVES_RESTORED:device=0 : 2032
$ HIP_VISIBLE_DEVICES=1 ./sample_multi_kernel_monitoring 
rocm:::SQ_INSTS_VALU:device=0 : 0
rocm:::SQ_INSTS_SALU:device=0 : 0
rocm:::SQ_WAVES:device=0 : 0
rocm:::SQ_WAVES_RESTORED:device=0 : 0
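
To make the mismatch concrete, here is a minimal sketch of how a caller could translate a HIP device index into the index used in the event name. It assumes HIP_VISIBLE_DEVICES contains only plain integer indices and that no other isolation variable is in effect. The helper names `visible_hardware_indices` and `papi_event_name` are hypothetical and not part of PAPI.

```python
import os


def visible_hardware_indices(env=None):
    """Parse HIP_VISIBLE_DEVICES into a list of hardware indices.

    Returns None when the variable is unset (all devices visible).
    Assumption: entries are plain integers; other forms are not handled.
    """
    env = os.environ if env is None else env
    raw = env.get("HIP_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [int(tok) for tok in raw.split(",") if tok.strip()]


def papi_event_name(counter, hip_index, env=None):
    """Build an event name whose :device= suffix is the hardware index."""
    visible = visible_hardware_indices(env)
    # Without the variable, HIP index and hardware index coincide.
    hw_index = hip_index if visible is None else visible[hip_index]
    return f"rocm:::{counter}:device={hw_index}"


# HIP device 0 under HIP_VISIBLE_DEVICES=1 is hardware device 1.
print(papi_event_name("SQ_WAVES", 0, {"HIP_VISIBLE_DEVICES": "1"}))
# prints "rocm:::SQ_WAVES:device=1"
```

With this translation the second run above would request :device=1 instead of :device=0.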

gcongiu commented 1 year ago

Thank you for reporting this. If I understand it correctly, HIP_VISIBLE_DEVICES is a list of physical devices that are visible to the process scheduled to run on the compute node (and it is set by the resource manager). Let’s assume we have devices 4,5,6,7 set. Should papi_native_avail show events for those using device=0,1,2,3 and then remap them internally to 4,5,6,7? Would the behaviour be similar for CUDA?

bertwesarg commented 1 year ago

I'm currently not able to check CUDA. Your understanding is correct, but for ROCm it actually depends on the runtime in use. It looks like rocprofiler sits at the ROCm level, i.e., the same level as ROCm SMI, but each higher-level runtime has its own GPU isolation mechanism. Because PAPI does not know which runtime the application uses, I think the only solution is to document that the PAPI ROCm component expects ROCm SMI device indices and that the application using PAPI needs to take care of the mapping.

In case you are interested, here is how Score-P does this mapping via the device UUID:

  1. Create a mapping from UUID to ROCm SMI device index
  2. For each HIP device index, use the UUID to get the ROCm device index
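
The two steps can be sketched roughly as follows. This is only an illustration of the lookup logic, not Score-P's actual code: the UUID lists are hard-coded placeholders standing in for whatever ROCm SMI and HIP report, and the function names are hypothetical.

```python
def build_uuid_to_smi_index(smi_uuids):
    """Step 1: map each device UUID to its ROCm SMI device index."""
    return {uuid: index for index, uuid in enumerate(smi_uuids)}


def map_hip_to_smi(hip_uuids, uuid_to_smi):
    """Step 2: for each HIP device index, look up the ROCm SMI index by UUID."""
    return [uuid_to_smi[uuid] for uuid in hip_uuids]


# Placeholder UUIDs (assumption): two GPUs known to ROCm SMI,
# of which only the second is visible to HIP.
smi_uuids = ["GPU-aaaa", "GPU-bbbb"]
hip_uuids = ["GPU-bbbb"]

uuid_to_smi = build_uuid_to_smi_index(smi_uuids)
hip_to_smi = map_hip_to_smi(hip_uuids, uuid_to_smi)
print(hip_to_smi)  # prints "[1]": HIP device 0 is ROCm SMI device 1
```

Keying on the UUID avoids any dependence on how a particular runtime renumbers the devices it exposes.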

gcongiu commented 1 year ago

Is ROCR_VISIBLE_DEVICES ever used? It looks like this is the right isolation mechanism for the GPU runtime (including rocprofiler, which relies on HSA for detecting agents). My assumption when I wrote the rocm component was that rocprofiler would only see the GPUs (agents) in the current partition and number them from 0 to N…

bertwesarg commented 1 year ago

ROCR_VISIBLE_DEVICES has no effect on rocm-smi on my side:

$ ROCR_VISIBLE_DEVICES=0 rocm-smi 

========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    41.0c           43.0W   800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
1    43.0c           42.0W   800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
====================================================================================
=============================== End of ROCm SMI Log ================================

bertwesarg commented 1 year ago

It works for rocminfo, though:

$ ROCR_VISIBLE_DEVICES=0 rocminfo | grep -A 1 '^  Name:'
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    gfx90a                             
  Uuid:                    GPU-f43096f78d390147               
$ ROCR_VISIBLE_DEVICES=1 rocminfo | grep -A 1 '^  Name:'
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    AMD EPYC 7702 64-Core Processor    
  Uuid:                    CPU-XX                             
--
  Name:                    gfx90a                             
  Uuid:                    GPU-5310b8602059ef91               

bertwesarg commented 1 year ago

So I think PAPI does not need to do anything in the code at the moment; it just needs to be made clear that the component expects the HSA-level device index, not the HIP/HCA/OpenCL/OpenMP Target device index and not the SMI/kernel-level device index.

gcongiu commented 1 year ago

Agree. What would be the right way of making this clear, in your opinion? Add a comment to the component README?

bertwesarg commented 1 year ago

Yeah, that is probably the best place. It looks like the PAPI device index is derived from hsa_iterate_agents, so just mention this too.
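
For the README, a small sketch of what "HSA-level device index" means in practice might help. It assumes that rocminfo lists agents in the same order in which the component encounters them, which I have not verified, and it simply numbers the GPU agents in order of appearance. The function name and the abbreviated sample text are for illustration only.

```python
def gpu_agents_in_order(rocminfo_text):
    """Return GPU UUIDs in order of appearance; list position = device index."""
    uuids = []
    for line in rocminfo_text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        # CPU agents report "CPU-XX"; only GPU agents are counted.
        if key.strip() == "Uuid" and value.startswith("GPU-"):
            uuids.append(value)
    return uuids


# Abbreviated output of: ROCR_VISIBLE_DEVICES=1 rocminfo | grep -A 1 '^  Name:'
sample = """\
  Name:                    AMD EPYC 7702 64-Core Processor
  Uuid:                    CPU-XX
--
  Name:                    gfx90a
  Uuid:                    GPU-5310b8602059ef91
"""

for index, uuid in enumerate(gpu_agents_in_order(sample)):
    print(f"device={index} -> {uuid}")
# prints "device=0 -> GPU-5310b8602059ef91"
```

Under ROCR_VISIBLE_DEVICES=1 the only visible GPU would therefore be addressed as device=0.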