rohitdwivedula commented 1 month ago

🐛 Describe the bug

Summary: Device information, correlation IDs, and the bytes field are missing in torch.profiler.profile JSON dumps when this profiling class is used on AMD GPUs.

test_kineto.py is a script that (i) creates a tensor on each GPU, (ii) performs an all-reduce (SUM) across all ranks, (iii) captures all CPU and CUDA events with the Kineto profiler, and (iv) dumps the trace to JSON. When this script runs on AMD GPUs, the JSON is missing attributes that are present when the same profiler runs on Nvidia GPUs.

I ran the test_kineto.py script on (a) two Nvidia A100 80GB GPUs and (b) two AMD Instinct MI210 GPUs, using Python 3.9.19 and PyTorch 2. The output JSON files from these runs are available here (AMD Instinct MI210) and here (Nvidia A100).
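One way to see every difference between the two dumps at once is to compare the `args` keys per event category. The sketch below is not part of test_kineto.py; it assumes both files use the Chrome trace layout with a top-level `traceEvents` list, and the helper names and file names are hypothetical.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load a profiler JSON dump from disk."""
    with open(path) as f:
        return json.load(f)


def arg_keys_by_category(trace):
    """Map each event category to the set of keys seen in its "args"."""
    keys = defaultdict(set)
    for event in trace.get("traceEvents", []):
        if not isinstance(event, dict):
            continue
        args = event.get("args") or {}
        keys[event.get("cat", "<none>")].update(args.keys())
    return dict(keys)


def missing_arg_keys(reference_trace, candidate_trace):
    """Return {category: sorted keys present in reference but absent in candidate}.

    A category that does not occur in the candidate at all reports every
    reference key as missing.
    """
    reference = arg_keys_by_category(reference_trace)
    candidate = arg_keys_by_category(candidate_trace)
    diff = {}
    for category, reference_keys in reference.items():
        gap = reference_keys - candidate.get(category, set())
        if gap:
            diff[category] = sorted(gap)
    return diff


# Example usage (file names are placeholders, not the attached files):
#   nvidia = load_trace("trace_a100.json")
#   amd = load_trace("trace_mi210.json")
#   for category, keys in missing_arg_keys(nvidia, amd).items():
#       print(category, keys)
```

Event names differ between the two backends (for example "Memcpy HtoD (Pageable -> Device)" versus "CopyHostToDevice"), so the comparison groups by `cat` rather than by `name`.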

1. Missing `bytes` field

For example, this is what a gpu_memcpy operation in the Nvidia trace looks like:

  {
    "ph": "X", "cat": "gpu_memcpy", "name": "Memcpy HtoD (Pageable -> Device)", "pid": 0, "tid": 142,
    "ts": 1720537333824533, "dur": 1,
    "args": {
      "External id": 334,
      "device": 0, "context": 1,
      "stream": 142, "correlation": 334,
      "bytes": 8, "memory bandwidth (GB/s)": 0.005813953488372093
    }
  }

On AMD GPUs, a similar gpu_memcpy item in the JSON file looks like this:

  {
    "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
    "ts": 1720537542569197, "dur": 32,
    "args": {
      "External id": 131
    }
  }

The bytes field is missing in the latter, along with device, context, stream, correlation, and memory bandwidth (GB/s). The same problem exists for all gpu_memset operations as well.
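To check a whole trace rather than a single event, a small filter can list every copy or memset event that lacks `bytes`. This is a minimal sketch: the function name is hypothetical, and the two sample events are trimmed copies of the ones shown above.

```python
def copies_missing_bytes(events, categories=("gpu_memcpy", "gpu_memset")):
    """Return events in the given categories whose "args" has no "bytes" key."""
    flagged = []
    for event in events:
        if not isinstance(event, dict) or event.get("cat") not in categories:
            continue
        args = event.get("args") or {}
        if "bytes" not in args:
            flagged.append(event)
    return flagged


# Trimmed versions of the two gpu_memcpy events from this report.
SAMPLE_EVENTS = [
    {"ph": "X", "cat": "gpu_memcpy", "name": "Memcpy HtoD (Pageable -> Device)",
     "args": {"External id": 334, "correlation": 334, "bytes": 8}},
    {"ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice",
     "args": {"External id": 131}},
]

for event in copies_missing_bytes(SAMPLE_EVENTS):
    print("missing bytes:", event["name"])
```

On a real dump, pass `trace["traceEvents"]` instead of `SAMPLE_EVENTS`.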

2. Missing device information

The A100 Kineto trace contains a deviceProperties field, which is completely absent from the AMD trace:

  "deviceProperties": [
    {
      "id": 0, "name": "NVIDIA A100 80GB PCIe", "totalGlobalMem": 84989575168,
      "computeMajor": 8, "computeMinor": 0,
      "maxThreadsPerBlock": 1024, "maxThreadsPerMultiprocessor": 2048,
      "regsPerBlock": 65536, "regsPerMultiprocessor": 65536, "warpSize": 32,
      "sharedMemPerBlock": 49152, "sharedMemPerMultiprocessor": 167936,
      "numSms": 108, "sharedMemPerBlockOptin": 166912
    },
    {
      "id": 1, "name": "NVIDIA A100 80GB PCIe", "totalGlobalMem": 84989575168,
      "computeMajor": 8, "computeMinor": 0,
      "maxThreadsPerBlock": 1024, "maxThreadsPerMultiprocessor": 2048,
      "regsPerBlock": 65536, "regsPerMultiprocessor": 65536, "warpSize": 32,
      "sharedMemPerBlock": 49152, "sharedMemPerMultiprocessor": 167936,
      "numSms": 108, "sharedMemPerBlockOptin": 166912
    }
  ]
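A quick way to confirm whether a given dump carries this block is to look it up on the loaded JSON object. The sketch below assumes `deviceProperties` sits at the top level of the trace next to `traceEvents`; the function name is hypothetical.

```python
def describe_devices(trace):
    """Return one line per device, or a single line noting the block is absent."""
    devices = trace.get("deviceProperties")
    if not devices:
        return ["deviceProperties: missing"]
    return [f"GPU {d.get('id')}: {d.get('name')}" for d in devices]


# Trimmed stand-ins for the two traces in this report.
nvidia_like = {
    "deviceProperties": [
        {"id": 0, "name": "NVIDIA A100 80GB PCIe"},
        {"id": 1, "name": "NVIDIA A100 80GB PCIe"},
    ],
    "traceEvents": [],
}
amd_like = {"traceEvents": []}

for line in describe_devices(nvidia_like) + describe_devices(amd_like):
    print(line)
```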

3. Correlation ID

Each entry in the JSON generated on Nvidia GPUs contains a correlation ID:

  {
    "ph": "X", "cat": "cuda_runtime", "name": "cudaStreamWaitEvent", "pid": 2012624, "tid": 1142494784,
    "ts": 1720537333825191, "dur": 1,
    "args": {
      "External id": 350,
      "cbid": 147, "correlation": 350
    }
  }

This correlation field is not present in any of the events in the AMD GPU dump. This matters for downstream tools: for example, the MLCommons/chakra repository uses the torch.profiler.profile class and assumes in multiple places that the correlation field is present in the JSON (e.g. here).
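For tools that consume these traces, it can help to measure how many events carry `correlation` and to read the field defensively instead of indexing it directly. The following is only an illustrative sketch with hypothetical function names; it is not how chakra reads the trace.

```python
from collections import Counter


def get_correlation(event, default=None):
    """Read args["correlation"] without raising KeyError when it is absent."""
    args = event.get("args") or {}
    return args.get("correlation", default)


def correlation_coverage(events):
    """Return {category: (events_with_correlation, total_events)} for "X" events."""
    total = Counter()
    with_correlation = Counter()
    for event in events:
        if not isinstance(event, dict) or event.get("ph") != "X":
            continue
        category = event.get("cat", "<none>")
        total[category] += 1
        if get_correlation(event) is not None:
            with_correlation[category] += 1
    return {category: (with_correlation[category], total[category]) for category in total}


# Trimmed sample events: one runtime call from the Nvidia trace and one
# AMD copy event from this report.
SAMPLE_EVENTS = [
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaStreamWaitEvent",
     "args": {"External id": 350, "cbid": 147, "correlation": 350}},
    {"ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice",
     "args": {"External id": 131}},
]

for category, (found, count) in correlation_coverage(SAMPLE_EVENTS).items():
    print(f"{category}: {found}/{count} events have a correlation ID")
```

A consumer that receives `None` from `get_correlation` can then decide whether to skip the event or report it, rather than failing on a missing key.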

Versions

I used the rocm/pytorch:latest Docker image (image id: b80124b96134) from DockerHub. Output of collect_env.py:

PyTorch version: 2.3.0a0+gitae01701
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.1.40091-a8dbc0c19

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.0 24103 7db7f5e49612030319346f900c08f474b1f9023a)
CMake version: version 3.26.4
Libc version: glibc-2.31

Python version: 3.9.19 (main, Mar 21 2024, 17:11:28)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-112-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI210 (gfx90a:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.1.40091
MIOpen runtime version: 3.1.0
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      52 bits physical, 57 bits virtual
CPU(s):                             256
On-line CPU(s) list:                0-255
Thread(s) per core:                 2
Core(s) per socket:                 64
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              17
Model name:                         AMD EPYC 9554 64-Core Processor
Stepping:                           1
Frequency boost:                    enabled
CPU MHz:                            1500.000
CPU max MHz:                        3762.9880
CPU min MHz:                        1500.0000
BogoMIPS:                           6200.22
Virtualization:                     AMD-V
L1d cache:                          4 MiB
L1i cache:                          4 MiB
L2 cache:                           128 MiB
L3 cache:                           512 MiB
NUMA node0 CPU(s):                  0-63,128-191
NUMA node1 CPU(s):                  64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d

Versions of relevant libraries:
[pip3] mypy==1.7.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.20.3
[pip3] optree==0.9.1
[pip3] torch==2.3.0a0+gitae01701
[pip3] torchvision==0.18.0a0+6f0deb9
[pip3] triton==2.1.0
[conda] No relevant packages

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @robieta @chaekit @aaronenyeshi @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise

hongxiayang commented 1 month ago

cc @mwootton

briancoutinho commented 1 week ago

I'm not too familiar with the AMD plugin code, but I dug in a bit. Some of the missing bits might already be present in later versions; could you try the PyTorch nightly?

Here is what I found:

1) Device properties: This is already supported for HIP/AMD devices. https://github.com/pytorch/kineto/pull/927 (May 2024)

2) Memcpy bytes: It looks like memcpy events have a "size" field that should carry this (the metadataJson() function adds the extra fields you see in the trace). https://github.com/pytorch/kineto/blob/main/libkineto/src/RoctracerActivity_inl.h#L161-L170

3) Correlation ID: I think this should be present on all runtime operations as per the latest code, though we may need to investigate it in detail (see metadataJson() again). https://github.com/pytorch/kineto/blob/main/libkineto/src/RoctracerActivity_inl.h#L187-L190

I heard that some collective events are missing the correlation ID. Most likely any fixes will need to be made in the AMD ROCTracer, so @mwootton / @hongxiayang might be able to help out.

pytorch / pytorch

`torch.profiler.profile` missing attributes on AMD GPUs #130560
