mlcommons / chakra

Repository for MLCommons Chakra schema and tools
https://mlcommons.org/working-groups/research/chakra/
Apache License 2.0

Warnings in `trace_link.py` when running Chakra on AMD GPUs #128

Open rohitdwivedula opened 1 month ago

rohitdwivedula commented 1 month ago

When using chakra_trace_link on AMD Instinct MI210 GPUs, a bunch of warnings crop up when linking the Kineto and ET JSON files:

[2024-07-11 19:37:52,768] trace_linker.py:721 [WARNING]: Missing parent CPU operator for GPU op 'CopyHostToDevice'. Orphaned GPU operator.
[2024-07-11 19:37:52,768] trace_linker.py:775 [WARNING]: No CUDA runtime operator found for correlation ID -1. This is not a common case, and there should be a corresponding CUDA runtime operator for a given GPU kernel operator. It can be a case where CUDA runtime operators are not properly identified and added to the map, kineto_correlation_cuda_runtime_map. Please manually check if the corresponding CUDA runtime operator with the correlation is dropped by mistake. It is likely that it is because of incomplete map, cuda_launch_operations, in is_cuda_launch_op. Please update the map properly to cover all CUDA runtime launch operators.
[2024-07-11 19:37:52,768] trace_linker.py:721 [WARNING]: Missing parent CPU operator for GPU op 'void at::native::modern::elementwise_kernel<at::native::mse_kernel_cuda(at::TensorIteratorBase&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3> >(int, at::native::mse_kernel_cuda(at::TensorIteratorBase&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(float, float)#1}, at::detail::Array<char*, 3>)'. Orphaned GPU operator.
[2024-07-11 19:37:52,768] trace_linker.py:775 [WARNING]: No CUDA runtime operator found for correlation ID -1. This is not a common case, and there should be a corresponding CUDA runtime operator for a given GPU kernel operator. It can be a case where CUDA runtime operators are not properly identified and added to the map, kineto_correlation_cuda_runtime_map. Please manually check if the corresponding CUDA runtime operator with the correlation is dropped by mistake. It is likely that it is because of incomplete map, cuda_launch_operations, in is_cuda_launch_op. Please update the map properly to cover all CUDA runtime launch operators.

Steps to Reproduce

  1. Copy the code from toy_model_train.py to your local AMD GPU setup.
  2. Run python3 toy_model_train.py
  3. Two files KINETO_demo.json and ET_demo.json are generated.
  4. Attempt running chakra_trace_link --pytorch-et-file ET_demo.json --kineto-file KINETO_demo.json --output-file LINKED.json
  5. You will see the warnings seen above.
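
To see which events are behind the warnings, it can help to count how many events in each category lack a correlation field. The following is a minimal sketch, not part of Chakra: the helper name summarize_correlation is hypothetical, and the two sample events are only stand-ins shaped like Kineto entries. It assumes the Kineto JSON keeps its events under a top-level traceEvents key.

```python
from collections import Counter


def summarize_correlation(trace_events):
    """Map each event category to (events missing "correlation", total events)."""
    missing = Counter()
    total = Counter()
    for event in trace_events:
        cat = event.get("cat", "<none>")
        total[cat] += 1
        # Some events have no "args" at all; treat that as an empty dict.
        if "correlation" not in (event.get("args") or {}):
            missing[cat] += 1
    return {cat: (missing[cat], total[cat]) for cat in total}


# Stand-in events; for a real trace, load them with
# json.load(open("KINETO_demo.json"))["traceEvents"].
sample_events = [
    {"cat": "cuda_runtime", "name": "cudaStreamWaitEvent",
     "args": {"External id": 350, "correlation": 350}},
    {"cat": "gpu_memcpy", "name": "CopyHostToDevice",
     "args": {"External id": 131}},
]

for cat, (n_missing, n_total) in summarize_correlation(sample_events).items():
    print(f"{cat}: {n_missing}/{n_total} events without correlation")
```

If GPU categories such as gpu_memcpy show every event as missing the field, the trace is likely affected by the problem described in this issue.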

Environment Details

Possible Causes

  1. Chakra assumes that PyTorch Kineto traces contain the correlation field in each JSON event. On AMD GPUs, however, the Kineto traces do not contain this field (see this PyTorch issue for more information).
  2. In the is_cuda_launch_op function (link), the cuda_launch_operations list does not contain operation names such as hipLaunchKernel.

rohitdwivedula commented 1 month ago

To address possible cause 1: in Nvidia Kineto traces, each entry in the JSON file contains two fields, correlation and External id, and they always appear to hold the same value, e.g.:

  {
    "ph": "X", "cat": "cuda_runtime", "name": "cudaStreamWaitEvent", "pid": 2012624, "tid": 1142494784,
    "ts": 1720537333825191, "dur": 1,
    "args": {
      "External id": 350,
      "cbid": 147, "correlation": 350
    }
  }

AMD traces look like this:

  {
    "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
    "ts": 1720537542569197, "dur": 32,
    "args": {
      "External id": 131
    }
  }

It is unclear if External id == correlation always, but in all of the Nvidia traces I've seen so far they have never been different. If they are, indeed, always the same, we could modify the trace_link script to use the External id as a fallback in case correlation is not found as a field.
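
A per-event fallback of this kind could be sketched as follows. This is an illustration only, not code from trace_link: the helper name correlation_or_external_id is hypothetical, and returning -1 when neither field exists is an assumption based on the "correlation ID -1" that appears in the warnings. It also relies on the unverified premise that External id and correlation carry the same value.

```python
def correlation_or_external_id(event):
    """Return the event's correlation ID, falling back to "External id".

    Returns -1 (an assumed "missing" sentinel) when neither field is present.
    """
    args = event.get("args") or {}
    if "correlation" in args:
        return args["correlation"]
    return args.get("External id", -1)


# The two events quoted above, trimmed to the relevant fields.
nvidia_event = {
    "cat": "cuda_runtime", "name": "cudaStreamWaitEvent",
    "args": {"External id": 350, "cbid": 147, "correlation": 350},
}
amd_event = {
    "cat": "gpu_memcpy", "name": "CopyHostToDevice",
    "args": {"External id": 131},
}

print(correlation_or_external_id(nvidia_event))  # 350
print(correlation_or_external_id(amd_event))     # 131
```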

TaekyungHeo commented 1 month ago

Thanks for sharing this, @rohitdwivedula. We had a chat with the PyTorch profiler team, and they advised us to use the correlation ID to link GPU operators with the launcher operators.

Previously, we used the external ID for linking CPU operators in a Chakra host trace and a Chakra device trace. It turned out that the external ID field is not stable, so we are currently using the rf_id field.

rohitdwivedula commented 1 month ago

Hi @TaekyungHeo, I am hoping to open a PR to fix this issue and had a quick question. Currently, PyTorch's Kineto traces on AMD GPUs do not contain correlation IDs at all; we opened an issue on the PyTorch repo for this. In the interim, we have been manually postprocessing the Kineto JSON produced by torch.profiler.profile, adding a new correlation field equal to the External id field. Essentially, we modify each entry in the Kineto JSON from this:

  {
    "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
    "ts": 1720537542569197, "dur": 32,
    "args": {
      "External id": 131
    }
  }

to this:

  {
    "ph": "X", "cat": "gpu_memcpy", "name": "CopyHostToDevice", "pid": 2, "tid": 0,
    "ts": 1720537542569197, "dur": 32,
    "args": {
      "External id": 131, "correlation": 131
    }
  }

After making this one change to the JSON, we ran chakra_trace_link from our fork of Chakra on a number of models, and no warnings were generated.

Question: would it be possible for us to upstream the change in our fork (essentially adding all hipLaunch operators to the codebase) using either option 1 or option 2 (described below) while we wait on PyTorch to fix the lack of correlation field in AMD Kineto traces?
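
For context, the launch-operator part of that change might look roughly like the sketch below. It is not the code from the fork or from Chakra: the function name is_gpu_launch_op and both name sets are hypothetical and deliberately non-exhaustive. The entries are CUDA and HIP runtime API names, but which of them actually appear in a given Kineto trace would need to be checked against real traces.

```python
# Illustrative, non-exhaustive sets; NOT Chakra's actual cuda_launch_operations.
CUDA_LAUNCH_OPS = {
    "cudaLaunchKernel",
    "cudaMemcpyAsync",
    "cudaMemsetAsync",
}
# HIP counterparts that an AMD trace may report instead (assumed mapping).
HIP_LAUNCH_OPS = {
    "hipLaunchKernel",
    "hipExtLaunchKernel",
    "hipModuleLaunchKernel",
    "hipMemcpyAsync",
    "hipMemsetAsync",
}


def is_gpu_launch_op(name):
    """Return True if a runtime event name is a known CUDA or HIP launch call."""
    return name in CUDA_LAUNCH_OPS or name in HIP_LAUNCH_OPS


for op in ("cudaLaunchKernel", "hipLaunchKernel", "cudaStreamWaitEvent"):
    print(op, is_gpu_launch_op(op))
```

Keeping the HIP names in a separate set makes it easy to see which entries were added for AMD and to extend them as new operator names turn up.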

Option 1

We add a section to the documentation describing the hacky fix mentioned above for AMD hardware. Before passing the Kineto trace file to chakra_trace_link, just pass it through a function like this:

import json

def process_kineto_file(infile, outfile):
    # Load the Kineto trace produced by the PyTorch profiler.
    with open(infile, 'r') as f:
        data = json.load(f)

    # Copy "External id" into "correlation" wherever correlation is missing.
    for event in data['traceEvents']:
        args = event.get('args', {})
        if 'External id' in args and 'correlation' not in args:
            args['correlation'] = args['External id']

    with open(outfile, 'w') as f:
        json.dump(data, f, indent=2)

Option 2

Inside chakra_trace_link, we add an extra code path that uses External id instead of correlation if (1) the trace is an AMD trace, and (2) no correlation IDs are found in the entire file.
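
The decision logic for this option could be sketched as below. This is a hedged illustration, not Chakra code: the function names are hypothetical, and detecting an AMD trace by looking for event names that start with "hip" is a heuristic assumed here, since the thread does not say how an AMD trace would be identified.

```python
def has_any_correlation(trace_events):
    """True if at least one event carries a "correlation" field."""
    return any("correlation" in (e.get("args") or {}) for e in trace_events)


def looks_like_amd_trace(trace_events):
    """Heuristic (assumption): HIP runtime calls such as hipLaunchKernel appear."""
    return any(str(e.get("name", "")).startswith("hip") for e in trace_events)


def pick_link_key(trace_events):
    """Choose which args field to link on: "correlation" or "External id"."""
    if has_any_correlation(trace_events):
        return "correlation"
    if looks_like_amd_trace(trace_events):
        return "External id"
    # Neither condition holds: keep the default and let the existing warnings fire.
    return "correlation"


# Minimal stand-in events, reduced to the fields the checks look at.
amd_events = [
    {"name": "hipLaunchKernel", "args": {"External id": 131}},
    {"name": "CopyHostToDevice", "args": {"External id": 131}},
]
nvidia_events = [
    {"name": "cudaStreamWaitEvent", "args": {"External id": 350, "correlation": 350}},
]

print(pick_link_key(amd_events))     # External id
print(pick_link_key(nvidia_events))  # correlation
```

Requiring both conditions keeps the Nvidia path untouched, so the fallback only takes effect for traces that would otherwise produce the warnings on every GPU operator.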

srinivas212 commented 2 weeks ago

Thanks for raising this issue, @rohitdwivedula. I prefer option 1 mainly because this issue needs to be fixed in PyTorch. We had faced a ton of issues around this problem in the past and needed to make sure Kineto was doing the right thing for consistent behavior. Simple traces would work but more complex ones would fail.