mlcommons / chakra

Repository for MLCommons Chakra schema and tools
https://mlcommons.org/working-groups/research/chakra/
Apache License 2.0

Fix identification and metadata encoding for SEND and RECV nodes in PyTorchConverter #112

Closed TaekyungHeo closed 5 days ago

TaekyungHeo commented 6 days ago

Summary

This pull request fixes bugs in how PyTorchConverter identifies SEND and RECV nodes, encodes metadata for communication nodes, and calculates communication sizes.

Changes Made:

  1. Update PyTorchConverter to Properly Identify SEND and RECV Nodes:

    • Modified the get_chakra_node_type_from_pytorch_node method to ensure only GPU operators are identified as communication operators. Previously, CPU nodes like c10d:send and c10d:recv were incorrectly identified as SEND or RECV nodes.
    • Improved the identification of SEND or RECV nodes by using the ncclDevKernel_SendRecv keyword and determining the type based on the parent or grandparent node name. If the parent node is record_param_comms, the grandparent node name is used; otherwise, the parent node name suffices.
  2. Fix Bug in Encoding Metadata for Communication Nodes:

    • Fixed a bug in encoding metadata where the wrong node (chakra_node instead of chakra_gpu_node) was checked for being a communication operator.
  3. Support List of Tensors in Calculating Communication Size:

    • Enhanced the comm_size calculation in the PyTorchNode class to handle lists of tensors. Previously, only single tensors were supported, but now the calculation iterates through lists of tensors, summing their sizes correctly.
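
The identification rule (change 1) and the list-aware size calculation (change 3) can be illustrated with a minimal sketch. This is not the code from the pull request: `Node`, `TensorInfo`, `classify_comm_node`, and the standalone `comm_size` function are hypothetical stand-ins, and the substring match on "send" / "recv" in the ancestor name is an assumption about how the direction is derived.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Optional, Tuple, Union


@dataclass
class Node:
    """Hypothetical stand-in for a trace node; not Chakra's PyTorchNode."""
    name: str
    is_gpu: bool
    parent: Optional["Node"] = None


def classify_comm_node(node: Node) -> Optional[str]:
    """Return "SEND", "RECV", or None for any other node."""
    # Only GPU kernels qualify; CPU ops such as c10d:send are skipped.
    if not node.is_gpu or "ncclDevKernel_SendRecv" not in node.name:
        return None
    source = node.parent
    # record_param_comms is a wrapper, so look one level further up.
    if source is not None and source.name == "record_param_comms":
        source = source.parent
    if source is None:
        return None
    # Assumption: the direction appears as a substring of the ancestor name.
    lowered = source.name.lower()
    if "send" in lowered:
        return "SEND"
    if "recv" in lowered:
        return "RECV"
    return None


@dataclass
class TensorInfo:
    """Hypothetical tensor descriptor: shape plus bytes per element."""
    shape: Tuple[int, ...]
    elem_bytes: int

    def nbytes(self) -> int:
        return prod(self.shape) * self.elem_bytes


def comm_size(payload: Union[TensorInfo, List[TensorInfo]]) -> int:
    """Total bytes for a single tensor or a list of tensors."""
    tensors = payload if isinstance(payload, list) else [payload]
    return sum(t.nbytes() for t in tensors)
```

Under these assumptions, a CPU node named c10d:send yields None, while a GPU ncclDevKernel_SendRecv kernel whose parent is record_param_comms takes its type from the grandparent. Passing a list of two tensors to `comm_size` returns the sum of both sizes rather than only the first.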

Test Plan

$ pip install .               
Processing /Users/theo/chakra
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: protobuf==4.* in /Users/theo/venv/lib/python3.10/site-packages (from chakra==0.0.4) (4.23.4)
Requirement already satisfied: graphviz in /Users/theo/venv/lib/python3.10/site-packages (from chakra==0.0.4) (0.20.1)
Requirement already satisfied: networkx in /Users/theo/venv/lib/python3.10/site-packages (from chakra==0.0.4) (3.2.1)
Requirement already satisfied: pydot in /Users/theo/venv/lib/python3.10/site-packages (from chakra==0.0.4) (2.0.0)
Requirement already satisfied: pyparsing>=3 in /Users/theo/venv/lib/python3.10/site-packages (from pydot->chakra==0.0.4) (3.1.1)
Building wheels for collected packages: chakra
  Building wheel for chakra (pyproject.toml) ... done
  Created wheel for chakra: filename=chakra-0.0.4-py3-none-any.whl size=52255 sha256=866ce2a4a10231eb98129958f2d2ae83a79de256478fc77cc496b7295d2158c7
  Stored in directory: /private/var/folders/z0/c9mq5j4s6n14n0_gs7nlt6mc0000gp/T/pip-ephem-wheel-cache-78s9_jin/wheels/fa/dc/75/2163b4163bf1e9cf3c7f1cf69fa03716fb707b8c4f5cb271e8
Successfully built chakra
Installing collected packages: chakra
  Attempting uninstall: chakra
    Found existing installation: chakra 0.0.4
    Uninstalling chakra-0.0.4:
      Successfully uninstalled chakra-0.0.4
Successfully installed chakra-0.0.4

$ python3 ci_tools/integration_tests.py --tgz_path tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz --num_ranks 8 --tolerance 0.05 --expected_times_ms 14597 14597 14968 14638 14649 14700 14677 14735
Extracting tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz to tests/data/1.0.2-chakra.0.0.4
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_0.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_0.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_0.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_1.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_1.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_1.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_2.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_2.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_2.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_3.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_3.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_3.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_4.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_4.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_4.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_5.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_5.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_5.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_6.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_6.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_6.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_7.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_7.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_7.json
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_0.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_0.chakra --input_type PyTorch --log_filename /tmp/rank_0.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_1.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_1.chakra --input_type PyTorch --log_filename /tmp/rank_1.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_3.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_3.chakra --input_type PyTorch --log_filename /tmp/rank_3.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_4.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_4.chakra --input_type PyTorch --log_filename /tmp/rank_4.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_5.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_5.chakra --input_type PyTorch --log_filename /tmp/rank_5.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_2.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_2.chakra --input_type PyTorch --log_filename /tmp/rank_2.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_6.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_6.chakra --input_type PyTorch --log_filename /tmp/rank_6.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_7.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_7.chakra --input_type PyTorch --log_filename /tmp/rank_7.log
Validation successful for /tmp/rank_0.log: 14802300us is within the acceptable range.
Validation successful for /tmp/rank_1.log: 14785782us is within the acceptable range.
Validation successful for /tmp/rank_2.log: 15233261us is within the acceptable range.
Validation successful for /tmp/rank_3.log: 14878058us is within the acceptable range.
Validation successful for /tmp/rank_4.log: 14892945us is within the acceptable range.
Validation successful for /tmp/rank_5.log: 14993779us is within the acceptable range.
Validation successful for /tmp/rank_6.log: 14936348us is within the acceptable range.
Validation successful for /tmp/rank_7.log: 15031147us is within the acceptable range.
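
For readers checking the numbers, the following sketch reproduces the pass/fail decision under one assumption: that the test compares each reported time (in microseconds) with the corresponding --expected_times_ms value using a relative tolerance. The actual check in ci_tools/integration_tests.py may be implemented differently, and `within_tolerance` is a hypothetical helper.

```python
def within_tolerance(actual_us: int, expected_ms: int, tolerance: float) -> bool:
    """Assumed rule: relative deviation from the expected time <= tolerance."""
    expected_us = expected_ms * 1000  # expected times are given in milliseconds
    return abs(actual_us - expected_us) <= tolerance * expected_us


# Values copied from the command line and the validation log above.
EXPECTED_MS = [14597, 14597, 14968, 14638, 14649, 14700, 14677, 14735]
ACTUAL_US = [14802300, 14785782, 15233261, 14878058,
             14892945, 14993779, 14936348, 15031147]
TOLERANCE = 0.05

results = [within_tolerance(a, e, TOLERANCE) for a, e in zip(ACTUAL_US, EXPECTED_MS)]
for rank, ok in enumerate(results):
    print(f"rank {rank}: {'within' if ok else 'outside'} tolerance")
```

With a relative rule like this, every rank in the log deviates from its expected time by roughly 1 to 2 percent, which is consistent with all eight validations passing at a tolerance of 0.05.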