mlcommons / chakra

Repository for MLCommons Chakra schema and tools
https://mlcommons.org/working-groups/research/chakra/
Apache License 2.0

Remove get_start_timestamp_for_gpu_op from trace_linker.py #70

Closed TaekyungHeo closed 1 month ago

TaekyungHeo commented 1 month ago

Summary

Remove get_start_timestamp_for_gpu_op from trace_linker.py. To identify the correct parent CPU operator, find_parent_cpu_op needs the timestamp of a GPU operator, and that timestamp is determined by the CUDA launcher operator that launched the GPU operator. We previously obtained it with get_start_timestamp_for_gpu_op, but the function turned out to be both unnecessary and buggy, so this PR removes it. The bug stemmed from a limitation of external IDs: we used external IDs to match a GPU operator with a CPU operator, but external IDs are not guaranteed to match. The correlation field appears to be a more reliable way to associate a GPU operator with its CUDA launcher operator.
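For readers unfamiliar with the approach, the correlation-based matching described above can be illustrated roughly as follows. This is a minimal sketch, not the code in trace_linker.py: the Op class, its fields, and both helper functions are hypothetical names introduced for illustration, and the rule of picking the innermost enclosing CPU operator is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Op:
    """Hypothetical, simplified trace event (not Chakra's actual class)."""
    name: str
    ts: int                            # start timestamp in microseconds
    dur: int = 0                       # duration in microseconds
    correlation: Optional[int] = None  # shared by a GPU op and its launcher


def index_launchers_by_correlation(launchers: List[Op]) -> Dict[int, Op]:
    """Map each correlation value to the CUDA launcher op that carries it."""
    return {op.correlation: op for op in launchers if op.correlation is not None}


def find_parent_cpu_op_sketch(
    gpu_op: Op, launchers_by_corr: Dict[int, Op], cpu_ops: List[Op]
) -> Optional[Op]:
    """Find a parent CPU op for gpu_op using the launcher's timestamp.

    Assumption: the parent is the latest-starting CPU op whose time
    interval encloses the launcher's start timestamp.
    """
    launcher = launchers_by_corr.get(gpu_op.correlation)
    if launcher is None:
        return None  # no launcher shares this correlation value
    enclosing = [op for op in cpu_ops if op.ts <= launcher.ts <= op.ts + op.dur]
    if not enclosing:
        return None
    return max(enclosing, key=lambda op: op.ts)


# Made-up example events, for illustration only.
cpu_ops = [Op("aten::linear", ts=100, dur=50), Op("aten::mm", ts=110, dur=20)]
launchers = [Op("cudaLaunchKernel", ts=115, dur=5, correlation=42)]
gpu_op = Op("gemm_kernel", ts=300, dur=10, correlation=42)

parent = find_parent_cpu_op_sketch(
    gpu_op, index_launchers_by_correlation(launchers), cpu_ops
)
```

The GPU operator's own timestamp (300 here) is never consulted; only the launcher found through the shared correlation value is used to locate the parent.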

Test Plan

$ python3 ci_tools/integration_tests.py --tgz_path tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz --num_ranks 8 --tolerance 0.05 --expected_times_ms 14597 14597 14968 14638 14649 14700 14677 14735
Extracting tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz to tests/data/1.0.2-chakra.0.0.4
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_0.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_0.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_0.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_1.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_1.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_1.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_2.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_2.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_2.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_3.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_3.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_3.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_4.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_4.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_4.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_5.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_5.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_5.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_6.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_6.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_6.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_7.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_7.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_7.json
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_0.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_0.chakra --input_type PyTorch --log_filename /tmp/rank_0.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_1.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_1.chakra --input_type PyTorch --log_filename /tmp/rank_1.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_3.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_3.chakra --input_type PyTorch --log_filename /tmp/rank_3.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_2.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_2.chakra --input_type PyTorch --log_filename /tmp/rank_2.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_4.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_4.chakra --input_type PyTorch --log_filename /tmp/rank_4.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_6.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_6.chakra --input_type PyTorch --log_filename /tmp/rank_6.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_5.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_5.chakra --input_type PyTorch --log_filename /tmp/rank_5.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_7.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_7.chakra --input_type PyTorch --log_filename /tmp/rank_7.log
Validation successful for /tmp/rank_0.log: 14802300us is within the acceptable range.
Validation successful for /tmp/rank_1.log: 14785782us is within the acceptable range.
Validation successful for /tmp/rank_2.log: 15233261us is within the acceptable range.
Validation successful for /tmp/rank_3.log: 14878058us is within the acceptable range.
Validation successful for /tmp/rank_4.log: 14892945us is within the acceptable range.
Validation successful for /tmp/rank_5.log: 14993779us is within the acceptable range.
Validation successful for /tmp/rank_6.log: 14936348us is within the acceptable range.
Validation successful for /tmp/rank_7.log: 15031147us is within the acceptable range.
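
As a sanity check on the numbers above, the per-rank comparison can be reproduced by hand. This is a sketch under an assumption: it treats --tolerance 0.05 as a relative bound on the difference between the measured time and the expected time for that rank, which may not be exactly how ci_tools/integration_tests.py computes it. The within_tolerance helper is a hypothetical name.

```python
from typing import List


def within_tolerance(measured_us: int, expected_ms: int, tolerance: float) -> bool:
    """Assumed check: |measured - expected| <= tolerance * expected, in ms."""
    measured_ms = measured_us / 1000.0
    return abs(measured_ms - expected_ms) <= tolerance * expected_ms


# Values copied from the command line and the validation log above.
expected_times_ms: List[int] = [14597, 14597, 14968, 14638, 14649, 14700, 14677, 14735]
measured_times_us: List[int] = [
    14802300, 14785782, 15233261, 14878058,
    14892945, 14993779, 14936348, 15031147,
]

results = [
    within_tolerance(measured, expected, tolerance=0.05)
    for measured, expected in zip(measured_times_us, expected_times_ms)
]

for rank, ok in enumerate(results):
    print(f"rank {rank}: {'within' if ok else 'outside'} tolerance")
```

Under this interpretation every rank deviates from its expected time by roughly 1 to 2 percent, comfortably inside the 5 percent bound.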
github-actions[bot] commented 1 month ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅