Add nsys report analyzer

FindHao commented 1 day ago

This PR add a nsys report analyzer providing metrics

nsys_metrics_to_reports = {
    # the sum of kernel execution time
    "nsys_gpu_kernel_sum": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the overhead of kernel launch
    "nsys_launch_overhead": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the names of kernels
    "nsys_kernel_names": ["cuda_gpu_kern_sum"],
    # the durations of kernels
    "nsys_kernel_durations": ["cuda_gpu_kern_sum"],
    # the duration of nvtx range
    "nsys_nvtx_range_duration": ["nvtx_sum"],
    # the number of kernels
    "nsys_num_of_kernels": ["cuda_gpu_kern_sum"],
}

nsys_gpu_kernel_sum is the sum of total GPU kernel execution time on GPUs, the nsys_nvtx_range_duration is the total execution time of the operator, and the nsys_launch_overhead is their difference which indicates the launch overhead. This is one way to measure execution time mentioned in https://github.com/pytorch-labs/tritonbench/issues/50

Fix https://github.com/pytorch-labs/tritonbench/issues/67

Test Plan:

% python run.py --op rope  --num-inputs 1  --metrics nsys_gpu_kernel_sum,nsys_launch_overhead,nsys_kernel_names,nsys_kernel_durations,nsys_nvtx_range_duration,nsys_num_of_kernels --csv --dump-csv
  0%|                                                                                                                                                         | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
  0%|          | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-531e.qdstrm'
[1/1] [0%                          ] nsys_output.nsys-repProcessing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated:
    /tmp/tritonbench/rope/nsys_traces/apply_rotary_pos_emb_0/nsys_output.nsys-rep
  0%|          | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-39ea.qdstrm'
[1/1] [0%                          ] nsys_output.nsys-repProcessing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated:
    /tmp/tritonbench/rope/nsys_traces/liger_rotary_pos_emb_0/nsys_output.nsys-rep
  0%|          | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-e8bf.qdstrm'
[1/1] [0%                          ] nsys_output.nsys-repProcessing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated:
    /tmp/tritonbench/rope/nsys_traces/inductor_rotary_pos_emb_full_op_0/nsys_output.nsys-rep
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:52<00:00, 52.40s/it]
(H, T);apply_rotary_pos_emb-nsys_kernel_names;apply_rotary_pos_emb-nsys_kernel_durations;apply_rotary_pos_emb-nsys_gpu_kernel_sum;apply_rotary_pos_emb-nsys_num_of_kernels;apply_rotary_pos_emb-nsys_launch_overhead;apply_rotary_pos_emb-nsys_nvtx_range_duration;liger_rotary_pos_emb-nsys_kernel_names;liger_rotary_pos_emb-nsys_kernel_durations;liger_rotary_pos_emb-nsys_gpu_kernel_sum;liger_rotary_pos_emb-nsys_num_of_kernels;liger_rotary_pos_emb-nsys_launch_overhead;liger_rotary_pos_emb-nsys_nvtx_range_duration;inductor_rotary_pos_emb_full_op-nsys_kernel_names;inductor_rotary_pos_emb_full_op-nsys_kernel_durations;inductor_rotary_pos_emb_full_op-nsys_gpu_kernel_sum;inductor_rotary_pos_emb_full_op-nsys_num_of_kernels;inductor_rotary_pos_emb_full_op-nsys_launch_overhead;inductor_rotary_pos_emb_full_op-nsys_nvtx_range_duration
(8192, 1024);['void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::BinaryFunctor<float, float, float, at::native::binary_internal::MulFunctor<float>>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::<unnamed>::CatArrayBatchedCopy<at::native::<unnamed>::OpaqueType<(unsigned int)4>, unsigned int, (int)4, (int)64, (int)64>(T1 *, at::native::<unnamed>::CatArrInputTensorMetadata<T1, T2, T4, T5>, at::native::<unnamed>::TensorSizeStride<T2, (unsigned int)4>, int, T2)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::CUDAFunctor_add<float>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::neg_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 2)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)'];0.090065;0.351364;4;0.4534;0.804764;['_triton_rope'];0.049281;0.049281;1;0.176437;0.225718;['triton_poi_fused_add_cat_mul_0', 'triton_poi_fused_add_cat_mul_1'];0.0266885;0.053377;2;0.444969;0.498346
[TritonBench] Dumped csv to /tmp/tritonbench/op_rope__z_yqmrz.csv

facebook-github-bot commented 1 day ago

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot commented 1 day ago

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

FindHao commented 1 day ago

Some results are not correct. working on the fix.

facebook-github-bot commented 1 day ago

@FindHao has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pytorch-labs / tritonbench

Add nsys report analyzer #65