pytorch-labs / tritonbench

Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
BSD 3-Clause "New" or "Revised" License

Need general flops metric from ncu report #33

Closed FindHao closed 2 weeks ago

xuzhao9 commented 3 weeks ago

Let's differentiate this from the current `tflops()` metric: the ncu report gives the hardware flops, whereas the current `tflops()` is the analytic flops (calculated from the math).

FindHao commented 3 weeks ago

> Let's differentiate this from the current `tflops()` metric: the ncu report gives the hardware flops, whereas the current `tflops()` is the analytic flops (calculated from the math).

Yeah, for sure. How about `--metrics hardware_tflops`?

xuzhao9 commented 3 weeks ago

How about we use a shorter name, `ncu_tflops`?

FindHao commented 3 weeks ago

> How about we use a shorter name, `ncu_tflops`?

sure. will add this feature later.

antferdom commented 2 weeks ago

@xuzhao9 What about using Triton's Proton profiler metadata metric scope instead? See the Triton and cuBLAS matmul kernel example.

xuzhao9 commented 2 weeks ago

@antferdom Yes, we plan to support the Proton profiler. However, the flops number it defines is the "analytic flops", which is different from the "hardware flops" reported by NCU. Tritonbench relies on each operator author to add the analytic flops, e.g., by adding a `tflops()` function decorated with `@register_metric()`.
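As a rough illustration only: a minimal sketch of such an operator-side analytic metric, assuming the `register_metric`/`BenchmarkOperator` interface in `tritonbench.utils.triton_op` and a plain `(a, b)` matmul input (the class body and shapes here are hypothetical, not copied from the repo):

```python
from typing import Any

from tritonbench.utils.triton_op import (
    BenchmarkOperator,
    BenchmarkOperatorMetrics,
    register_metric,
)


class Operator(BenchmarkOperator):
    # ... register_benchmark() implementations elided ...

    @register_metric()
    def tflops(
        self, fn_name: str, example_inputs: Any, metrics: BenchmarkOperatorMetrics
    ) -> float:
        # Analytic flops from the math of the op: a (M, K) x (K, N) matmul
        # performs 2 * M * N * K floating-point operations, regardless of
        # what instructions the kernel actually executes.
        a, b = example_inputs
        m, k = a.shape
        _, n = b.shape
        flops = 2 * m * n * k
        # metrics.latency is assumed to be in milliseconds, so convert to TFLOP/s.
        return flops / metrics.latency / 1e12 * 1e3
```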

FindHao commented 2 weeks ago

> @xuzhao9 What about using Triton's Proton profiler metadata metric scope instead? See the Triton and cuBLAS matmul kernel example.

We had some discussion here: https://github.com/pytorch/pytorch/pull/136169. I think we are open to doing it the Proton way too if anyone wants to help. I've got an ncu version locally and will push it later.

antferdom commented 2 weeks ago

@xuzhao9 Thanks for the clarification about the target FLOPs number: "analytic flops" (e.g. a user-defined formula, as in Proton) vs. NCU GPU hardware counters for precise flops counting. I'm also currently using ncu to automatically profile Torch Inductor Triton GPU kernels, based on the official Torch documentation. `@register_metric()` is similar to Proton's metric in `scope` and `metadata_fn`.

@FindHao I have a simple prototype for automatic annotation using Proton's `scope` context manager, as discussed in the issue. Before going further with Proton, I will wait for your ncu runner example.
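For reference, a sketch of the kind of Proton-scope annotation being described, assuming the `proton.start` / `proton.scope` / `proton.finalize` API from Triton's Proton tutorial; the scope name, the `flops` metric key, and the plain `torch.matmul` call are illustrative, and the attached number is the user-defined ("analytic") kind discussed above:

```python
import torch
import triton.profiler as proton


def profiled_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    # User-defined analytic flops attached to this region of the profile;
    # Proton aggregates the metric per scope rather than reading hardware counters.
    with proton.scope(f"matmul_{M}_{N}_{K}", {"flops": 2.0 * M * N * K}):
        return torch.matmul(a, b)


proton.start("matmul_profile")  # profile data is written out on finalize
profiled_matmul(
    torch.randn(1024, 1024, device="cuda"),
    torch.randn(1024, 1024, device="cuda"),
)
proton.finalize()
```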

FindHao commented 2 weeks ago

> @xuzhao9 Thanks for the clarification about the target FLOPs number: "analytic flops" (e.g. a user-defined formula, as in Proton) vs. NCU GPU hardware counters for precise flops counting. I'm also currently using ncu to automatically profile Torch Inductor Triton GPU kernels, based on the official Torch documentation. `@register_metric()` is similar to Proton's metric in `scope` and `metadata_fn`.
>
> @FindHao I have a simple prototype for automatic annotation using Proton's `scope` context manager, as discussed in the issue. Before going further with Proton, I will wait for your ncu runner example.

The key part of obtaining the flops is done here: https://github.com/pytorch-labs/tritonbench/blob/main/tritonbench/components/ncu/analyzer.py#L86. The remaining parts are aggregating the values and adding the metric to the results. Will do it later.
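For anyone curious before that lands, here is a minimal sketch of how flops can be pulled out of an ncu report with the `ncu_report` Python module that ships with Nsight Compute. The specific counter list and the FFMA-counts-as-two weighting are common conventions for fp32 flop counting and are assumptions here, not necessarily what `analyzer.py` does:

```python
# ncu_report ships with Nsight Compute under <install dir>/extras/python
import ncu_report

# SASS-level instruction counters commonly used for fp32 flop counting;
# an FFMA performs a multiply and an add, so it is weighted by 2.
FP32_FLOP_METRICS = {
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum": 1,
    "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum": 1,
    "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum": 2,
}


def total_fp32_flops(report_path: str) -> float:
    """Sum fp32 flops over every profiled kernel launch in an .ncu-rep file."""
    ctx = ncu_report.load_report(report_path)
    flops = 0.0
    for range_idx in range(ctx.num_ranges()):
        rng = ctx.range_by_idx(range_idx)
        for action_idx in range(rng.num_actions()):
            kernel = rng.action_by_idx(action_idx)  # one profiled kernel launch
            for metric_name, weight in FP32_FLOP_METRICS.items():
                metric = kernel.metric_by_name(metric_name)
                if metric is not None:
                    flops += weight * metric.as_double()
    return flops
```

Dividing such a total by the measured kernel time would then give the "hardware" TFLOPS discussed above, as opposed to the analytic number from `tflops()`.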