triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #6304

Open zamazan4ik opened 9 months ago

zamazan4ik commented 9 months ago

Hi!

Recently I have done many Profile-Guided Optimization (PGO) benchmarks on multiple projects; the results are available here. They cover applications from many domains that were accelerated with PGO: compilers, gRPC workloads, benchmark tools, databases, and much more. That's why I think it's worth trying to apply PGO to Triton as well.
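For reference, here is a minimal sketch of the usual two-phase PGO workflow with Clang. The CMake flag injection, the `/models` repository path, and the `perf_analyzer` warm-up load are illustrative assumptions, not Triton's actual build integration:

```bash
# Phase 1: instrumented build (injecting flags via CMake is an assumption;
# Triton's official build goes through build.py, so the hook point may differ)
cmake -DCMAKE_C_FLAGS="-fprofile-generate" \
      -DCMAKE_CXX_FLAGS="-fprofile-generate" ..
make -j"$(nproc)"

# Phase 2: run a representative workload so instrumentation emits .profraw files
./tritonserver --model-repository=/models &     # placeholder model repository
perf_analyzer -m my_model                       # any representative client load
kill %1

# Phase 3: merge the raw profiles and rebuild with the profile applied
llvm-profdata merge -output=triton.profdata default_*.profraw
cmake -DCMAKE_C_FLAGS="-fprofile-use=triton.profdata" \
      -DCMAKE_CXX_FLAGS="-fprofile-use=triton.profdata" ..
make -j"$(nproc)"
```

The key practical question is choosing a training workload that matches production traffic; a profile collected on an unrepresentative load can pessimize the hot paths instead of optimizing them.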

I can suggest the following things to do:

- Evaluate how PGO affects Triton's performance on representative inference workloads (a minimal build sketch is shown above).
- If PGO helps, evaluate LLVM BOLT as an additional optimization step on top of it (see the sketch after this list).
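For context, here is a minimal sketch of the standard BOLT flow on a Linux host with `perf` and LBR-capable hardware; the binary path and workload are assumptions carried over from the sketch above:

```bash
# Link with relocations preserved so BOLT can rearrange the final binary
cmake -DCMAKE_EXE_LINKER_FLAGS="-Wl,--emit-relocs" ..
make -j"$(nproc)"

# Sample a representative workload (-j any,u needs LBR, e.g. modern Intel CPUs);
# stop with Ctrl-C after the server has been exercised
perf record -e cycles:u -j any,u -o perf.data -- \
    ./tritonserver --model-repository=/models

# Convert the perf profile to BOLT's format and rewrite the binary
perf2bolt -p perf.data -o perf.fdata ./tritonserver
llvm-bolt ./tritonserver -o tritonserver.bolt -data=perf.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold -split-eh -dyno-stats
```

BOLT is applied to an already-built binary, so it composes with PGO rather than competing with it.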

oandreeva-nv commented 9 months ago

Thank you, @zamazan4ik, for your suggestions. I'll file a ticket for our team to investigate this proposal.