Closed fluidnumerics-joe closed 4 days ago
Thank you, Joe, for these insights and links!
I've documented omniperf here: https://github.com/stas00/ml-engineering/commit/a289a441a830301e5e647a3f6d251ca2428c7fac
And yes indeed the roofline plot is super-useful for optimizing one's kernels!
In reference to a remark on this page:
I have always been irked by theoretical peaks...
There are more than anecdotal reports available out there, however. AMD's omniperf provides roofline analysis in which the peak bandwidth and FLOPS shown in the diagram come from empirical measurements, obtained with microbenchmarks provided by AMD: https://hpc.rs/events/developing-hpc-applications-with-amd-gpus/16.%20intro_omniperf_long.pdf
Before omniperf came around, I always used gpumembench as a microbenchmark to at least get the sustained peak global memory bandwidth and FLOPS (for a variety of data types and ops).
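
For anyone unfamiliar with how those measured peaks get used, here is a minimal sketch of the standard roofline bound: attainable performance is the smaller of the compute roof and the memory roof (arithmetic intensity times bandwidth). This is not omniperf's or gpumembench's code. The function names are hypothetical, and the two peak values are made-up placeholders rather than measurements from any real GPU; you would substitute the sustained numbers your microbenchmark reports.

```python
def attainable_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline bound in FLOP/s.

    arithmetic_intensity: FLOPs performed per byte moved (FLOP/byte)
    peak_flops:           sustained compute peak (FLOP/s)
    peak_bandwidth:       sustained memory bandwidth (bytes/s)
    """
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity at which the memory and compute roofs meet."""
    return peak_flops / peak_bandwidth


# Placeholder values for illustration only (assumptions, not measurements).
EMPIRICAL_PEAK_FLOPS = 20e12  # FLOP/s
EMPIRICAL_PEAK_BW = 1e12      # bytes/s

ridge = ridge_point(EMPIRICAL_PEAK_FLOPS, EMPIRICAL_PEAK_BW)
for ai in (0.25, 4.0, 64.0):
    bound = attainable_flops(ai, EMPIRICAL_PEAK_FLOPS, EMPIRICAL_PEAK_BW)
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:6.2f} FLOP/byte  bound={bound / 1e12:6.2f} TFLOP/s  ({regime})")
```

Plugging in empirical peaks instead of datasheet peaks lowers both roofs, so the gap between a kernel's measured performance and the roof reflects what is actually achievable on the hardware.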
Great work, by the way!