
Empirical measurements of sustained bandwidth and FLOPS #77

Closed: fluidnumerics-joe closed this issue 4 days ago

fluidnumerics-joe commented 1 week ago

In reference to a remark on this page:

> The problem with the advertised theoretical peak FLOPS is that they are very theoretical and can't be achieved in practice even if all the perfect conditions have been provided. Each accelerator has its own realistic FLOPS which is not advertised and there are anecdotal community reports that do their best to find the actual best value, but I'm yet to find any official reports.
>
> If you find solid reports (papers?) showing the actual TFLOPS one can expect from one or more of the high end accelerators discussed in this chapter please kindly submit a PR with this information. The key is to have a reference to a source that the reader can validate the proposed information with.

I have always been irked by theoretical peaks...
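
For illustration, here is a minimal sketch of the kind of measurement being asked for: timing a large bf16 matmul with PyTorch on a CUDA device. The helper name `measure_matmul_tflops`, the shapes, and the iteration counts are arbitrary choices for this sketch, not anything from the book; real results vary with matrix sizes, dtype, clocks, and thermals.

```python
# Hedged sketch: estimate achieved matmul TFLOPS on a CUDA GPU with PyTorch.
import torch

def measure_matmul_tflops(n=8192, dtype=torch.bfloat16, iters=100):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    # warmup so cuBLAS settles on a kernel and the clocks ramp up
    for _ in range(10):
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters  # ms -> s, per matmul
    flops = 2 * n**3  # an n x n matmul performs ~2*n^3 floating-point ops
    return flops / seconds / 1e12

print(f"{measure_matmul_tflops():.1f} TFLOPS sustained")
```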

There are more than anecdotal reports available out there, however. AMD's omniperf provides roofline analysis, where the peak bandwidth and FLOPS reported in the diagram are obtained from empirical measurements using microbenchmarks provided by AMD: https://hpc.rs/events/developing-hpc-applications-with-amd-gpus/16.%20intro_omniperf_long.pdf
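
For context, the roofline model caps a kernel's attainable throughput at the lesser of the measured compute peak and arithmetic intensity times the measured memory bandwidth, which is exactly why empirically measured peaks matter. A minimal sketch of that bound follows; the peak numbers are placeholders, not measurements of any particular GPU.

```python
# Hedged sketch of the roofline bound that tools like omniperf plot.
def roofline_bound(ai_flops_per_byte, peak_tflops, peak_bw_tbps):
    """Attainable TFLOPS for a kernel with the given arithmetic intensity
    (FLOPs per byte moved), given measured compute and bandwidth peaks."""
    return min(peak_tflops, ai_flops_per_byte * peak_bw_tbps)

peak_tflops = 150.0  # placeholder: empirically measured compute peak, TFLOPS
peak_bw = 1.3        # placeholder: empirically measured bandwidth, TB/s
for ai in (0.25, 1, 10, 100, 1000):
    print(f"AI={ai:>6}: bound = {roofline_bound(ai, peak_tflops, peak_bw):.1f} TFLOPS")
```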

Before omniperf came around, I always used gpumembench as a microbenchmark to get at least the sustained peak global memory bandwidth and FLOPS (for a variety of data types and ops).
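
A gpumembench-style global memory bandwidth probe can be approximated in a few lines of PyTorch; this hedged sketch times a device-to-device copy (it assumes a CUDA device, and the helper name and sizes are illustrative, not gpumembench's actual methodology).

```python
# Hedged sketch: sustained device-to-device copy bandwidth on a CUDA GPU.
import torch

def measure_bandwidth_gbps(n_bytes=2**30, iters=100):
    src = torch.empty(n_bytes, device="cuda", dtype=torch.uint8)
    dst = torch.empty_like(src)
    for _ in range(10):  # warmup
        dst.copy_(src)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters  # ms -> s, per copy
    # each copy reads and writes every byte once, so 2x bytes move on the bus
    return 2 * n_bytes / seconds / 1e9

print(f"{measure_bandwidth_gbps():.0f} GB/s sustained copy bandwidth")
```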

Great work, by the way!

stas00 commented 4 days ago

Thank you, Joe, for these insights and links!

I've documented omniperf here: https://github.com/stas00/ml-engineering/commit/a289a441a830301e5e647a3f6d251ca2428c7fac

And yes, indeed, the roofline plot is super useful for optimizing one's kernels!