
Empirical measurements of sustained bandwidth and FLOPS #77

Closed: fluidnumerics-joe closed this issue 4 days ago

fluidnumerics-joe commented 1 week ago

In reference to a remark on this page:

> The problem with the advertised theoretical peak FLOPS is that they are very theoretical and can't be achieved in practice even if all the perfect conditions have been provided. Each accelerator has its own realistic FLOPS which is not advertised and there are anecdotal community reports that do their best to find the actual best value, but I'm yet to find any official reports.
>
> If you find solid reports (papers?) showing the actual TFLOPS one can expect from one or more of the high end accelerators discussed in this chapter please kindly submit a PR with this information. The key is to have a reference to a source that the reader can validate the proposed information with.

I have always been irked by theoretical peaks...
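
For illustration, here is a minimal sketch of the kind of measurement being asked for: timing a large bf16 matmul with PyTorch on a CUDA device. The helper name `measure_matmul_tflops`, the shapes, and the iteration counts are arbitrary choices for this sketch, not anything from the book; real results vary with matrix sizes, dtype, clocks, and thermals.

```python
# Hedged sketch: estimate achieved matmul TFLOPS on a CUDA GPU with PyTorch.
import torch

def measure_matmul_tflops(n=8192, dtype=torch.bfloat16, iters=100):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    # warmup so cuBLAS settles on a kernel and the clocks ramp up
    for _ in range(10):
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters  # ms -> s, per matmul
    flops = 2 * n**3  # an n x n matmul performs ~2*n^3 floating-point ops
    return flops / seconds / 1e12

print(f"{measure_matmul_tflops():.1f} TFLOPS sustained")
```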

There are more than anecdotal reports available out there, however. AMD's omniperf provides roofline analysis, where the peak bandwidth and FLOPS reported in the diagram are obtained from empirical measurements using microbenchmarks provided by AMD: https://hpc.rs/events/developing-hpc-applications-with-amd-gpus/16.%20intro_omniperf_long.pdf
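
For context, the roofline model caps a kernel's attainable throughput at the lesser of the measured compute peak and arithmetic intensity times the measured memory bandwidth, which is exactly why empirically measured peaks matter. A minimal sketch of that bound follows; the peak numbers are placeholders, not measurements of any particular GPU.

```python
# Hedged sketch of the roofline bound that tools like omniperf plot.
def roofline_bound(ai_flops_per_byte, peak_tflops, peak_bw_tbps):
    """Attainable TFLOPS for a kernel with the given arithmetic intensity
    (FLOPs per byte moved), given measured compute and bandwidth peaks."""
    return min(peak_tflops, ai_flops_per_byte * peak_bw_tbps)

peak_tflops = 150.0  # placeholder: empirically measured compute peak, TFLOPS
peak_bw = 1.3        # placeholder: empirically measured bandwidth, TB/s
for ai in (0.25, 1, 10, 100, 1000):
    print(f"AI={ai:>6}: bound = {roofline_bound(ai, peak_tflops, peak_bw):.1f} TFLOPS")
```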

Before omniperf came around, I always used gpumembench as a microbenchmark to get at least the sustained peak global memory bandwidth and FLOPS (for a variety of data types and ops).
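
A gpumembench-style global memory bandwidth probe can be approximated in a few lines of PyTorch; this hedged sketch times a device-to-device copy (it assumes a CUDA device, and the helper name and sizes are illustrative, not gpumembench's actual methodology).

```python
# Hedged sketch: sustained device-to-device copy bandwidth on a CUDA GPU.
import torch

def measure_bandwidth_gbps(n_bytes=2**30, iters=100):
    src = torch.empty(n_bytes, device="cuda", dtype=torch.uint8)
    dst = torch.empty_like(src)
    for _ in range(10):  # warmup
        dst.copy_(src)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters  # ms -> s, per copy
    # each copy reads and writes every byte once, so 2x bytes move on the bus
    return 2 * n_bytes / seconds / 1e9

print(f"{measure_bandwidth_gbps():.0f} GB/s sustained copy bandwidth")
```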

Great work, by the way!

stas00 commented 4 days ago

Thank you, Joe, for these insights and links!

I've documented omniperf here: https://github.com/stas00/ml-engineering/commit/a289a441a830301e5e647a3f6d251ca2428c7fac

And yes, indeed, the roofline plot is super useful for optimizing one's kernels!