Performance degradation on Nvidia GPUs

tdulcet commented 2 years ago

The performance on Nvidia GPUs has slowly degraded by up to 15% across all FFT lengths over the last few years. For example, here are some graphs of the GpuOwl performance over the last 522 commits (back to https://github.com/preda/gpuowl/commit/14c60f9224d2a2fd2a190d277b7eb74ced23b01c) on four Nvidia GPUs. Click on each graph to enlarge.

Single exponent

A wavefront first time exponent (6M FFT length).

A100

The speed has degraded from 389 us/iter to now 426 us/iter or about 9.5%.

Tesla V100

The speed has degraded from 599 us/iter to now 654 us/iter or about 9.2%.

Tesla P100

The speed has degraded from 987 us/iter to now 1104 us/iter or about 11.9%.

Tesla K80

The speed has degraded from 4,020 us/iter to now 4,597 us/iter or about 14.4%. The K80 GPU takes longer for the speed to settle, so this was after 20K iterations instead of 10K.
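For reference, each percentage above is the relative increase in time per iteration, (new - old) / old. Here is a minimal Python sketch of that arithmetic using only the figures quoted above; the function name `slowdown_percent` is made up for illustration and is not part of GpuOwl or the benchmark script.

```python
def slowdown_percent(old_us: float, new_us: float) -> float:
    """Relative increase in time per iteration, in percent."""
    return (new_us - old_us) / old_us * 100.0


# (fastest, current) us/iter pairs quoted above for the 6M FFT exponent
timings = {
    "A100": (389, 426),
    "Tesla V100": (599, 654),
    "Tesla P100": (987, 1104),
    "Tesla K80": (4020, 4597),
}

for gpu, (old, new) in timings.items():
    print(f"{gpu}: {slowdown_percent(old, new):.1f}% slower")
```

Note that this measures the increase in time per iteration, so a larger number means a slower GPU.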

Three exponents

A wavefront DC exponent, a wavefront first-time exponent and one slightly over 100 million digits (one FFT length above the wavefront).

A100

Tesla V100

Here is the raw CSV data used to generate these graphs, which shows the individual regressing commits:

A100
- Single exponent: A100 bench.csv
- Three exponents: A100 performance.csv
Tesla V100
- Single exponent: V100 bench.csv
- Three exponents: V100 performance.csv
Tesla P100: P100 bench.csv
Tesla K80: K80 bench2.csv

The severe performance regression was just fixed by @preda in https://github.com/preda/gpuowl/commit/3c23546cae8d24c2034ad55b7e48c02e59ae83c7. I would be happy to share the script I used to generate this data, if anyone would like to test their own GPU.

Huge thanks go to @Magellan3s for creating these A100 and V100 GPU VMs for me to do this benchmarking. 🙏

@Danc2050 and I are working on updating our GPU notebook for Google Colaboratory to use GpuOwl instead of CUDALucas. These are four of the six GPUs currently used by Colab, so users would obviously want them and all Nvidia GPUs to be as performant as possible.

selroc commented 2 years ago

I have personally tested an NVIDIA RTX A6000 and a GeForce RTX 3090 OC; both are slower at PRP than the AMD Radeon Pro VII and Radeon VII.

kotenok2000 commented 2 years ago

GpuOwl VERSION v7.2-93-ga5402c5-dirty reports 10398 us/it and ETA 9d 09:06 for PRP=N/A,1,2,77936867,-1,75,0 on NVIDIA GeForce GTX 1650
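As a rough sanity check on those numbers: a PRP test needs about as many iterations as the exponent, so the ETA is approximately the remaining iterations times the time per iteration. The sketch below assumes the test had only just started (no iterations completed yet). The helper names are hypothetical, and this is not necessarily how GpuOwl itself computes its ETA.

```python
def eta_seconds(exponent: int, us_per_iter: float, done_iters: int = 0) -> float:
    """Approximate remaining time, assuming roughly one iteration per bit of the exponent."""
    remaining = exponent - done_iters
    return remaining * us_per_iter / 1_000_000


def format_eta(seconds: float) -> str:
    """Format a duration as 'Nd HH:MM', truncating to whole minutes."""
    minutes = int(seconds // 60)
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}:{mins:02d}"


# Exponent and speed from the report above
print(format_eta(eta_seconds(77936867, 10398)))  # prints "9d 09:06"
```

The result agrees with the reported ETA, which is consistent with the test being at or near its first iteration.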

tdulcet commented 1 year ago

You would need to test your GPU with multiple GpuOwl versions to see if there have been any performance regressions on it over time.

> I would be happy to share the script I used to generate this data, if anyone would like to test their own GPU.

The script I used to generate the above data is now here: https://gist.github.com/tdulcet/13f7996b42e080e30a1ea46b0958082d. If anyone would like to test their GPU, just set the variables at the top and run bash performance.sh. By default, it will benchmark the last 550 commits and save the results in a bench.csv file. If the GPU takes more than 10K iterations for the speed to settle (as the K80 did above), then increase the number-of-iterations variable accordingly.

It supports all GPUs that GpuOwl supports, not just Nvidia ones. It assumes that one already has the clinfo command installed, as well as git and the other dependencies needed for building GpuOwl, including Make, GCC, GMP and OpenCL. Depending on the number of exponents one sets it to benchmark and the speed of their GPU, it may take anywhere from less than an hour to several hours to test.

jas4711 commented 1 year ago

Thanks for a great script @tdulcet -- how did you generate the graphs?

I can confirm the P100 results at about 1104 us/iter with the latest git master. It would be nice to get that back down to the fastest, around 1040-1076 us/iter. Is it safe to downgrade to that revision, or have serious bugs been fixed since then?

tdulcet commented 1 year ago

No problem. To generate a graph, just open the resulting bench.csv file in a spreadsheet program, such as LibreOffice Calc or Microsoft Excel. Feel free to post the graph and/or your raw data here.
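As an alternative to a spreadsheet, the regressing steps could also be listed with a few lines of Python. This is only a sketch: the column names `version` and `us_per_iter` are assumptions made for illustration and would need to be adjusted to the actual header of bench.csv, and the rows are assumed to be ordered oldest first. The three sample rows are the P100 figures mentioned in this thread.

```python
import csv
import io


def find_regressions(csv_text: str, threshold_pct: float = 1.0):
    """Return (version, before, after, percent) for each row slower than the previous one.

    Assumes hypothetical columns 'version' and 'us_per_iter', ordered oldest first.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    regressions = []
    for prev, cur in zip(rows, rows[1:]):
        before = float(prev["us_per_iter"])
        after = float(cur["us_per_iter"])
        change = (after - before) / before * 100.0
        if change > threshold_pct:
            regressions.append((cur["version"], before, after, round(change, 1)))
    return regressions


# P100 timings quoted in this thread (exponent 106,928,347)
sample = """version,us_per_iter
v6.11-99,987
v7.2-13,1063
master,1104
"""

for version, before, after, pct in find_regressions(sample):
    print(f"{version}: {before:.0f} -> {after:.0f} us/iter (+{pct}%)")
```

To run it on real data, read the file contents (for example with `open("bench.csv").read()`) and pass them to `find_regressions` after renaming the columns to match.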

For the P100 with the exponent I tested with (106,928,347), the fastest was 987 us/iter for several releases between v6.11-99 and v6.11-109. However, for PRP tests, I would not recommend going back past v6.11-318, as those earlier versions do not produce the needed PRP proof files. Version 7.2-13, for example, was 1063 us/iter, which is within the range you gave.

preda commented 2 months ago

Is there anything left to do here? (specifically, is PRPLL's performance a problem on Nvidia hw?)

tdulcet commented 2 months ago

Unfortunately, I am currently unable to test this due to OpenCL still being busted with recent Nvidia drivers on Linux...

ixfd64 commented 2 months ago

Are you seeing a performance degradation with other software?

I once had a laptop on which mfaktc performance dropped by almost 50% over the course of a year. It turns out the issue was likely due to a thermal paste failure on the GPU because Prime95 performance wasn't affected.

tdulcet commented 2 months ago

> Are you seeing a performance degradation with other software?

No, the above graphs were generated with my script within a few hours on multiple systems and GPUs. If OpenCL is working on your GPU, please feel free to run the script yourself and share the results. (It likely would need some tweaking for the newer PRPLL commits.)

preda commented 1 month ago

Closing as old, re-open if any issue remains.
