I have personally tested an NVIDIA RTX A6000 and a GeForce RTX 3090 OC; both are slower at PRP than the AMD Radeon Pro VII and Radeon VII.
GpuOwl VERSION v7.2-93-ga5402c5-dirty reports 10398 us/it and ETA 9d 09:06 for PRP=N/A,1,2,77936867,-1,75,0 on NVIDIA GeForce GTX 1650
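For anyone who wants to reproduce a single timing by hand, here is a minimal sketch (assuming GpuOwl is already built in the current directory):

```bash
# Queue the same PRP assignment as above, then run GpuOwl and watch the
# reported us/it in the log; stop it (Ctrl-C) once the speed settles.
echo "PRP=N/A,1,2,77936867,-1,75,0" > worktodo.txt
./gpuowl
```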
You would need to test your GPU with multiple GpuOwl versions to see if there have been any performance regressions on it over time.
I would be happy to share the script I used to generate this data, if anyone would like to test their own GPU.
The script I used to generate the above data is now here: https://gist.github.com/tdulcet/13f7996b42e080e30a1ea46b0958082d. If anyone would like to test their GPU, just set the variables at the top and run `bash performance.sh`. By default, it will benchmark the last 550 commits and save the results in a `bench.csv` file. If the GPU takes more than 10K iterations for the speed to settle (as the K80 did above), then increase the iterations variable accordingly.
It supports all GPUs that GpuOwl supports, not just Nvidia ones. It assumes that one already has the `clinfo` command installed, as well as `git` and the other dependencies needed for building GpuOwl, including Make, GCC, GMP and OpenCL. Depending on the number of exponents one sets it to benchmark and the speed of the GPU, it may take anywhere from less than an hour to several hours to run.
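For example, on Debian or Ubuntu, setup and a run might look like the sketch below (the package names are the usual Debian ones; the exact variables to set are whatever the top of performance.sh defines):

```bash
# Install clinfo, git and the GpuOwl build dependencies (Make, GCC, GMP, OpenCL)
sudo apt-get install clinfo git build-essential libgmp-dev ocl-icd-opencl-dev

# Edit the variables at the top of performance.sh first, then run it;
# the results accumulate in bench.csv.
bash performance.sh
```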
Thanks for a great script @tdulcet -- how did you generate the graphs?
I can confirm the P100 results at about 1104 us/iter with the latest git master. It would be nice to get that back down to the fastest, around 1040-1076 us/iter. Is it safe to downgrade to that revision, or have serious bugs been fixed since then?
No problem. To generate a graph, just open the resulting `bench.csv` file in a spreadsheet program, such as LibreOffice Calc or Microsoft Excel. Feel free to post the graph and/or your raw data here.
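If you would rather plot it from the command line, here is a gnuplot sketch (the column number is illustrative; adjust it to match the actual bench.csv layout):

```bash
# Plot the us/iter column against the row (commit) index of the CSV
gnuplot -persist -e "set datafile separator ','; set xlabel 'commit'; set ylabel 'us/iter'; plot 'bench.csv' using 0:2 with lines title 'us/iter'"
```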
For the P100 with the exponent I tested (106,928,347), the fastest was 987 us/iter for several releases between v6.11-99 and v6.11-109. However, for PRP tests I would not recommend going back past v6.11-318, as versions before that do not produce the needed PRP proof files. Version v7.2-13, for example, was 1063 us/iter, which is within the range you gave.
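If you do decide to roll back, something like this rebuilds GpuOwl at an older revision (names like v6.11-318 come from `git describe`, so you would check out the matching commit hash; `<commit-hash>` is a placeholder):

```bash
cd gpuowl
git log --oneline          # find the hash of the revision you want
git checkout <commit-hash> # e.g. the commit that git describe reports as v6.11-318
make
```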
Is there anything left to do here? (specifically, is PRPLL's performance a problem on Nvidia hw?)
Unfortunately, I am currently unable to test this due to OpenCL still being busted with recent Nvidia drivers on Linux...
Are you seeing a performance degradation with other software?
I once had a laptop on which mfaktc performance dropped by almost 50% over the course of a year. It turned out the issue was likely a thermal paste failure on the GPU, since Prime95 performance (on the CPU) was unaffected.
> Are you seeing a performance degradation with other software?
No, the above graphs were generated with my script within a few hours on multiple systems and GPUs. If OpenCL is working on your GPU, please feel free to run the script yourself and share the results. (It likely would need some tweaking for the newer PRPLL commits.)
Closing as old, re-open if any issue remains.
The performance on Nvidia GPUs has slowly degraded by up to 15% across all FFT lengths over the last few years. For example, here are some graphs of GpuOwl performance over the last 522 commits (back to https://github.com/preda/gpuowl/commit/14c60f9224d2a2fd2a190d277b7eb74ced23b01c) on four Nvidia GPUs. Click on each graph to enlarge.
Single exponent
A wavefront first-time exponent (6M FFT length).
A100
The speed has degraded from 389 us/iter to 426 us/iter, or about 9.5%.
Tesla V100
The speed has degraded from 599 us/iter to 654 us/iter, or about 9.2%.
Tesla P100
The speed has degraded from 987 us/iter to 1104 us/iter, or about 11.9%.
Tesla K80
The speed has degraded from 4,020 us/iter to 4,597 us/iter, or about 14.4%. The K80 takes longer for the speed to settle, so this was measured after 20K iterations instead of 10K.
Three exponents
A wavefront DC exponent, a wavefront first-time exponent and an exponent slightly over 100 million digits (one FFT length above the wavefront).
A100
Tesla V100
Here is the raw CSV data used to generate these graphs, which shows the individual commits where performance regressed:
The severe performance regression was just fixed by @preda in https://github.com/preda/gpuowl/commit/3c23546cae8d24c2034ad55b7e48c02e59ae83c7. I would be happy to share the script I used to generate this data, if anyone would like to test their own GPU.
Huge thanks goes to @Magellan3s for creating these A100 and V100 GPU VMs for me to do this benchmarking. 🙏
@Danc2050 and I are working on updating our GPU notebook for Google Colaboratory to use GpuOwl instead of CUDALucas. These are four of the six GPU models currently used by Colab, so users would obviously want them, and all Nvidia GPUs, to be as performant as possible.