Closed mmoadeli closed 7 months ago
@mmoadeli , Thank you for this modification. I will test your changes before doing a merge into the Velocity-Bench repo.
@mmoadeli , Update, I can confirm your modifications indeed gave a performance improvement on NVIDIA's H100 (205MH/s -> 208MH/s)
I will also need to check on other platforms before merging.
@mmoadeli : I have confirmed that your changes do not impact the performance on other platforms. Thank you very much!
@mmoadeli : I have confirmed that your changes do not impact the performance on other platforms. Thank you very much!
Thank you @jgtong
Addressing the issue of Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG could potentially result in performance regression through adding a basic block with an
exit
to the end of the CFG. An example demonstrating noticeable regression resulting from this change can be found in ethminer, where the code is built using DPC++ for the A100 CUDA backend, using below commands.The performance regression in the sample comes from a switch statement with an unreachable default which is lowered to
exit
and moved to the end of the CFG.To address this issue, a pull request was submitted to upstream-llvm. However, it has not been accepted after multiple variations due to concerns about potential side-effects. One suggestion was to modify the source code, which the current pull request offers.
Running ethminer built using DPC++ for the CUDA backend with the command
ethminer -S -M 1
yielded the following measurements before the patch:With the patch applied, the performance improved as follows. It's worth noting that similar measurements will be observed when using a revision of DPC++ built before introduction of Lower unreachable to exit to allow ptxas to accurately reconstruct the CFG:
This improvement indicates around 2% increase in overall hash rate, highlighting the significance of the change for the application's performance.