mcourteaux opened 3 months ago
Apparently Windows/OpenCL on the build bot shows no performance improvement, but rather a performance degradation (about 15%):
C:\build_bot\worker\halide-testbranch-main-llvm20-x86-64-windows-cmake\halide-build\bin\performance_fast_arctan.exe
atan: 6.347030 ns per pixel
fast_atan: 7.295760 ns per pixel
atan2: 0.923191 ns per pixel
fast_atan2: 0.926148 ns per pixel
fast_atan more than 10% slower than atan on GPU.
Suggestions?
The GPU performance test was severely memory-bandwidth limited. This has been worked around by computing many (1024) arctans per output and summing them. Now, at least on my system, they are faster. See the updated performance reports.
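For reference, the shape of that workaround looks roughly like this (a hedged sketch; `N`, `input`, and `out` are illustrative names, not the actual test code):

```cpp
#include "Halide.h"
using namespace Halide;

// Make the benchmark compute-bound: evaluate many arctans per output
// element and sum them, so memory bandwidth no longer dominates.
Func make_compute_bound_kernel(Func input) {
    Var x("x"), y("y");
    const int N = 1024;  // arctans per output element
    RDom r(0, N);
    Func out("out");
    // Perturb the argument with the reduction index so a single atan
    // cannot be hoisted out of the summation.
    out(x, y) = sum(atan(input(x, y) + r * 1e-6f));
    return out;
}
```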
Okay, this is ready for review. Vulkan is slow, but that is apparently well known...
Oh dear... I don't even know what WebGPU is... @steven-johnson Is this supposed to be an actual platform that is fast, where performance metrics make sense? Or can I treat it like Vulkan, where it's just "meh, at least some are faster..."?
https://en.wikipedia.org/wiki/WebGPU
https://www.w3.org/TR/webgpu/
https://github.com/gpuweb/gpuweb/wiki/Implementation-Status
I don't think Vulkan is necessarily slow ... I think the benchmark loop is including initialization overhead. See my follow-up here: https://github.com/halide/Halide/issues/7202
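One way to keep that overhead out of the numbers (a minimal sketch assuming the `halide_benchmark.h` harness; not the PR's actual test code) is an untimed warm-up run before the benchmark loop:

```cpp
#include "Halide.h"
#include "halide_benchmark.h"
using namespace Halide;

double ns_per_element(Func f, Buffer<float> &out) {
    f.realize(out);         // untimed warm-up: JIT compilation + device init
    out.device_sync();
    double t = Tools::benchmark(10, 10, [&]() {
        f.realize(out);
        out.device_sync();  // make sure the GPU actually finished
    });
    return t * 1e9 / (out.width() * out.height());
}
```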
Very cool! I have some concerns with the error metric, though. Decimal digits of error isn't a great metric. E.g. having a value of 0.0001 when it's supposed to be zero is much, much worse than having a value of 0.3701 when it's supposed to be 0.37. Relative error isn't great either, due to the singularity at zero. A better metric is ULPs: the maximum, over all inputs, of the number of distinct floating-point values between the answer and the correct answer.
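For the record, a sketch of that metric (assuming finite inputs; NaN/infinity handling omitted for brevity):

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// ULP distance: how many representable floats lie between a and b.
int64_t ulp_distance(float a, float b) {
    auto to_ordered = [](float f) -> int64_t {
        int32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));
        // Map the sign-magnitude float encoding onto a monotonic integer
        // line, so adjacent floats differ by exactly 1.
        return bits >= 0 ? int64_t(bits) : int64_t(INT32_MIN) - bits;
    };
    return std::llabs(to_ordered(a) - to_ordered(b));
}
```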
There are also cases where you want a hard constraint as opposed to a minimization. exp(0) should be exactly one, and I guess I decided its derivative should be exactly one too, which explains the difference in coefficients.
@abadams I improved the optimization script a lot. I added support for ULP optimization: it optimizes very nicely for maximal bit error.
When instead optimizing for MAE, we see the max ULP distance increase:
I changed the default to the ULP-optimized one, but to keep the maximal absolute error under 1e-5, I had to choose the higher-degree polynomial. Overall still good.
@derek-gerstmann Thanks a lot for investigating the performance issue! I now also get very fast Vulkan performance. I wonder why the overhead is so huge in Vulkan but not in the other backends?
Vulkan:
atan: 0.009071 ns per atan
fast_atan (Poly2): 0.005076 ns per atan (44.0% faster) [per invocation: 0.340618 ms]
fast_atan (Poly3): 0.005279 ns per atan (41.8% faster) [per invocation: 0.354284 ms]
fast_atan (Poly4): 0.005484 ns per atan (39.5% faster) [per invocation: 0.368018 ms]
fast_atan (Poly5): 0.005925 ns per atan (34.7% faster) [per invocation: 0.397631 ms]
fast_atan (Poly6): 0.006225 ns per atan (31.4% faster) [per invocation: 0.417756 ms]
fast_atan (Poly7): 0.006448 ns per atan (28.9% faster) [per invocation: 0.432734 ms]
fast_atan (Poly8): 0.006765 ns per atan (25.4% faster) [per invocation: 0.453989 ms]
atan2: 0.013717 ns per atan2
fast_atan2 (Poly2): 0.007812 ns per atan2 (43.0% faster) [per invocation: 0.524279 ms]
fast_atan2 (Poly3): 0.007604 ns per atan2 (44.6% faster) [per invocation: 0.510290 ms]
fast_atan2 (Poly4): 0.008016 ns per atan2 (41.6% faster) [per invocation: 0.537952 ms]
fast_atan2 (Poly5): 0.008544 ns per atan2 (37.7% faster) [per invocation: 0.573364 ms]
fast_atan2 (Poly6): 0.008204 ns per atan2 (40.2% faster) [per invocation: 0.550533 ms]
fast_atan2 (Poly7): 0.008757 ns per atan2 (36.2% faster) [per invocation: 0.587663 ms]
fast_atan2 (Poly8): 0.008629 ns per atan2 (37.1% faster) [per invocation: 0.579092 ms]
Success!
CUDA:
atan: 0.010663 ns per atan
fast_atan (Poly2): 0.006854 ns per atan (35.7% faster) [per invocation: 0.459946 ms]
fast_atan (Poly3): 0.006838 ns per atan (35.9% faster) [per invocation: 0.458894 ms]
fast_atan (Poly4): 0.007196 ns per atan (32.5% faster) [per invocation: 0.482914 ms]
fast_atan (Poly5): 0.007646 ns per atan (28.3% faster) [per invocation: 0.513141 ms]
fast_atan (Poly6): 0.008205 ns per atan (23.1% faster) [per invocation: 0.550595 ms]
fast_atan (Poly7): 0.008496 ns per atan (20.3% faster) [per invocation: 0.570149 ms]
fast_atan (Poly8): 0.009008 ns per atan (15.5% faster) [per invocation: 0.604508 ms]
atan2: 0.014594 ns per atan2
fast_atan2 (Poly2): 0.009409 ns per atan2 (35.5% faster) [per invocation: 0.631451 ms]
fast_atan2 (Poly3): 0.009957 ns per atan2 (31.8% faster) [per invocation: 0.668201 ms]
fast_atan2 (Poly4): 0.010289 ns per atan2 (29.5% faster) [per invocation: 0.690511 ms]
fast_atan2 (Poly5): 0.010255 ns per atan2 (29.7% faster) [per invocation: 0.688207 ms]
fast_atan2 (Poly6): 0.010748 ns per atan2 (26.4% faster) [per invocation: 0.721268 ms]
fast_atan2 (Poly7): 0.011497 ns per atan2 (21.2% faster) [per invocation: 0.771529 ms]
fast_atan2 (Poly8): 0.011326 ns per atan2 (22.4% faster) [per invocation: 0.760067 ms]
Success!
Vulkan is now even faster than CUDA! 🤯
@steven-johnson The build just broke on something LLVM-related, it seems... There is no related Halide commit. Does LLVM get updated with every build?
Edit: I found the commit: https://github.com/llvm/llvm-project/commit/75c7bca740935a0cca462e28475dd6b046a6872c
Fix separately PR'd in #8391
We rebuild LLVM once a day, about 2AM Pacific time.
@abadams I added the check that counts the number of wrong mantissa bits:
Testing for precision 1.0e-02 (MAE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 4.96906e-03 max mantissa bits wrong: 19
Testing fast_atan2() correctness... Passed: max abs error: 4.96912e-03 max mantissa bits wrong: 19
Testing for precision 1.0e-03 (MAE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 6.10709e-04 max mantissa bits wrong: 17
Testing fast_atan2() correctness... Passed: max abs error: 6.10709e-04 max mantissa bits wrong: 17
Testing for precision 1.0e-04 (MAE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 8.16584e-05 max mantissa bits wrong: 14
Testing fast_atan2() correctness... Passed: max abs error: 8.17776e-05 max mantissa bits wrong: 14
Testing for precision 1.0e-05 (MAE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 1.78814e-06 max mantissa bits wrong: 9
Testing fast_atan2() correctness... Passed: max abs error: 1.90735e-06 max mantissa bits wrong: 9
Testing for precision 1.0e-06 (MAE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 3.57628e-07 max mantissa bits wrong: 6
Testing fast_atan2() correctness... Passed: max abs error: 4.76837e-07 max mantissa bits wrong: 7
Testing for precision 1.0e-02 (MULPE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 1.31637e-03 max mantissa bits wrong: 15
Testing fast_atan2() correctness... Passed: max abs error: 1.31637e-03 max mantissa bits wrong: 15
Testing for precision 1.0e-03 (MULPE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 1.54853e-04 max mantissa bits wrong: 12
Testing fast_atan2() correctness... Passed: max abs error: 1.54972e-04 max mantissa bits wrong: 12
Testing for precision 1.0e-04 (MULPE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 2.53320e-05 max mantissa bits wrong: 9
Testing fast_atan2() correctness... Passed: max abs error: 2.55108e-05 max mantissa bits wrong: 9
Testing for precision 1.0e-05 (MULPE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 3.63588e-06 max mantissa bits wrong: 6
Testing fast_atan2() correctness... Passed: max abs error: 3.81470e-06 max mantissa bits wrong: 6
Testing for precision 1.0e-06 (MULPE optimized)...
Testing fast_atan() correctness... Passed: max abs error: 5.96046e-07 max mantissa bits wrong: 4
Testing fast_atan2() correctness... Passed: max abs error: 7.15256e-07 max mantissa bits wrong: 4
Success!
Pay attention to the MULPE-optimized ones: their maximal numbers of wrong mantissa bits are significantly lower than those of the MAE-optimized ones.
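My reading of the "max mantissa bits wrong" figure (a sketch, not the exact test code): it is essentially the bit width of the ULP distance to the reference value, reusing the `ulp_distance` sketch above:

```cpp
// Number of low mantissa bits that differ from the reference, taken as
// the position of the highest set bit of the ULP distance.
int mantissa_bits_wrong(float approx, float exact) {
    int64_t d = ulp_distance(approx, exact);
    int bits = 0;
    while (d > 0) {
        bits++;
        d >>= 1;
    }
    return bits;  // 0 when bit-exact
}
```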
Ping to @abadams or @zvookin for review
Plan: cut the polynomial, merge it, and take care of the other transcendentals later.
@abadams I updated the PR, and I believe this is a nice compromise of the options. It is in line with your initial thoughts on just specifying the precision yourself. I have made a table of approximations and their precisions. A new auxiliary function then selects an approximation from that table that satisfies your requirements. This clears out the header (no more one million enum options) and clears out the source file by not having the table sitting inside the fast_atan function.
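In spirit, the design is something like this (an illustrative sketch with made-up names, not the actual Halide source):

```cpp
#include <stdexcept>
#include <vector>

struct ApproximationEntry {
    std::vector<float> coefficients;  // odd-power polynomial coefficients
    float max_abs_error;              // measured MAE of this polynomial
    int max_ulp_error;                // measured maximal ULP distance
};

// Pick the cheapest (lowest-degree) entry that satisfies the requested
// precision; the table is sorted from cheapest to most precise.
const ApproximationEntry &select_approximation(
        const std::vector<ApproximationEntry> &table, float max_abs_error) {
    for (const ApproximationEntry &e : table) {
        if (e.max_abs_error <= max_abs_error) return e;
    }
    throw std::runtime_error("no approximation satisfies the requested precision");
}
```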
Looks like this is ready for final review... ?
Addresses #8243. Uses a polynomial approximation with only odd powers: this way, it's immediately symmetrical around 0. Coefficients are optimized using my script, which does iterative weight-adjusted least-squares fitting (also included in this PR; see below).
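The general shape of such an approximation looks like this (a hedged sketch; the coefficients below are placeholders, not the tuned ones from this PR):

```cpp
#include <cmath>

float fast_atan_sketch(float x) {
    // Placeholder odd-power coefficients; the PR's tuned tables differ.
    const float c[] = {0.9998660f, -0.3302995f, 0.1801410f,
                       -0.0851330f, 0.0208351f};
    const float half_pi = 1.5707963f;
    // Range reduction: atan(x) = +/-pi/2 - atan(1/x) for |x| > 1.
    bool invert = std::fabs(x) > 1.0f;
    float t = invert ? 1.0f / x : x;
    // Horner evaluation on t^2 gives t*(c0 + c1*t^2 + c2*t^4 + ...):
    // only odd powers of t, so the result is symmetric around 0
    // by construction.
    float t2 = t * t;
    float p = c[4];
    for (int i = 3; i >= 0; i--) {
        p = p * t2 + c[i];
    }
    p *= t;
    return invert ? std::copysign(half_pi, x) - p : p;
}
```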
Added API
I designed this new ApproximationPrecision such that it can be used for other vectorizable functions at a later point as well, such as for fast_sin and fast_cos if we want that at some point. Note that I chose the MAE_1e_5 style of notation, instead of 5Decimals, because "5 decimals" suggests that there will be 5 correct decimals, which is technically less correct than saying that the maximal absolute error will be below 1e-5.
Performance difference:
Linux/CPU (with precision MAE_1e_5):
On Linux/CUDA, it's slightly faster than the default LLVM implementation (there is no atan instruction in PTX):
On Linux/OpenCL, it is also slightly faster:
Precision tests:
Optimizer
This PR includes a Python optimization script to find the coefficients of the polynomials:
While I didn't do anything very scientific or look at research papers, I have a hunch that the results from this script are really good (and may actually converge to optimal).
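My reading of the iteration (a sketch of a Lawson-style reweighting; the actual script may differ in details): each round solves a weighted least-squares fit, then boosts the weights where the residual is largest, which pushes the fit toward an equioscillating, minimax-like solution:

```latex
c^{(k)}     = \arg\min_c \sum_i w_i^{(k)} \bigl(p_c(x_i) - \arctan(x_i)\bigr)^2,
\qquad
w_i^{(k+1)} \propto w_i^{(k)} \bigl| p_{c^{(k)}}(x_i) - \arctan(x_i) \bigr|
```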
If my optimization makes sense, then I have a funny observation: I get different coefficients for all of the fast approximations we already have. See below.
Better coefficients for exp()?
My result:
versus current Halide code:
https://github.com/halide/Halide/blob/3cdeb5398fb87be699fa830f843ca5d05fe6b983/src/IROperator.cpp#L1432-L1439
Better coefficients for sin()?
Notice that my optimization gives a maximal error of 1.35e-11, instead of the promised 1e-5, with degree 6.
Versus:
https://github.com/halide/Halide/blob/3cdeb5398fb87be699fa830f843ca5d05fe6b983/src/IROperator.cpp#L1390-L1394
If this is true (I don't see a reason why it wouldn't be), that would mean we can remove a few terms to get a faster version that still provides the promised precision.
Better coefficients for cos()?
versus:
https://github.com/halide/Halide/blob/3cdeb5398fb87be699fa830f843ca5d05fe6b983/src/IROperator.cpp#L1396-L1400
Better coefficients for log()?
versus:
https://github.com/halide/Halide/blob/3cdeb5398fb87be699fa830f843ca5d05fe6b983/src/IROperator.cpp#L1357-L1365