perazz / fortran-bessels

Fortran port of the Bessels.jl repository
MIT License
10 stars 2 forks source link

Replace evalpoly algorithm with Horner's method #2

Closed ivan-pi closed 1 year ago

ivan-pi commented 1 year ago
ivan@maxwell:~/fortran/fortran-bessels$ gfortran -ffree-line-length-none -O3 -march=native -ffast-math src/bessels_constants.f90 src/bessels.f90 test/bessels_test.f90 -o bessels_test
ivan@maxwell:~/fortran/fortran-bessels$ ./bessels_test 
INTRINSIC time used:   35.9405 ns/eval, sum(z)=29.170994438817502
PACKAGE   time used:   18.8808 ns/eval, sum(z)=29.170994438745343
[bessels] 3 test completed: 3 passed, 0 failed.
STOP 0

With this modification, the new approximation appears to be faster than the intrinsic and equally precise.

I've also fixed an error introduced in #1 on L216 of bessel_constants.f90.

perazz commented 1 year ago

Wow, this is awesome, thank you for sharing it! I will merge that immediately. I wonder what will happen if we then replace the loop with an ad-hoc implementation for each polynomial degree like: (((((a5*x+a4)*x+a3)*x+a2)*x+a1)*x+a0)

ivan-pi commented 1 year ago

Yesterday I didn't bother to check, but it would be good to look at the output of -fopt-info and -fopt-info-missed to see what optimizations have been applied/missed.

Compared to the function call, an inline Horner statement has the advantage it makes the job of an optimizing compiler easier. For high polynomial degrees, it's sometimes better to split the polynomial into multiple parts which can be evaluated in parallel by a SIMD unit (see https://en.wikipedia.org/wiki/Horner%27s_method#Parallel_evaluation), but I've never tested this.