j3-fortran / fortran_proposals

Proposals for the Fortran Standard Committee
175 stars 14 forks source link

Combined intrinsics #264

Closed perazz closed 1 year ago

perazz commented 2 years ago

I'm often refrained from employing Fortran's array-searching and math intrinsics because they lead to redundant overhead in performance-critical applications. For example, imagine finding the locations of minimum and maximum value in an array:

real(real64) :: myData(:)

dmin = minval(myData,1,...)
dmax = maxval(myData,1,...)

Such are very common cases, which incur pretty large overhead compared to methods that produce both results in a single iteration. In this case, a quick sort would be enough:

call sort(myData)
dmin = myData(1)
dmax = myData(size(myData))

I'm not sure if the better way to go would be to either standardize more flexible versions as opposed to having them implemented in a library, eager to know what the Fortranners out there think!

Here's some functions I'd like to see

1) Min/max bounds and locations, like


! Returns the values of the minimum and maximum elements in an array/matrix
call boundval(array, amin, amax, [dim], [mask])

2) combined sin/cos functions


! Returns sine and cosine of a given 
call sincos(x,sinx,cosx)
certik commented 2 years ago

Related to this is the NumPy's function argsort that returns an integer array that sorts the argument array. Here is a Fortran implementation: https://github.com/certik/fortran-utils/blob/master/src/sorting.f90#L179, although inefficient ($O(n^2)$ instead of $O(n \log(n))$).

perazz commented 2 years ago

That's right, I'm no compiler expert but I believe most compilers would do an internal sorting algorithm or use any sorting networks to implement minloc, maxloc, minval, maxval, because they're all $O(n \log(n))$ methods. Similarly for sin and cos, they could be internally sped up by i.e. running $\sin x = \pm \sqrt{1-\cos^2(x)}$, not mentioning with architectural optimizations like in this project

fazedo commented 2 years ago

For the item 2, just call sin and cos. Most compiler will optimize it to a single call to sincos function. See https://godbolt.org/z/T3zs9vM9E

certik commented 2 years ago

@fazedo true. But also isn't it the case that the sin and sincos instructions are actually slower than vectorized polynomial fit by hand? That is my understanding at least in terms of clock cycle counts.

gklimowicz commented 2 years ago

I'm not aware of any compiler that bothers to sort the array entries. You can determine minloc, minval, etc. just by scanning the array. That's just O(n), and you don't need to create any temporaries along the way.

fazedo commented 2 years ago

@certik I don't know. I rely on intrinsic implementations.

perazz commented 2 years ago

I'm not aware of any compiler that bothers to sort the array entries. You can determine minloc, minval, etc. just by scanning the array. That's just O(n), and you don't need to create any temporaries along the way.

You're right, that's the case of gfortran for example (see this godbolt link). That also shows my point: minloc and maxloc are two identical functions, that perform the same identical scan, for either the max or min value. Having a combined minmax or similar intrinsic would be able to do the same in just 1 scan instead of two

certik commented 2 years ago

@perazz just like with sin and cos, compilers seem to be able to identify this and "merge" them; I wonder if it is better left for compilers to identify minval and maxval and combine them as needed.

The argsort example above is slightly different, I think we need it.