golang / go

The Go programming language
https://go.dev

math: add guaranteed FMA #25819

Closed TuomLarsen closed 4 years ago

TuomLarsen commented 6 years ago

Please consider adding fused multiply–add (FMA) to the standard library.

FMA computes a*b + c but with only one floating-point rounding instead of two. If a CPU instruction is available, it might even be faster than a separate multiplication and addition, but the main reason for using it is the increased precision.

The use cases include calculating dot products, evaluating polynomials, matrix multiplication, and many more.

I think the largest difficulty would be providing a correct fallback in case the CPU does not support it directly.
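For illustration, a minimal sketch of the kind of API being requested, using a dot product as the example use case (the signature shown is the one that eventually shipped as math.FMA; see the end of this thread):

package example

import "math"

// dot computes the dot product of xs and ys. With a fused multiply-add,
// each iteration incurs a single rounding (of xs[i]*ys[i] + s) instead of
// two (one for the multiply, one for the add).
func dot(xs, ys []float64) float64 {
	var s float64
	for i := range xs {
		s = math.FMA(xs[i], ys[i], s)
	}
	return s
}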

agnivade commented 6 years ago

Dup of https://github.com/golang/go/issues/8037. I don't think we want an explicit math.FMA function in the standard library. The compiler should detect expressions like a*b + c and replace them with VFMADD instructions.

TuomLarsen commented 6 years ago

I don't think this is a duplicate of #8037, which is about making the compiler recognize a*b + c and emit an FMA instruction at its own discretion. It might choose to replace that expression with FMA or not; sometimes it is even desirable to keep it as is. This proposal is about adding a way to explicitly request FMA, or an equivalent fallback should the CPU not support it.

Therefore please consider reopening this issue; I believe it is about something else.

agnivade commented 6 years ago

Yes, I doubt we would want to expose a function by which you can explicitly request FMA. It is something like intrinsics, and we usually do not expose those in the standard library.

Anyways, I am re-opening this so that the proposal review committee can take a final call.

/cc @ianlancetaylor , @rsc

alexd765 commented 6 years ago

@TuomLarsen: what would be the benefit of declaring that manually compared to the compiler figuring it out automatically?

randall77 commented 6 years ago

@alexd765 : Because if your application needs the extra precision, you can't rely on compiler optimizations accidentally introducing extra precision for you. For instance, it won't be portable.

@agnivade: Do you have a specific application where the extra precision is needed? It's going to be hard to evaluate this proposal without a better understanding of where it is needed.

bmkessler commented 6 years ago

For reference, note that the FMA operation is included in the IEEE 754-2008 revision of the floating-point standard, and the fma function was added to the C99 math standard library (ISO/IEC 9899:1999), as well as being included in other languages such as Java.

Some brief discussion of the use of an explicit FMA was also mentioned in the original FMA issue #17895

josharian commented 6 years ago

cc @btracey @kortschak

btracey commented 6 years ago

Also: @kunde21

agnivade commented 6 years ago

@agnivade: Do you have a specific application where the extra precision is needed? It's going to be hard to evaluate this proposal without a better understanding of where it is needed.

I think you meant to ping @TuomLarsen

TuomLarsen commented 6 years ago

@randall77 It is useful whenever one needs to compute a*b + c (repeatedly). So mostly numerical algorithms such as dot product, matrix multiplication, polynomial evaluation, Newton's method, convolutions and artificial neural networks, ... (from Wikipedia).

But I guess you knew that already, so taking the first option as an example: the first hit for a "dot product fma" query is a poster comparing the classical dot product computation with various FMA-flavoured ones. The main takeaway is the last line: "CompDot is about 6 times faster than ... DotXBLAS" while providing the same accuracy as if it were calculated with the unevaluated sum of two doubles (which is almost as good as quadruple-precision floats).

rsc commented 6 years ago

What is the proposed implementation of math.FMA on systems that do not have an FMA instruction in hardware?

rsc commented 6 years ago

To elaborate on the previous comment ("What is the proposed fallback implementation when there's no hardware support?"):

If the proposed fallback is return a*b+c, then there is no difference between math.FMA(a,b,c) and a*b+c, so we don't need a separate function, in which case this is a dup of #8037 as @agnivade said. That is, Go's current a*b+c seems to match java.lang.Math.fma, so no new function needed for that.

If the proposed fallback is "do non-trivial work to somehow produce a result that is higher precision than float64(a*b)+c ("FMA disabled") would be", then OK, this is an issue to leave open. That "must be equal in precision and result to IEEE 754 single-rounding FMA" definition would merit its own function math.FMA and would correspond to java.lang.StrictMath.fma. I'm going to assume the proposal is for this "always bit-perfect FMA" meaning, because otherwise we can just close it.

@TuomLarsen, the Wikipedia article correctly says "A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products."

If you care about speed alone, then the current Go compiler optimization of compiling a*b+c as an FMA (which Java never does) provides that speed. So if you just wanted a fast dot product, then you'd want to write it with a*b+c expressions, and the compiler would use the fastest implementation available - an FMA where it exists, and otherwise a separate multiply and add. You would not want to call the proposed math.FMA, because it would be very slow on systems without an FMA, where the bit-precise answer would have to be computed in significantly more than 2 floating-point operations.

If you care about accuracy more than speed, that's when you'd want math.FMA, to be able to force the bit-precise FMA results, where you'd be willing to run significantly slower than a*b+c on non-FMA hardware in order to get those bit-precise results. @randall77's question is asking "what is the specific situation where you foresee needing to make that decision?"
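To illustrate the three behaviors being contrasted (a, b, and c stand for arbitrary float64 values; math.FMA is the function name proposed in this issue):

r1 := a*b + c           // compiler may or may not fuse: fastest, but rounding is platform-dependent
r2 := float64(a*b) + c  // explicit conversion forces the product to be rounded: always two roundings
r3 := math.FMA(a, b, c) // proposed: always a single rounding, via a software fallback if there is no FMA instruction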

TuomLarsen commented 6 years ago

The proposal is about precise, always correct (i.e. "strict") a*b+c with only one rounding.

If it is faster to use FMA than a separate multiplication and addition, that is a bonus, but it is not the main motivation here. The linked Wikipedia page lists quite a lot of modern processors with support for FMA, so calling the slow fallback should be quite rare. As for the simple a*b+c fallback (multiply and add), I'm afraid it would not make a lot of sense, as the "F" means fused, i.e. only one rounding.

So yes, this proposal cares more about accuracy than speed.

PS: As for the actual fallback, there is probably more than one way of doing it; this is what the Julia folks have done, for example.

rsc commented 6 years ago

@TuomLarsen, OK, great, thank you for clarifying that the proposal is about a strict bit-precise FMA.

Please help me read the Wikipedia page: what is an example of a motivating, common application that cares so much about accuracy that it would prefer a slow software FMA implementation over a non-fused multiply-add?

(The point about "quite a lot of modern processors support FMA" cuts against having a special function. If everything is already doing FMA for a*b+c then why bother adding a special math.FMA? There must be some compelling use where you absolutely have to have the bit-precise FMA.)

I'm sorry if we're talking past each other a bit.

bmkessler commented 6 years ago

One benefit of fma is tracking floating-point errors, which allows carrying higher precision (>float64, e.g. double-double) internally for certain calculations to ensure an accurate result at the end. Calculating x*y exactly is a building block for extended-precision calculations.

zhi := x*y
zlo := fma(x, y, -zhi)  // zlo = x*y-zhi

then x*y == zhi + zlo exactly.
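As a concrete check, here is a small self-contained program (written with the math.FMA spelling that was eventually adopted) that verifies the split against an exact product computed with math/big:

package main

import (
	"fmt"
	"math"
	"math/big"
)

func main() {
	x, y := 1.0/3.0, 0.1

	zhi := x * y
	zlo := math.FMA(x, y, -zhi) // rounding error of x*y, captured exactly

	// Compare zhi+zlo against the exact product, using enough precision
	// (the exact product of two float64 values needs at most ~106 bits).
	exact := new(big.Float).SetPrec(200).Mul(big.NewFloat(x), big.NewFloat(y))
	sum := new(big.Float).SetPrec(200).Add(big.NewFloat(zhi), big.NewFloat(zlo))
	fmt.Println(exact.Cmp(sum) == 0) // true: zhi + zlo == x*y exactly
}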

The following paper from IBM provides some motivating examples.

The Fused Multiply-Add Instruction Leads to Algorithms for Extended-Precision Floating Point: Applications to Java and High-Performance Computing

Finally, we demonstrate the accuracy of the algorithms on example problems (matrix multiplication, 2 x 2 determinant, complex multiplication, and triangle area calculation), for which existing computer arithmetic gives completely inaccurate results in certain instances.

rsc commented 6 years ago

@bmkessler, thanks for that link. Very interesting.

TuomLarsen commented 6 years ago

@rsc In addition to what @bmkessler said: FMA improves the accuracy of a*b+c, usually by at least a bit. But then there are error-free transformations, which in addition to the normal floating-point operations account for the rounding errors and allow for almost double the working floating-point precision (so in the case of double precision, they are almost as precise as if calculated with quadruple precision). A basic building block is the EFT product, seen e.g. in the poster I linked to, which when implemented without FMA requires a "magic" constant and 17 FLOPS, whereas with FMA it only takes 2 FLOPS, as was shown by @bmkessler. So if one implemented the EFT product as a plain multiplication and addition, the algorithms would return incorrect results, as they rely on handling very tiny rounding errors.

"If everything is already doing FMA..." First, I wrote "quite a lot" precisely because I'm not sure if everybody has the FMA instruction so I presume a proper fallback would be necessary. Second, it is also not desirable to automatically replace all a*b+c with FMA, as it may break some algorithms, see e.g. this notice about FMA (search for "THE FMA PROBLEM").

rsc commented 6 years ago

The IBM paper linked by @bmkessler convinced me that this issue is basically the floating-point equivalent of #24813. In short, the trick is that given float64 values x, y, the high word of x*y is given by float64(x*y) and the low word is given by math.FMA(x, y, -float64(x*y)) (aka x*y - float64(x*y) on FMA-enabled hardware). Those two words together are an exact representation of the product, for the same reason that high and low uint64s can be an exact representation of a uint64*uint64 product.

That is, a guaranteed-precision FMA exposes double-width floating-point multiply, the same way #24813 is about exposing double-width integer multiply (and other operations). Also, given #24813's bits.Mul64, a software implementation of math.FMA (for the systems without FMA hardware) would not be many lines of code.

I think we should probably accept this issue.

btracey commented 6 years ago

@TuomLarsen Note that you can already prevent FMA with float64(a*b) + c

ianlancetaylor commented 6 years ago

Proposal accepted -- iant for @golang/proposal-review

smasher164 commented 6 years ago

I want to clarify the architectures that require runtime feature detection. Please correct me if I'm wrong.

Assuming that internal/cpu aggregates the necessary feature-detection code, the FMA procedure would first check at runtime whether the required CPU feature is present.

Failing those checks, it would defer to a software fallback, like freebsd's or musl's. The remaining architectures would use an assembly implementation.

gopherbot commented 6 years ago

Change https://golang.org/cl/127458 mentions this issue: math: add guaranteed-precision FMA intrinsic

gopherbot commented 5 years ago

Change https://golang.org/cl/137156 mentions this issue: cmd/compile: add fma intrinsic for amd64

gopherbot commented 5 years ago

Change https://golang.org/cl/131959 mentions this issue: cmd/compile: introduce generic ssa intrinsic for fused-multiply-add

gopherbot commented 5 years ago

Change https://golang.org/cl/142117 mentions this issue: cmd/compile: add fma intrinsic for arm

smasher164 commented 5 years ago

Can this go in 1.12? All of the CLs have been reviewed except one, https://golang.org/cl/127458. The implementation has been well tested and benchmarked, as well as being compared with alternative implementations.

randall77 commented 5 years ago

Yes, we should get this in. I've +2'd your final CL.

smasher164 commented 4 years ago

The CLs for the hardware implementations on arm64, ppc64, s390x, x86, and arm, as well as the software fallback, have been merged into master. Note that on MIPS the software fallback is used for now, since the required instruction (MADDF.D) is only available on Release 6 processors with double-precision support. I think the MIPS intrinsic should be added at a later time, given demand and feature-detection support (CLs 200579 and 126657).

For those who want to use the software fallback before 1.14 is released, I've extracted it to a separate package: https://github.com/smasher164/fma.

Finally, I want to thank everybody who has patiently guided my effort in this process. I’ve learned a lot while working on this, and I am thankful for the opportunity.

TuomLarsen commented 4 years ago

Thank you very much!

gopherbot commented 4 years ago

Change https://golang.org/cl/205317 mentions this issue: math, cmd/compile: rename Fma to FMA