golang / go

The Go programming language
https://go.dev

spec: allow the use of fused multiply-add floating point instructions #17895

Closed mundaym closed 7 years ago

mundaym commented 7 years ago

Fused multiply-add (FMA) floating point instructions typically provide improved accuracy and performance compared to independent floating point multiply and add instructions. However, they may change the result of such an operation because they omit the rounding step that would normally take place between the multiply and the add.
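
To make the rounding difference concrete, here is a small sketch. The math.FMA call is only a stand-in for the hardware instruction (a library function of that shape is discussed later in this thread; it is not part of this proposal):

package main

import (
    "fmt"
    "math"
)

func main() {
    x := 1 + 0x1p-29 // exactly representable as float64
    y := 1 - 0x1p-29 // exactly representable as float64
    z := -1.0

    // Two roundings: the exact product 1 - 2^-58 is not a valid
    // float64, so it rounds to 1.0 before the add, which then
    // cancels to zero.
    fmt.Println(float64(x*y) + z) // 0

    // One rounding: the exact product flows into the add, so the
    // tiny residue survives.
    fmt.Println(math.FMA(x, y, z)) // -2^-58, about -3.47e-18
}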

This proposal seeks to clarify the guarantees that Go provides as to when rounding to float32 or float64 must be performed so that FMA operations can be safely extracted by SSA rules. I assume that complex{64,128} casts will be lowered to float{32,64} casts for the purposes of this proposal.

The consensus from previous discussions on the subject is that explicit casts should force rounding, as is already specified for constants:

❌ a := float64(x * y) + z  (1)
❌ z += float64(x * y)      (2)

There is also consensus that parentheses should not force rounding. So in the following cases the intermediate rounding stage can be omitted and a FMA used:

✅ a := x * y + z           (3)
✅ a := (x * y) + z         (4)
✅ z += x * y               (5)
✅ z += (x * y)             (6)

It is also proposed that assignments to local variables should not force rounding to take place:

✅ t := x * y; t += z       (7)

I also propose that an assignment to a memory location should force rounding (I lean towards forcing rounding whenever an intermediate result is visible to the program):

❌ *a = x * y; t := *a + z  (8)

(SSA rules could optimize example 8 because they will replace the load from a with a reuse of the result of x * y.)

I think the only real complexity in the implementation is how we plumb the casts from the compiler to the SSA backend so that optimization rules can be blocked as appropriate. I’m not sure if there is a pre-existing mechanism we can use.

See these links for previous discussion of this proposal on golang-dev:

https://groups.google.com/d/topic/golang-dev/qvOqcmAkKnA/discussion
https://groups.google.com/d/topic/golang-dev/cVnE1K08Aks/discussion

rsc commented 7 years ago

Note: The important part of this outcome is the precedent, or more general rule, that it establishes for avoiding compiler optimization of float32/float64 arithmetic.

tombergan commented 7 years ago

What is a "local variable", what is a "memory location", and what does it mean for an intermediate result to be "visible to the program"? These may seem like simple questions, but in practice the distinction is blurry and depends on choices made by the compiler. Examples:

package example

import (
  "fmt"
  "sync"
)

var global float64
var globalPtr *float64

// z is changed after it escapes. Is rounding forced?
// Note that z is a "local variable", but also a "memory location", since it escapes.
func case1(x, y float64) {
  var z float64
  globalPtr = &z
  z = x + y
}

// z is changed before it escapes. Is rounding forced before or after Println?
func case2(x, y float64) {
  var z float64
  z = x + y
  fmt.Println(z)
  globalPtr = &z
}

// The first store into `global` may not be visible to other goroutines
// since there is no memory barrier between the two stores. The compiler
// may merge these stores into one. Is rounding forced on (x + y), or
// just (x + y + 2)?
func case3(x, y float64) {
  global = x + y
  global += 2
}

// A smart compiler might realize that z doesn't escape, even though its
// address is taken implicitly by the closure. Is rounding forced?
func case4(x, y float64) float64 {
  var z float64
  var wg sync.WaitGroup
  wg.Add(1)
  go func() {
    defer wg.Done()
    z = x + y
  }()
  wg.Wait()
  return z
}

I think I agree with rsc's suggestion from the second linked thread: "I would lean toward allowing FMA aggressively except for an explicit conversion."

mundaym commented 7 years ago

What is a "local variable", what is a "memory location", and what does it mean for an intermediate result to be "visible to the program"? These may seem like simple questions, but in practice the distinction is blurry and depends on choices made by the compiler.

Thanks. You're right of course, I was being imprecise. These properties aren't necessarily clear in code. They are somewhat clearer in the backend, but that might change as the compiler gets cleverer.

I think I agree with rsc's suggestion from the second linked thread: "I would lean toward allowing FMA aggressively except for an explicit conversion."

I think I also agree that would be a good rule, but I'm starting to wonder whether a strict interpretation of the spec already implies that assignment should force rounding. The spec defines float32 and float64 as:

float32     the set of all IEEE-754 32-bit floating-point numbers
float64     the set of all IEEE-754 64-bit floating-point numbers

Since the intermediate result of a fused multiply-add may not be a valid IEEE-754 32- or 64-bit floating-point number, this definition would seem to suggest that an assignment should prevent the optimization. However, gri's and rsc's comments on the thread seem to imply that they don't agree (and they obviously know this much better than I do).

Assignments to variables currently block full precision constant propagation AFAICT, so forcing rounding at assignments would be in line with that behavior. Could the output of the following program change in future?

package main

import "fmt"

const x = 1.0000000000000001

func main() {
    x0 := x * 10
    fmt.Println(x0) // prints 10.000000000000002
    x1 := x
    x1 *= 10
    fmt.Println(x1) // prints 10, but could be 10.000000000000002
}

(https://play.golang.org/p/6OMMzRq0pr)

rsc commented 7 years ago

It sounds like we agree that a float64 conversion should be an explicit signal that a rounded float64 should be materialized, so that for example float64(x*y)+z cannot use an FMA, but x*y+z and (x*y)+z can.

The question raised in @mundaym's latest comment is whether we're sure about case (7) above: if the code does t := x*y; t += z, should we allow t to never be materialized as a float64, so that for example FMA can be used in that case? The argument in favor of allowing optimization here is that generated code or code transformations might introduce temporaries, and we probably don't want that to have optimization effects. The argument against is that there's an explicit variable of type float64. On balance it seems that having a very explicit signal, as in the float64 conversion, would be best, so I would lean toward keeping case (7) allowed to use FMA.
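
For a concrete sketch of the lowering argument (illustrative only; these helper names are not part of the proposal):

package main

import "fmt"

// direct is case (3): the compiler may contract x*y + z into one FMA.
func direct(x, y, z float64) float64 {
    return x*y + z
}

// lowered is what a mechanical source-to-source rewrite of direct might
// produce. If the assignment to t forced rounding, the rewrite would
// change semantics; under the rule favored here it stays FMA-eligible.
func lowered(x, y, z float64) float64 {
    t := x * y
    return t + z
}

func main() {
    x, y, z := 1+0x1p-29, 1-0x1p-29, -1.0
    // Under that rule, both forms are equally FMA-eligible, so the
    // rewrite cannot change which results are allowed.
    fmt.Println(direct(x, y, z), lowered(x, y, z))
}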

We don't know too much about what other languages do here.

We know Fortran uses parentheses, although that only helps for FMA because * has higher precedence than +, so the parens are optional. We'd rather not overload parens this way.

We don't think the C or C++ languages give easy control over this (possibly writing to a volatile and reading it back?).

What about Java? How do they provide access to FMA?

/cc @MichaelTJones for thoughts or wisdom about any of this.

bradfitz commented 7 years ago

Looks like Java is making it explicit with library additions: https://www.mail-archive.com/core-libs-dev@openjdk.java.net/msg39320.html

Commit: http://cr.openjdk.java.net/~darcy/4851642.0/

ianlancetaylor commented 7 years ago

When using GCC the -ffloat-store option can be used with C/C++ to force rounding when a floating-point value is assigned to a variable.

MichaelTJones commented 7 years ago

My sense (anecdotal, but lots of anecdata) is that there are three user postures here, one of which touches on the question at hand:

Oblivious--Fused MulAdd Everywhere

Normal case: aggressive QMAF (IBM America/ etc.) is good for most everything: faster, with greater precision, so pervasive use with the greatest reach possible is fine.

Cautious--No Fused MulAdd if I disable it (sorry, a new compiler option)

Users who want their results to match those from another machine, examples in a book, data in a database, etc. This may or may not be important to Go users; I don't know.

Explicit—An absolute way to force not doing the fused MultiplyAdd in a specific hand-coded careful expression. This is the user case I was speaking for previously. It is a small group, but it includes people like Professor Kahan, and all the math library authors who need to track the difference between the two computation modes when evaluating basic functions. Both IBM and Intel/HP figured out how to get +/- 1 ULP evaluation of intrinsic functions (sin, cos, log, exp, …) using QMAF-style math, but it needs to be possible to keep track of the error between the one-rounding and the two-rounding cases (see the sketch below). The proposal for float64(x*y)+z to mean two ops, each rounded, and to prevent the fused x*y+z is just fine for this purpose. Everywhere else is just like the oblivious case.
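
One concrete instance of that error tracking: with a fused operation, the exact rounding error of a product can be recovered in one step. A minimal sketch, again using a math.FMA-style function purely as a stand-in for the hardware instruction:

package main

import (
    "fmt"
    "math"
)

// twoProduct returns p and err such that x*y == p + err exactly
// (barring overflow). The fused operation computes x*y - p without
// an intermediate rounding, which is exactly the error of p.
func twoProduct(x, y float64) (p, err float64) {
    p = x * y
    err = math.FMA(x, y, -p)
    return p, err
}

func main() {
    p, err := twoProduct(1+0x1p-29, 1-0x1p-29)
    fmt.Println(p, err) // 1 and -2^-58: the product is exactly 1 - 2^-58
}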

Michael


griesemer commented 7 years ago

@MichaelTJones There's also the opposite explicit form: Have a mechanism (intrinsified function call) to specify when to use FMA (and never, otherwise). Go already allows control over when not to use FMA (if available) by forcing an explicit conversion; e.g. float64(x*y) + z (no FMA) vs x*y + z (possibly FMA).

rsc commented 7 years ago

@mundaym, I think everyone agrees about cases 1, 2, 3, 4, 5, 6.

We are less sure about 7 and 8, which may be the same case for a sufficiently clever compiler. The argument for "no" on 7 and 8 is that the assignment implies a fixed storage format. The argument for "yes" on 7 is that otherwise a source-to-source lowering of a program would inhibit certain floating-point optimizations, and also more generally source-to-source translations will have to consider whether they are changing program semantics by combining or splitting expressions. Another argument for "yes" on 7 is that it results in just one way to disable the optimization, instead of two.

I propose that we tentatively assume "yes" on 7 and 8, with the understanding that we can back down from that if a strong real-world example arrives showing that we've made a mistake.

mundaym commented 7 years ago

I propose that we tentatively assume "yes" on 7 and 8, with the understanding that we can back down from that if a strong real-world example arrives showing that we've made a mistake.

Thanks, that sounds good to me. I'll prototype it for ppc64{,le} and s390x.

ianlancetaylor commented 7 years ago

@rsc Can you clarify what "yes" and "no" mean in your comment?

rsc commented 7 years ago

"Yes" means check-mark above (FMA optimization allowed here), and "no" means X above (FMA optimization not allowed here).

rsc commented 7 years ago

On hold until prototype arrives. We still need to figure out wording for the spec.

gopherbot commented 7 years ago

CL https://golang.org/cl/36963 mentions this issue.

mundaym commented 7 years ago

Complex multiplication may be implemented using fused multiply-add instructions. There is no obvious way to prevent them from being emitted, since the multiplication is a single operation in the source. I think this is fine, but I thought I'd note it here.
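
For concreteness, a sketch of the kind of lowering a backend might choose, with math.FMA again standing in for the fused instruction:

package main

import (
    "fmt"
    "math"
)

// mulFused mirrors one possible lowering of complex128 multiplication:
// each component is a fused multiply-add, so there is no rounding
// between the outer multiply and the add (the inner multiply is still
// rounded on its own).
func mulFused(a, b complex128) complex128 {
    ar, ai := real(a), imag(a)
    br, bi := real(b), imag(b)
    re := math.FMA(ar, br, -ai*bi) // ar*br - ai*bi
    im := math.FMA(ar, bi, ai*br)  // ar*bi + ai*br
    return complex(re, im)
}

func main() {
    a, b := complex(1.0, 2.0), complex(3.0, 4.0)
    fmt.Println(mulFused(a, b)) // (-5+10i)
    fmt.Println(a * b)          // (-5+10i); for inexact inputs the two may differ in the last bit
}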

rsc commented 7 years ago

Thanks. I agree that's probably fine, certainly until it comes up in practice.

gopherbot commented 7 years ago

CL https://golang.org/cl/38095 mentions this issue.

laboger commented 7 years ago

Improvements on ppc64le for CL 38095:

poly1305:
benchmark                 old ns/op   new ns/op   delta
Benchmark64-16            172         151         -12.21%
Benchmark1K-16            1828        1523        -16.68%
Benchmark64Unaligned-16   172         151         -12.21%
Benchmark1KUnaligned-16   1827        1523        -16.64%

math:
benchmark                 old ns/op   new ns/op   delta
BenchmarkAcos-16          43.9        39.9        -9.11%
BenchmarkAcosh-16         57.0        45.8        -19.65%
BenchmarkAsin-16          35.8        33.0        -7.82%
BenchmarkAsinh-16         68.6        60.8        -11.37%
BenchmarkAtan-16          19.8        16.2        -18.18%
BenchmarkAtanh-16         65.5        57.5        -12.21%
BenchmarkAtan2-16         45.4        34.2        -24.67%
BenchmarkGamma-16         37.6        26.0        -30.85%
BenchmarkLgamma-16        40.0        28.2        -29.50%
BenchmarkLog1p-16         35.1        29.1        -17.09%
BenchmarkSin-16           22.7        18.4        -18.94%
BenchmarkSincos-16        31.7        23.7        -25.24%
BenchmarkSinh-16          146         131         -10.27%
BenchmarkY0-16            130         107         -17.69%
BenchmarkY1-16            127         107         -15.75%
BenchmarkYn-16            278         235         -15.47%

rsc commented 7 years ago

Thanks for the implementation. We experimented with the ppc64 compiler and confirmed that this behaves the way we expected from https://github.com/golang/go/issues/17895#issuecomment-264991901.

@griesemer will send a spec CL.

gopherbot commented 7 years ago

CL https://golang.org/cl/40391 mentions this issue.

mdempsky commented 7 years ago

I looked through the two linked discussions, and I didn't spot any discussion about supporting FMA via the standard library. For example, we could add

// FMA returns a*b+c.
func FMA(a, b, c float64) float64

to package math, and optimize it via compiler intrinsics like Sqrt.
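
For example, a dot product written against such a function would request the fused behavior explicitly (hypothetical usage; the function does not exist yet):

package main

import (
    "fmt"
    "math"
)

// dot accumulates with the proposed math.FMA, so each product feeds
// the running sum exactly, with a single rounding per step.
func dot(xs, ys []float64) float64 {
    var acc float64
    for i := range xs {
        acc = math.FMA(xs[i], ys[i], acc)
    }
    return acc
}

func main() {
    fmt.Println(dot([]float64{1, 2, 3}, []float64{4, 5, 6})) // 32
}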

Considering this is the approach C, C++, and Java have already taken, I think we should at least discuss it.

mdempsky commented 7 years ago

I'm not a fan of the current language spec proposal because:

  1. It means assignment to a float64 variable and explicit conversion to float64 now have different semantics; intuitively, I'd expect them to always mean the same thing.
  2. It means redundant conversions now can have side effects on a program's correctness. Tools like mdempsky/unconvert can no longer assume floating-point conversions are redundant.

btracey commented 7 years ago

From what I understand, both of your points are already true in the language. From the spec:

When converting an integer or floating-point number to a floating-point type,
or a complex number to another complex type, the result value is rounded to
the precision specified by the destination type. For instance, the value of a variable
x of type float32 may be stored using additional precision beyond that of an IEEE-754
32-bit number, but float32(x) represents the result of rounding x's value to 32-bit
precision. Similarly, x + 0.1 may use more than 32 bits of precision, but float32(x + 0.1) does not.

This means a statement like y = x + 0.1 + 0.6 can have a different result than y = float32(x+0.1) + 0.6, since the second one forces intermediate rounding.

mdempsky commented 7 years ago

@btracey Hm, I think you're right. I withdraw my objections.

MichaelTJones commented 7 years ago

One plausible tweak might be to demand that the compiler use FMA whenever the hardware is available and the spec allows it.


ianlancetaylor commented 7 years ago

Although it's true that C/C++ provide an fma function, it is also true that optimizing C/C++ compilers generate fused multiply-add instructions when reasonable. When using GCC you can control this using the (processor-specific) -mfma and -mno-fma options. If you use the (processor-independent) -fexcess-precision=standard option, then fused multiply-add may be used except when there is a cast to a specific type (like this proposal) or an assignment to a variable of specific type. The (processor-independent) -ffloat-store option is similar, but permits fused multiply-add across a cast but not an assignment to a variable. These options are in turn affected by other options like -std= and -ffast-math.

TuomLarsen commented 7 years ago

Would there be a way to "force" the use of FMA, instead of just "allow"? I mean, would there be a way to make sure that a*b+c rounds just once? Even if that meant to use software emulation on platforms with no FMA instructions. IMHO, the benefit of using FMA is precision, not just speed.

rsc commented 7 years ago

@TuomLarsen No, we're not providing a way to do that here. Do other languages do that?

TuomLarsen commented 7 years ago

@rsc I apologise, I now realise it might be out of scope for this proposal: what I meant was basically support for a simple math.FMA function. I imagined that math.FMA would be symmetrical to the compiler emitting an FMA instruction (mandatory vs. optional). In any case, are there plans for such a function? See e.g. http://en.cppreference.com/w/c/numeric/math/fma

josharian commented 7 years ago

@TuomLarsen please file a new issue to propose/discuss. Thanks!

rsc commented 7 years ago

With ppc64 doing this and the spec written, I think this is done.

agnivade commented 7 years ago

Hi everyone,

I was pretty excited to try out the performance benefits of the FMA operation in 1.9, but it seems that I am still getting the same performance. According to @laboger's comment there are improvements to the math functions on ppc64, and I was under the impression that I would be able to reap similar benefits on amd64 too.

Here is the code:

package stdtest

import (
    "math"
    "testing"
)

func BenchmarkAtan2(b *testing.B) {
    for n := 0; n < b.N; n++ {
        _ = math.Atan2(480.0, 123.0) * 180 / math.Pi
    }
}

Under 1.8.1

go test  -bench=. -benchmem .
BenchmarkAtan2-4    100000000           22.9 ns/op         0 B/op          0 allocs/op
PASS
ok      stdtest 2.318s

Under 1.9rc2

go1.9rc2 test  -bench=. -benchmem .
goos: linux
goarch: amd64
pkg: stdtest
BenchmarkAtan2-4    50000000            22.8 ns/op         0 B/op          0 allocs/op
PASS

I have verified that my processor supports fma (cat /proc/cpuinfo | grep fma).

Is this expected? Or am I doing something wrong?

bradfitz commented 7 years ago

@agnivade, if you run git grep FMADD src/cmd/compile in your $GOROOT, you'll see there's no FMADD support for amd64 yet. Only s390x and ppc64.

agnivade commented 7 years ago

Thanks @bradfitz! Yes, I was expecting something like that. The tone of this announcement made it seem like it's there for all architectures, but the commits seemed to show only s390x and ppc64, hence I was a little confused.

Is adding FMADD to amd64 anywhere on the upcoming roadmap? Thanks. (I have some expensive math operations on a hot path which will benefit tremendously from it :wink:)

bradfitz commented 7 years ago

@agnivade, closed bugs aren't where we conventionally have discussions. But I found #8037 for tracking x86 FMA. You can watch that bug.

agnivade commented 7 years ago

Thanks a lot ! Appreciate the help :)