j3-fortran / fortran_proposals

Proposals for the Fortran Standard Committee

A method to control FMA execution #310

Open marshallward opened 11 months ago

marshallward commented 11 months ago

We require bit reproducibility of our models, which in turn requires a predictable order of operations for our floating point arithmetic. For the most part, parentheses are sufficient to define the order of operations, but this becomes more challenging when fused multiply-adds (FMAs) are enabled.

FMA operations are almost ubiquitous now. When platforms report their peak performance, the figures rely heavily on FMAs. Every modern platform accessible to me includes FMA as an atomic, vectorizable operation. For the foreseeable future, it would seem a fundamental operation, on par with addition or multiplication.

The challenge for us is the difference in accuracy. If rn() is the rounding function, then an FMA increases the accuracy of the operation. For the expression f = a*b + c,

f(no FMA) = rn(rn(a*b) + c)
f(FMA) = rn(a*b + c)

That is, FMA eliminates one of the rounding operations, thereby changing its bit representation.
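
To make this concrete, here is a minimal sketch that compares the two roundings directly. It uses the IEEE_FMA intrinsic from F'202X (discussed further below) to force the single-rounding form, so it assumes a compiler that already provides it; whether the two bit patterns actually differ depends on the operands.

program fma_rounding
  use, intrinsic :: iso_fortran_env, only: real64, int64
  use, intrinsic :: ieee_arithmetic, only: ieee_fma
  implicit none
  real(real64) :: a, b, c

  a = 4.2_real64; b = 6.7_real64; c = 1.868_real64

  ! rn(rn(a*b) + c): parentheses force the product to be rounded first
  print '(a, z16.16)', 'two roundings: ', transfer((a*b) + c, 1_int64)

  ! rn(a*b + c): a single rounding of the exact product-sum
  print '(a, z16.16)', 'one rounding:  ', transfer(ieee_fma(a, b, c), 1_int64)
end program fma_rounding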

For simple a*b + c expressions, the FMA is unambiguous. But consider the following expression.

f = a*b + c*d

With FMAs disabled, the order of operations is unambiguous: Compute a*b and c*d independently, and sum the results. That is,

f = rn(rn(a*b) + rn(c*d))

However, if FMAs are enabled, there are two possible results:

f1 = rn(a*b + rn(c*d))
f2 = rn(rn(a*b) + c*d)

But an expression like a*b + c*d offers no apparent way to select either operation.


I would like to propose that some kind of parsing rules be considered to make this an unambiguous operation.

I am not entirely sure how this could be done - which is why I am posting this as a discussion - but I would start with the following:

f = (a*b) + (c*d)   ! No FMA: Compute a*b and c*d, then sum the results
f = a*b + (c*d)     ! FMA: Compute c*d, then optionally apply FMA
f = a*b + c*d       ! FMA: order is ambiguous, either FMA is permitted.

I have tested this in a few compilers, and ~~only GNU appears to do something like this~~ (Edit: GNU and Intel can both do this, with appropriate flags). The others seem to ignore the parentheses and simply apply the FMA in all cases. (Compiled with -O0 but FMA enabled; perhaps -O gets overridden here...? I can gather details if there is interest.)

I don't know how far one can carry this; perhaps there are some problems in more complex expressions. But I would like to hear any thoughts on this proposal, especially from compiler developers.

klausler commented 11 months ago

> I can gather details if there is interest

Yes, please. The standard requires that "any expression in parentheses shall be treated as a data entity" (10.1.8 in F'202X).

Also, sometimes one really doesn't want FMA: diffSquares = x**2 - y**2 may be more accurate when each of the squares is rounded identically.
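
As an illustration of that failure mode, here is a sketch that uses the F'202X IEEE_FMA intrinsic to stand in for what a contracting compiler might emit (the operand values are arbitrary):

program diff_squares
  use, intrinsic :: iso_fortran_env, only: real64
  use, intrinsic :: ieee_arithmetic, only: ieee_fma
  implicit none
  real(real64) :: x, y

  x = 1.0_real64/3.0_real64
  y = x                          ! identical operands, so x**2 - y**2 "should" be zero

  ! Both squares rounded identically: the difference is exactly zero.
  print *, (x*x) - (y*y)

  ! A contracted x**2 - y**2 computes rn(x*x - rn(y*y)), which recovers
  ! the rounding error of y*y: tiny, but not zero.
  print *, ieee_fma(x, x, -(y*y))
end program diff_squares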

marshallward commented 11 months ago

Thanks @klausler. If parenthesized subexpressions must be evaluated separately, then perhaps that's the best news.

I created a repository to test this: https://github.com/marshallward/fma_test (Admittedly a bit ad hoc; it wasn't really planned, but it stumbled onto a difference.)
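
For reference, a minimal sketch of the kind of test involved (hypothetical; not the actual contents of that repository):

program fma_order_test
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  real(real64) :: a, b, c, d, f

  a = 4.2_real64; b = 6.7_real64; c = 0.4_real64; d = 4.67_real64

  ! Print the value and raw bit pattern for each parenthesization.
  f = a*b + c*d
  print '(a, f21.16, a, z16.16, a)', ' a*b  +  c*d : ', f, ' (', transfer(f, 1_int64), ')'

  f = (a*b) + (c*d)
  print '(a, f21.16, a, z16.16, a)', '(a*b) + (c*d): ', f, ' (', transfer(f, 1_int64), ')'
end program fma_order_test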

On GNU 13.1.1, -O2 -mfma, I get this output:

 a*b  +  c*d : 30.0079999999999991 (403E020C49BA5E35)
 c*d  +  a*b : 30.0079999999999956 (403E020C49BA5E34)
(a*b) + (c*d): 30.0079999999999991 (403E020C49BA5E35)
(c*d) + (a*b): 30.0079999999999991 (403E020C49BA5E35)
 a*b  + (c*d): 30.0079999999999991 (403E020C49BA5E35)
 c*d  + (a*b): 30.0079999999999956 (403E020C49BA5E34)
(a*b) +  c*d : 30.0079999999999956 (403E020C49BA5E34)
(c*d) +  a*b : 30.0079999999999991 (403E020C49BA5E35)

Inspecting the asm shows that (a*b) + (c*d) does not use FMA, whereas the others do. Parentheses work more or less as you describe: the parenthesized product is calculated first, then passed to the FMA instruction. (The first and third cases have identical asm, matching the numbers above.)

Intel 2021.7.1, ~~-O0 -mfma~~ -O2 -fma (but see below; these flags are incorrect):

 a*b  +  c*d : 30.0079999999999991 (403E020C49BA5E35)
 c*d  +  a*b : 30.0079999999999956 (403E020C49BA5E34)
(a*b) + (c*d): 30.0079999999999991 (403E020C49BA5E35)
(c*d) + (a*b): 30.0079999999999956 (403E020C49BA5E34)
 a*b  + (c*d): 30.0079999999999991 (403E020C49BA5E35)
 c*d  + (a*b): 30.0079999999999956 (403E020C49BA5E34)
(a*b) +  c*d : 30.0079999999999991 (403E020C49BA5E35)
(c*d) +  a*b : 30.0079999999999956 (403E020C49BA5E34)

The functions for each case produce identical asm output:

vmovsd    (%rdx), %xmm0                                 #16.8
vmulsd    (%rcx), %xmm0, %xmm0                          #16.14
vmovsd    (%rsi), %xmm1                                 #16.3
vfmadd231sd (%rdi), %xmm1, %xmm0                        #17.1

For Nvidia 22.5, -O0 -Mfma, I get a similar result to Intel (albeit with the opposite ordering from Intel and GNU):

 a*b  +  c*d : 30.0079999999999956 (403E020C49BA5E34)
 c*d  +  a*b : 30.0079999999999991 (403E020C49BA5E35)
(a*b) + (c*d): 30.0079999999999956 (403E020C49BA5E34)
(c*d) + (a*b): 30.0079999999999991 (403E020C49BA5E35)
 a*b  + (c*d): 30.0079999999999956 (403E020C49BA5E34)
 c*d  + (a*b): 30.0079999999999991 (403E020C49BA5E35)
(a*b) +  c*d : 30.0079999999999956 (403E020C49BA5E34)
(c*d) +  a*b : 30.0079999999999991 (403E020C49BA5E35)

The generated code (LLVM IR in this case) is again identical for all cases:

%0 = bitcast i64* %d to double*, !dbg !35
%1 = load double, double*  %0, align 8, !tbaa !41, !dbg !35
%2 = bitcast i64* %c to double*, !dbg !35
%3 = load double, double*  %2, align 8, !tbaa !43, !dbg !35
%4 = bitcast i64* %b to double*, !dbg !35
%5 = load double, double*  %4, align 8, !tbaa !45, !dbg !35
%6 = bitcast i64* %a to double*, !dbg !35
%7 = load double, double*  %6, align 8, !tbaa !47, !dbg !35
%8 = fmul double  %5,  %7, !dbg !35
%9 = call double @llvm.fma.f64 (double  %1, double  %3, double  %8) mustprogress, !dbg !35

GNU is doing exactly what we want here. If this just means that I have to take the discussions to the vendors, then that would be a lot simpler.

wyphan commented 11 months ago

While it's useful to have some way to control FMA code generation in Fortran, IMHO this should be left to the Fortran processor (compiler) side, for at least the following reasons:

  1. The way it's done in other languages (e.g., C) involves #include-ing system header files, resulting in non-portable code. Historically, Fortran was designed to encourage portability across different system architectures (e.g., across VAX and DEC machines in the early computing days).

  2. Not all microarchitectures have FMA instructions in their ISA (e.g., older x86 and Arm hardware). This could be a burden on compiler engineers, who would need to provide a fallback mechanism and warning messages when FMA instructions are not available for the chosen microarchitecture.

  3. As shown in @marshallward's experiments above, FMA code generation is currently toggled by a compiler flag, and all of the experiments were performed on the x86_64 architecture. There may currently be more than one way to toggle this behavior (e.g., using -O3 or -march and -mtune).

  4. As @klausler mentioned above, depending on the use case, using FMA instructions might not always be suitable or beneficial since it could change the accuracy of the computation.

  5. The mechanism to explicitly request FMA code needs to be specified clearly in the proposal. For instance, would it be new intrinsic functions? What about dealing with KIND parameters?

  6. Finally, we need to assess the "demand" from Fortran users and programmers on this feature. Perhaps a survey on the Fortran discourse would be useful to gauge the interest in the community.

Of course, someone also needs to draft the proposal before the language committee can proceed on discussing this.

marshallward commented 11 months ago

@wyphan Thanks very much for giving me a chance to clarify this point. We are not asking for an explicit FMA instruction, nor are we asking for the ability to direct an expression to use FMAs. In fact, this is the opposite of what we want. We only want clarity on the order of rounding.

To summarize the comments above: there is no issue with a processor deciding to use an FMA when computing a*b + c*d. But we need some means to control which FMA it may (or may not) use. As suggested by @klausler, this may already be achievable in a standards-compliant processor if we use parentheses. In my (admittedly small) sample of compilers, it appears that only GNU respects this order of operations.

I will respond to your points below, but I believe we are in closer agreement than perhaps suggested by your comment.

  1. I believe you are referring to something like _mm256_fmadd_ps() and friends in <immintrin.h>, which as you point out is not part of the C standard library. As mentioned above, this is not what we want.

  2. Complete agreement, we do not want to write explicit FMA instructions.

  3. Complete agreement, we want the compiler/flags to dictate the inclusion of FMAs. The purpose of the example above is to show that many compilers are using FMAs regardless of construction.

  4. Agreement with respect to @klausler's example. This is in fact the very problem I am trying to address. If one is evaluating x**2 - y**2, can it be written so as to discourage a compiler from using FMA, e.g. as (x**2) - (y**2)?

  5. Complete agreement, though the details of a proposal are a later issue.

  6. Complete agreement, I hope that this post is the beginning of that discussion, or perhaps a continuation of one if it has already been discussed.

Again, to be clear: I am not suggesting that an "FMA" operator be added to the standard. We only want an explicit method to clarify the order of rounding. Thank you very much for responding, since it gives me a chance to clarify this point.

marshallward commented 11 months ago

I've gone back to check this more carefully, and Intel appears to do what I expect if I use -assume protect_parens. (I also incorrectly said that Intel was using -O0, but -O2 is required to enable FMA).

It seems that this is not the issue that I thought it was, so I think that I will close this issue and raise it with Nvidia. Thanks to both @klausler and @wyphan for their feedback.

marshallward commented 10 months ago

I've updated the repository so that it is a little more structured.

https://github.com/marshallward/fma-op-order

I will try to continue tracking the issue here as I learn more.

marshallward commented 10 months ago

After some internal discussion, we've decided that it would be beneficial to continue pursuing options that permit more predictable use of FMAs.

I was so caught up in the very real problems with expressions like a*b + c*d that I lost sight of the original ambiguity around a*b + c. We are still left uncertain about its execution:

s1 = (a*b) + c   ! Two operations, two roundings
s2 = a*b + c     ! Single operation, one rounding

As long as we do not know what the compiler would do, we are left uncertain about the outcome.

To echo the concerns by @wyphan, I do not think that a dedicated FMA function or operator should be introduced. It would deviate too far from the "formula translation" foundations of Fortran.

But I wonder if some kind of rules could be considered, where an expression would always apply an FMA where possible (assuming the processor permits it) and (for example) would always be applied left-to-right. And if the processor does not support FMA, then it would fall back to (a*b) + c.
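
In the rn() notation from above, such a rule for f = a*b + c*d might read as follows (a sketch of the proposed convention, not settled semantics):

f(FMA)    = rn(a*b + rn(c*d))        ! leftmost product fused, left-to-right
f(no FMA) = rn(rn(a*b) + rn(c*d))    ! fallback on processors without FMA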

For example, GNU and Nvidia currently apply the FMA to a*b + c*d in opposite orders. The language could instead require that it be computed a*b + (c*d), rather than (a*b) + c*d.

Obviously this could cripple performance in some cases, but so can parentheses. Warnings could inform users about a suboptimal ordering, and answer-changing optimizations can always be turned on when appropriate.

I admit that this could be seen as far too harsh from a compiler developer's perspective, and perhaps even pointless from most users' perspectives. But it's the sort of thing that could allow me to convince operational forecasters that FMAs are a valuable optimization.

klausler commented 10 months ago

> For example, GNU and Nvidia currently apply the FMA to a*b + c*d in opposite orders. The language could instead require that it be computed a*b + (c*d), rather than (a*b) + c*d.

Why not use one set of parentheses to force a single FMA opportunity here?

marshallward commented 10 months ago

Sorry, I was trying to show how an expression like that might be affected by some kind of left-to-right rule. Parentheses would override such a rule. But perhaps that is a distraction from the discussion.

certik commented 10 months ago

These simple cases can probably be handled with some rules. But how about the following cases:

  • BLAS (where my understanding is that the order of operations is not defined)
  • parallel loops (do concurrent) where the loop order is not defined
  • MPI sum (I think the order is not defined?)
  • vectorization, say, for a sum: the loop will be tiled differently based on your platform vector register width

klausler commented 10 months ago

> These simple cases can probably be handled with some rules. But how about the following cases:
>
>   • BLAS (where my understanding is that the order of operations is not defined)
>   • parallel loops (do concurrent) where the loop order is not defined
>   • MPI sum (I think the order is not defined?)
>   • vectorization, say, for a sum: the loop will be tiled differently based on your platform vector register width

FMA is a concern only for expressions involving addition and subtraction of real multiplications, and this specific issue deals with such expressions that have multiple multiplications. How does it apply to any of the situations that you listed?

marshallward commented 10 months ago

To follow on from @klausler: the issues described above can be circumvented in various ways. We are focused on how to make a*b + c deterministic on systems which include FMAs.

We can turn the nondeterministic a + b + c into its deterministic forms, (a + b) + c or a + (b + c). We can likewise turn the nondeterministic a*b + c into one of its deterministic forms, (a*b) + c, but not into its FMA form; the current syntax does not even provide a way to write it. It would be nice to write this expression and know that it will be computed by FMA rather than as (a*b) + c.
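
Laid out side by side (the fma(a,b,c) spelling here is just notation for the fused form, not proposed syntax):

a + b + c      ! nondeterministic association
(a + b) + c    ! deterministic
a + (b + c)    ! deterministic
a*b + c        ! on FMA hardware: fma(a,b,c) or add(mul(a,b),c)
(a*b) + c      ! deterministic, but forgoes the FMA
fma(a,b,c)     ! the fused form: no way to request it in current syntax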

I am hoping to get us to a place where we can use FMAs in our forecasts, but unfortunately the current state is to simply disable them.

(The particular example a*b + c*d above was the catalyst for this discussion because we realized that we lost commutativity in one of our FMA-enabled runs, and reinforced our need to disable them.)

klausler commented 10 months ago

F'202X, coming to your local compiler sometime in the next few years, has IEEE_FMA() to force the use of FMA, but it will (probably? the standard doesn't say) fail if the target processor doesn't support FMA operations. If you want explicit syntax for FMA, here it is.

a*b+c should compile to FMA when the target CPU supports them and they are not disabled on the compiler command line. When part of a larger expression, wrapping it in parentheses to indicate how FMAs should be grouped should be effective.
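
For instance, a sketch of a runtime check along those lines (availability of IEEE_FMA and behavior on FMA-less targets will vary by compiler):

program check_contraction
  use, intrinsic :: iso_fortran_env, only: real64
  use, intrinsic :: ieee_arithmetic, only: ieee_fma
  implicit none
  real(real64) :: a, b, c, f

  a = 4.2_real64; b = 6.7_real64; c = 1.868_real64
  f = a*b + c

  ! Compare against the single-rounding form to see what was generated.
  ! (For some operands the two forms coincide, so this is not conclusive.)
  if (f == ieee_fma(a, b, c)) then
     print *, 'a*b + c matches the single FMA: rn(a*b + c)'
  else
     print *, 'a*b + c used two roundings: rn(rn(a*b) + c)'
  end if
end program check_contraction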

certik commented 10 months ago

@klausler to answer your question:

> FMA is a concern only for expressions involving addition and subtraction of real multiplications, and this specific issue deals with such expressions that have multiple multiplications. How does it apply to any of the situations that you listed?

I am reacting to the requirement from the very top of this issue:

> We require bit reproducibility of our models, which in turn requires a predictable order of operations for our floating point arithmetic.

And the list I provided shows all the ways that bit reproducibility gets broken (based on my understanding), and I am asking how those can be addressed to recover bit reproducibility. For example, vectorization: you need to choose a different width for different platforms (and thus get different bit results), or you can disable vectorization altogether (and then you can get predictable bit results).

If the goal is not to address any of the issues that I listed, then I don't see the point of trying to get fma exactly right, since that alone can't make any real-world code that uses things from my list bit reproducible.

So if you can first motivate the big picture and the main goals of this "bit reproducibility" effort, then it will help me understand how to solve these little steps, such as fma.

marshallward commented 10 months ago

I will preface my reply to say that this is already a requirement for us, and one which is already satisfied by every operational weather and climate model which I have used. So this isn't an aspirational or hypothetical thing.

To answer your earlier question @certik , we simply don't allow most of the operations you suggest. This isn't the place to get into details, but we either use them in situations which are reproducible, or implement alternatives when they are not.

(I presented on some of this at a meeting which we both attended; the presentation is here.)

Vectorization is a challenge, one on par with FMAs, and you are right to raise it as another example. I don't believe we have enough control to guarantee reproducibility. For now, let me just say that we do not take full advantage of vectorization. But this would deserve a separate issue.

If you are asking why bit reproducibility is a concern, then the short answer is that we need to replicate our forecasts - even those containing known errors - and also need them as reference solutions for new parameterizations which may be introduced after the runs are completed. (Output is stored at a lower time resolution and is often insufficient for analysis.)

I am raising FMAs because they threatened one of the few things left to us, commutativity, and because I felt that I could articulate the problem. I also worry that we will simply disable them and lose out on significant performance. I'm simply trying to avoid that scenario if at all possible.

certik commented 10 months ago

> If you are asking why bit reproducibility is a concern,

No, I think bit reproducibility is very useful for the reasons you stated, to catch bugs and ensure that your implementation agrees with reference results. Once it agrees, then you can always enable all optimizations that will change bits, in exchange for performance.

Btw, when I test LFortran against GFortran to ensure that we can compile something correctly, I also require bit-to-bit reproducibility; it has already caught so many bugs, and they are easy to catch (if any bit disagrees after an operation, it's a bug).

How do you tackle MPI decomposition? Have you figured out a way to do it that is always bit-to-bit reproducible for a given N MPI ranks (or even gives the same bits for any N)?

Overall, defining or figuring out how to make compilers execute a large program exactly, delivering bit-to-bit exact results, would be helpful.

So yes, I think the solution using parentheses should work: when you write x = (a*(d*f+e)+b), this would be converted to two FMAs: x = fma(a, fma(d, f, e), b). Compilers could implement options to enforce this: for example, --enforce-fma would convert (a*(d*f+e)+b) to fma(a, fma(d, f, e), b), and --no-fma would convert it to add(mul(a, add(mul(d, f), e)), b), or something like that.
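
A sketch of that mapping in Fortran itself, writing the fused form explicitly with IEEE_FMA (the --enforce-fma and --no-fma flags above are hypothetical):

program nested_fma
  use, intrinsic :: iso_fortran_env, only: real64
  use, intrinsic :: ieee_arithmetic, only: ieee_fma
  implicit none
  real(real64) :: a, b, d, e, f

  a = 1.1_real64; b = 2.2_real64; d = 3.3_real64
  e = 4.4_real64; f = 5.5_real64

  ! What a hypothetical --enforce-fma could produce for x = (a*(d*f+e)+b):
  print *, ieee_fma(a, ieee_fma(d, f, e), b)

  ! What a hypothetical --no-fma could produce: every product rounded separately.
  print *, (a*((d*f) + e)) + b
end program nested_fma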

@marshallward can you create a "main" issue for bit reproducibility, and then individual issues for each subset, and link this "fma" issue there? Then let's discuss each case separately. There will be "themes" or "approaches" that we can reuse for each of these individual issues.

marshallward commented 10 months ago

> How do you tackle MPI decomposition? Have you figured out a way to do it that is always bit-to-bit reproducible for a given N MPI ranks (or even gives the same bits for any N)?

Most of the work in our models is field updates, which are reproducible under decomposition as long as domain halos are updated. Summations and other reduction operations in most models that I have seen gather the values onto a process (either root or all) and do an ordered local sum. Our model uses something like a fixed-precision solution over multiple integer fields (the default range is 10^-42 to 10^41; see the paper), but the idea is similar. Neither is very fast, and the poor performance is only tolerable because these are infrequent operations.
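
As an illustration of the gather-and-ordered-sum approach (a hypothetical sketch, not our model's code; the fixed-precision scheme is described in the paper):

subroutine reproducible_sum(local, global, comm)
  use, intrinsic :: iso_fortran_env, only: real64
  use mpi
  implicit none
  real(real64), intent(in)  :: local    ! this rank's partial sum
  real(real64), intent(out) :: global   ! identical on every rank, every run
  integer, intent(in) :: comm
  real(real64), allocatable :: parts(:)
  integer :: nranks, ierr, i

  call mpi_comm_size(comm, nranks, ierr)
  allocate(parts(nranks))

  ! Gather every rank's partial sum onto every rank...
  call mpi_allgather(local, 1, mpi_real8, parts, 1, mpi_real8, comm, ierr)

  ! ...then sum in a fixed (rank) order, independent of message arrival order.
  global = 0.0_real64
  do i = 1, nranks
     global = global + parts(i)
  end do
end subroutine reproducible_sum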

I may be missing some details, but that is most of it. You're welcome to reach out if you have other questions.

> So yes, I think the solution using parentheses should work: when you write x = (a*(d*f+e)+b), this would be converted to two FMAs: x = fma(a, fma(d, f, e), b). Compilers could implement options to enforce this: for example, --enforce-fma would convert (a*(d*f+e)+b) to fma(a, fma(d, f, e), b), and --no-fma would convert it to add(mul(a, add(mul(d, f), e)), b), or something like that.

I agree that parentheses help to solve many of these problems, including the initial a*b + c*d case. If we could know whether a*b + c will use fma(a,b,c) or add(mul(a,b),c) based solely on the hardware and compiler settings, then I think that might get us to a safe place. And for a complex expression like a*b*c + d*e, a refactor such as yours might be sufficient.

I will check with others and see if this is a path forward for us.

> @marshallward can you create a "main" issue for bit reproducibility, and then individual issues for each subset, and link this "fma" issue there? Then let's discuss each case separately. There will be "themes" or "approaches" that we can reuse for each of these individual issues.

Good idea, I would appreciate the opportunity! I would not expect the language to accommodate all of our needs, but it could be helpful to document the most challenging use cases.

certik commented 10 months ago

@marshallward I see, thanks for the clarification about the MPI. Makes sense to me. Essentially you trade some performance in exchange for bit reproducibility.

> Good idea, I would appreciate the opportunity! I would not expect the language to accommodate all of our needs, but it could be helpful to document the most challenging use cases.

Perfect, go ahead and do that. I think that most of your needs can be accommodated by a compiler or by compiler options. Compilers should coordinate on many such options; in a way this is outside of the standard, but I think this site is great for that, and we should pursue it.