riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

Half-precision (SEW=16) floating-point support #349

Open kasanovic opened 4 years ago

kasanovic commented 4 years ago

There is great current interest in half-precision floating-point, in either IEEE FP16 or bfloat16 format. The vector spec already has encoding space for these types, but the scalar support for half-precision needs to be defined first.

chuanhua commented 4 years ago

half-precision-floating-point.pdf

Attached is the instruction encoding for the half-precision floating-point instructions. The highlighted instructions, FCVT.H.D and FCVT.D.H, are the conversions I am not sure will be defined. I think they should be, to avoid the precision loss of converting in two steps through single precision.

aswaterman commented 4 years ago

There is still some debate as to which form this extension should take: in particular, it might be better to provide widening ADD/MADD in addition to, or instead of, the basic H instructions.

chuanhua commented 4 years ago

Can we define/ratify the base scalar H instructions first, and then define the widening/narrowing instructions later as an optional extension covering the H/S base extensions, so that the base set stays consistent across H/S/D?

JamesKenneyImperas commented 4 years ago

Please also consider and define the effect of this on the privileged architecture. How is the presence of half-precision indicated in the misa register, for example? (The H bit is already allocated for another purpose.)

kasanovic commented 4 years ago

To support half-precision in the vector unit, it is probably sufficient to define a small subset of the half-precision FP instructions (Zfhmin?), namely FLH, FSH, FMV.X.H, FMV.H.X, FCVT.S.H, FCVT.H.S. FLH and FMV.H.X would NaN-box out to FLEN. FSH would ignore the high bits, as with other FS* instructions. FMV.X.H would sign-extend to XLEN (for symmetry with fmv.x.w). I believe someone asked for this subset a while back.

For processors that want to support scalar half-precision more fully (Zfh?), we can have a separate extension adding these instructions with the H format. However, when providing fuller support, a widening mul-add will really be needed, as it is rare to want to perform a half mul-add into a half-precision accumulator (in fact, that form might be explicitly omitted from the required extension). One thought is to use one of the unused rounding modes to encode "widening with dynamic rounding mode". The mul-add instructions could then encode single*single+double->double as well as half*half+single->single. Similarly, the same encoding for multiply would mean N*N->2N, but for add/subtract could mean 2N+/-N->2N. Another pattern for add is N+/-N->2N, but I think this is not useful enough to deserve a separate encoding, as opposed to synthesizing it with converts.

@jhauser-us ?

jhauser-us commented 4 years ago

To support half-precision in the vector unit, it is probably sufficient to define a small subset of the half-precision FP instructions (Zfhmin?), namely FLH, FSH, FMV.X.H, FMV.H.X, FCVT.S.H, FCVT.H.S.

That's certainly one option. I don't have an opinion whether we need a Foundation-sanctioned minimal subset for half-precision, so I'll leave that one for others to fight over. But I agree with the behaviors you describe.

However, when providing fuller support, a widening mul-add will really be needed, as it is rare to want to perform half mul-add into half-precision accumulator (in fact, that might be explicitly omitted as required in extension).

We probably need to require the non-widening version of fused multiply-add. Since RISC-V doesn't support odd rounding, there won't be an easy way to properly synthesize a half-precision fused multiply-add from the other instructions. When a source language like C has both a half-precision floating-point type (short float?) and a fused multiply-add function (fma), it's reasonable for a programmer to expect fma on half-precision data not to be more than an order of magnitude slower than single-precision.

One thought is to use one of the unused rounding modes to encode "widening with dynamic rounding mode".

Although I'm not wild about this idea, I have to agree it would be the best way to extend the existing scalar floating-point instructions to include those widening functions. If we do this, I recommend using rm code 110, giving us this set of possible new instructions:

 rs3  10  rs2   rs1  110  rd   1000011   FMADD.S.H
 rs3  10  rs2   rs1  110  rd   1000111   FMSUB.S.H
 rs3  10  rs2   rs1  110  rd   1001011   FNMSUB.S.H
 rs3  10  rs2   rs1  110  rd   1001111   FNMADD.S.H
00000 10  rs2   rs1  110  rd   1010011   FADD.S.H
00001 10  rs2   rs1  110  rd   1010011   FSUB.S.H
00010 10  rs2   rs1  110  rd   1010011   FMUL.S.H

 rs3  00  rs2   rs1  110  rd   1000011   FMADD.D.S
 rs3  00  rs2   rs1  110  rd   1000111   FMSUB.D.S
 rs3  00  rs2   rs1  110  rd   1001011   FNMSUB.D.S
 rs3  00  rs2   rs1  110  rd   1001111   FNMADD.D.S
00000 00  rs2   rs1  110  rd   1010011   FADD.D.S
00001 00  rs2   rs1  110  rd   1010011   FSUB.D.S
00010 00  rs2   rs1  110  rd   1010011   FMUL.D.S

 rs3  01  rs2   rs1  110  rd   1000011   FMADD.Q.D
 rs3  01  rs2   rs1  110  rd   1000111   FMSUB.Q.D
 rs3  01  rs2   rs1  110  rd   1001011   FNMSUB.Q.D
 rs3  01  rs2   rs1  110  rd   1001111   FNMADD.Q.D
00000 01  rs2   rs1  110  rd   1010011   FADD.Q.D
00001 01  rs2   rs1  110  rd   1010011   FSUB.Q.D
00010 01  rs2   rs1  110  rd   1010011   FMUL.Q.D

Feel free to hate on my proposed mnemonics.

jhauser-us commented 4 years ago

I wrote:

Since RISC-V doesn't support odd rounding, there won't be an easy way to properly synthesize a half-precision fused multiply-add from the other instructions.

On further thought, I believe (though am not completely certain) that doing the computation in double-precision and rounding back to half-precision avoids any double-rounding error. However, I wouldn't want to assume that every RISC-V core that supports half-precision will also implement double-precision. It seems to me that single-precision + half-precision would be a reasonable choice for some embedded cores.

kdockser commented 4 years ago

Since RISC-V doesn't support odd rounding, there won't be an easy way to properly synthesize a half-precision fused multiply-add from the other instructions.

Is this a new issue that arises due to adding half-precision support? Or, is this a pre-existing issue with the scalar fused multiply-add instructions? Put another way, how are SP and DP fused multiply-add instructions currently synthesized from other instructions?

Just to clarify, my comment follows what jhauser-us wrote above:

On further thought, I believe (not completely certain) that doing the computation in double-precision and rounding back to half-precision avoids any double-rounding error.

Yes, FP16 FMA can be emulated in DP arithmetic, and SP FMA can be emulated in QP (quad precision), if you have it. We don't have another precision wide enough to emulate DP FMA. The bottom line is that the widening FMAs do not present a new emulation problem that would require us to support round-to-odd (aka jam).

FP16 and bfloat16 are fairly unusual in that they each have very limited precision. These data types tend to use widening fused multiply-adds (FMAs) almost exclusively, especially in the burgeoning field of machine learning. The larger formats have little need for widening FMAs, so I see little need to add them at this time.

Also, we can save precious opcode space by forgoing widening floating-point adds and subtracts, as there is little need for these. We can also eliminate the 16-bit widening FNMADD, as it can readily be achieved with a two-instruction sequence: negate the multiplier or multiplicand, then issue an FMSUB. FNMSUB can be eliminated similarly. This reduces the instructions proposed above to a more manageable:

FMADD.S.H
FMSUB.S.H
FMUL.S.H

gfavor commented 4 years ago

Scalar IEEE FP16 support is being added to the architecture (as Zfh) to enable adding support for FP16 to the vector extension. What is the plan for adding scalar bfloat16 support (to then enable adding vector bfloat16 support)?

jhauser-us commented 4 years ago

I wrote:

Since RISC-V doesn't support odd rounding, there won't be an easy way to properly synthesize a half-precision fused multiply-add from the other instructions.

kdockser:

Is this a new issue that arises due to adding half-precision support? Or, is this a pre-existing issue with the scalar fused multiply-add instructions? Put another way, how are SP and DP fused multiply-add instructions currently synthesized from other instructions?

The standard RISC-V set of single-precision floating-point instructions includes an instruction for single-precision fused multiply-add, and likewise for double-precision. Hence, there is never a need to synthesize fused multiply-add for single- or double-precision out of other instructions, unless no floating-point instructions exist for the format at all, in which case every floating-point operation must be synthesized for that format.

Krste suggested making half-precision different in this regard, omitting a purely half-precision fused multiply-add from the set of RISC-V instructions for half-precision. That's why the ability to synthesize a fused multiply-add operation specifically became a concern for half-precision, but not for single-precision or double-precision.

jhauser-us commented 4 years ago

Scalar IEEE FP16 support is being added to the architecture (as Zfh) to enable adding support for FP16 to the vector extension. What is the plan for adding scalar bfloat16 support (to then enable adding vector bfloat16 support)?

One school of thought (which I happen to share) is that the possible format variations are much too numerous to permit a complete set of distinct encodings in the 32-bit instruction format. Besides bfloat16, there are also requests for posits, and there will be more. This school believes the only way to satisfy the demands for all these different formats, other than larger instructions, is to have CSR mode bits that choose, for example, IEEE-standard 16-bit floating-point versus bfloat16, just as frm chooses the dynamic rounding mode.

There are additional complications, such as whether and how posits will mimic NaN-boxing. For bfloat16, there's a question of whether subnormals will be implemented, given that existing implementations apparently disagree on this point. If compatibility is important to both camps, it may even be necessary to define two bfloat16 formats, one with subnormals and one without. (I'll be happy to leave those debates to others.)

kasanovic commented 4 years ago

Consensus in the last vector task group meeting was that there are (at least) three levels of half-precision support: 1) the minimal set described above; 2) the obvious set, i.e., the existing FP instructions with the format field set to "half" (fmt=10); 3) an expanded set including widening mul-add and maybe others.

This would all be for IEEE FP16 by default.

A mode field in the fcsr would change the FP format, with the current value of 0 indicating IEEE. Other modes would include things like bfloat16 (i.e., half-precision operations would use bfloat16, while FP32-and-wider instructions remain standard IEEE).

bfloat16 needs a RISC-V standard behavior defined, which won't be ready soon.

The vector task group thinks we need a separate TG to work on options 1), 2), 3), and the alternate FP mode proposal.

For base V vector standard, we believe we need 2) and just IEEE FP for now, and are following the obvious, if non-ratified, encodings for half-precision support.

vowstar commented 4 years ago

The brain floating-point format (bfloat16) is very useful. Could we just copy the Zfh extension spec for bfloat16, or define special load/store instructions that load a bfloat16 value into an FP32 register and store an FP32 value as bfloat16?

kasanovic commented 3 years ago

Decision was made to not include half-precision by default in initial standard vector extensions, so labeling this as a post v1.0 issue.