Load-Immediate Instructions

liangkaiwang commented 1 year ago

Hi,

Is there any background info (slides/documentation) about the "Load-Immediate instruction" other than the text in the commentary?

Even though building this probably does not cost much hardware, during the RISC-V FP SIG meeting this week, many of the members had doubts about the value of adding this instruction. Has any performance analysis been done on this? If so, where can I find that information? Note that I am not trying to stop this spec from being on the fast track and/or from being ratified. I just want to make sure the performance analysis, if available, can be shared with the broader audience and/or noted in the spec.

thanks, Liang-Kai

aswaterman commented 1 year ago

I'm responding on behalf of the Unprivileged ISA Committee.

The FLI instructions were added in response to the empirical observation that floating-point constant loads are statically common in short floating-point routines. Sometimes, they account for a material portion of the dynamic instruction count, too: for example, when transcendental functions are called in a loop, constants are resynthesized on each iteration. In these cases, replacing a two-instruction sequence that includes a memory load with a single low-latency instruction is obviously profitable.

To minimize implementation cost, we constrained the instructions to generate only constants with few significant mantissa bits. This constraint significantly reduces the number of gates needed to project the immediate onto the IEEE 754 format. As a consequence of this constraint, approximations of irrational numbers were excluded from consideration.
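
As a rough illustration of the "few significant mantissa bits" constraint, the sketch below counts how many fraction bits a binary64 value needs below its implicit leading 1. The function name and the sample values are chosen here for illustration only; the thread does not list the 32 constants, and this is not taken from the spec.

```python
import math
import struct


def significant_fraction_bits(x: float) -> int:
    """Count the fraction bits a normal binary64 value needs after the implicit leading 1.

    Illustrative helper (not from the spec). Only meaningful for normal,
    finite values; zero, subnormals, infinities and NaNs are not handled.
    """
    # Reinterpret the double as a 64-bit unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    fraction = bits & ((1 << 52) - 1)  # low 52 bits hold the fraction field
    if fraction == 0:
        return 0  # exact power of two: no fraction bits needed
    # Index of the lowest set bit equals the number of trailing zero bits.
    trailing_zeros = (fraction & -fraction).bit_length() - 1
    return 52 - trailing_zeros


# 0.75 = 1.1b * 2^-1 and 2.5 = 1.01b * 2^1 need very few fraction bits,
# whereas the double nearest to pi uses almost the whole fraction field.
for value in (1.0, 0.75, 2.5, math.pi):
    print(value, significant_fraction_bits(value))
```

Values with a short fraction only require a handful of bits to be expanded into the IEEE 754 encoding, which is consistent with approximations of irrational numbers being excluded.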

To avoid defining a new instruction format, and to minimize encoding-space usage, we limited the immediate to 5 bits. We chose the 32 constants by statically analyzing libc, libm, and CMSIS-DSP, choosing the most popular ones that fit our regularity constraint.

These 32 constants seem to have very good coverage of the popular constant space, despite the small number of them and their small mantissas, suggesting we struck a good balance between hardware cost and coverage.

One data point is that 2.5% of the static code in libm.so is spent loading one of the 32 constants that FLI.S and FLI.D are capable of generating. So, these instructions would reduce static code size by 1.3% in this case, in addition to shortening some critical code paths and potentially eliminating some D$ misses.
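
The arithmetic behind the 1.3% figure can be sketched as follows, assuming each constant load is currently a two-instruction sequence that FLI replaces with one instruction of the same size. The function name is hypothetical, and the sketch ignores the constant-pool data itself and any compressed instructions.

```python
def static_size_reduction(fraction_spent: float, old_len: int = 2, new_len: int = 1) -> float:
    """Fraction of static code saved when each old_len-instruction sequence shrinks to new_len.

    fraction_spent is the share of static code occupied by the old sequences.
    Assumes all instructions have the same size (an assumption of this sketch).
    """
    return fraction_spent * (old_len - new_len) / old_len


saving = static_size_reduction(0.025)  # 2.5% of libm.so, per the data point above
print(f"{saving:.4f}")  # half of 2.5%, i.e. 1.25%, which the comment rounds to 1.3%
```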

Note also that similar instructions have precedent in other architectures, e.g. https://developer.arm.com/documentation/dui0802/a/FMOV_float_imm.

liangkaiwang commented 1 year ago

Thanks for the information Andrew!

I think using static analysis as an argument is perhaps only valid if the underlying processors have a limited amount of I$ (so perhaps just edge/embedded processors?). But for those processors, I honestly don't know how frequently they need to run transcendental functions and how accurate those functions need to be for their applications. I may be completely wrong, but I would guess that they only need simple transcendental functions with a very limited input range and ULP accuracy, plus a simple range-reduction function, so alternative solutions that avoid re-synthesizing the constants may be possible.

I am not against adding this instruction to the spec. Rather, as a newcomer to the community, I am trying to understand how instructions are generally proposed and accepted, and whether there are any criteria that members need to meet. For example, how much performance data or evidence of usefulness do members need to provide in order to start the process, and how similar does the proposed instruction need to be to one in an existing proprietary ISA in order to be considered?

aswaterman commented 1 year ago

There isn’t a specific quantitative threshold for adding new instructions. In this case, the instruction provides a clear but modest benefit, has very low cost, and has precedent in other architectures. Based on experience and pattern matching, that suffices to clear the bar in our eyes. If that sounds unscientific, it’s because there is indeed quite a bit of art and subjectivity in this process.

To your point about static analysis: we always should be concerned about static code size, and although it’s a more important metric for embedded systems than for high-performance processors, it’s important in the latter class, too. Being able to reduce I$ size is a boon for energy efficiency and cost.

Ideally, we would have dynamic statistics to share, as well, but they’re expensive to collect, and we didn’t deem it necessary in this case because we already have enough information to judge that the benefits outweigh the costs.

pdonahue-ventana commented 1 year ago

> we always should be concerned about static code size, and although it’s a more important metric for embedded systems than for high-performance processors, it’s important in the latter class, too. Being able to reduce I$ size is a boon for energy efficiency and cost.

This seems to imply that a smaller code footprint correlates with a smaller I$ footprint. Static analysis of code footprint shows the DRAM footprint but doesn't tell us anything about the I$ footprint. A single frequently used function would have more impact on the I$ than a thousand functions that nobody ever calls. A thousand functions that nobody ever calls would have a large impact on DRAM, which we all agree is a consideration in embedded systems but which I don't think is important in high-performance processors.
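
A toy model of this distinction, with entirely made-up sizes and call counts (nothing here is measured data, and the function name is hypothetical): static footprint sums every function, whereas only code that actually executes can occupy the I$.

```python
def footprints(functions):
    """Return (static_bytes, executed_bytes) for a list of (size_bytes, call_count) pairs.

    static_bytes counts every function in the binary; executed_bytes counts
    only functions that are called at least once, as a crude stand-in for
    the code that can ever be fetched into the I$.
    """
    static_bytes = sum(size for size, _ in functions)
    executed_bytes = sum(size for size, calls in functions if calls > 0)
    return static_bytes, executed_bytes


# Hypothetical program: one hot 256-byte function plus a thousand
# 256-byte functions that are never called.
program = [(256, 1_000_000)] + [(256, 0)] * 1000
static_bytes, executed_bytes = footprints(program)
print(static_bytes, executed_bytes)
```

In this made-up case the static footprint is about a thousand times larger than the executed footprint, so shrinking a never-called function changes the first number but not the second.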

aswaterman commented 1 year ago

These instructions appear in functions that do get called.

liangkaiwang commented 1 year ago

Hi Andrew,

Thanks so much for describing the flow and some of the history. This is helpful for a newbie like me to quickly understand the process in order to contribute more in the future.

Thanks,
Liang-Kai

(Sent by email in reply to Andrew Waterman's comment of Apr 29, 2023, 1:31 PM.)

aswaterman commented 1 year ago

Glad to have helped!

aamartin0000 commented 1 year ago

Andrew, at the risk of re-opening this can of worms: I realize this uses more opcode space, but did you consider using the rs1+rm fields (bits 19:12) to form an 8-bit "true" immediate, like the ARM instruction you referenced in an earlier reply?
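
For readers less familiar with the fields mentioned: in RISC-V floating-point instruction encodings, rs1 occupies bits 19:15 and rm occupies bits 14:12. A minimal sketch of the two immediate widths being compared, using hypothetical helper names and an arbitrary bit pattern rather than a real encoded instruction:

```python
def rs1_field(insn: int) -> int:
    """Extract the 5-bit rs1 field (bits 19:15) of a 32-bit instruction word."""
    return (insn >> 15) & 0x1F


def rs1_rm_field(insn: int) -> int:
    """Extract bits 19:12 (rs1 followed by rm) as one 8-bit value."""
    return (insn >> 12) & 0xFF


# Arbitrary word with only the two fields populated, for illustration.
word = (0b10110 << 15) | (0b101 << 12)
print(rs1_field(word))     # 5-bit immediate: 32 possible values
print(rs1_rm_field(word))  # 8-bit immediate: 256 possible values
```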

aswaterman commented 1 year ago

I would very much not like to reopen this issue, given the extension is frozen, and the benefit from futzing with the definition of this instruction is so small. The ARM immediate would need some tweaking to retain the benefits of the current FLI definition--for example, the ability to express canonical NaN and +Inf is much more important for static code size than large swaths of the constants the ARM format can express--and so we'd still end up with an irregular design, anyway.

aamartin0000 commented 1 year ago

Ok. I wish I had known about this extension much earlier so I could have raised this then. Other than not wanting to change this instruction, would this have had merit?

+Inf and cNaN could be expressible in FP8-E3M4 if we wanted to define it (e.g. exp=all-ones, mant=1110 or 1111; other mantissas would be normal). These are easily convertible to wider formats.
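
One way such a format could be decoded is sketched below. Several details are assumptions of this sketch and are not stated in the comment: the exponent bias of 3, the choice of mant=1110 for infinity and mant=1111 for NaN, and the treatment of exp=0 as subnormal. The function name is hypothetical.

```python
import math


def decode_e3m4(byte: int, bias: int = 3) -> float:
    """Decode a hypothetical FP8-E3M4 value (1 sign, 3 exponent, 4 mantissa bits).

    Assumptions (not from the thread): bias of 3, exp=111/mant=1110 is
    infinity, exp=111/mant=1111 is NaN, and exp=000 encodes subnormals.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 4) & 0x7
    mant = byte & 0xF
    if exp == 0x7 and mant == 0xE:
        return sign * math.inf
    if exp == 0x7 and mant == 0xF:
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1 (assumed convention).
        return sign * (mant / 16) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1, including the remaining exp=111 encodings.
    return sign * (1 + mant / 16) * 2.0 ** (exp - bias)


for code in (0x30, 0x38, 0x70, 0x7E, 0x7F):
    print(hex(code), decode_e3m4(code))
```

Because every finite value here has at most four fraction bits, widening to single or double precision only needs exponent re-biasing and zero-padding of the mantissa.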

kasanovic commented 1 year ago

The majority of the benefit of this instruction comes from just a few of the values - the others were filled as having some demonstrated utility and being easy to generate in hardware given the desire to support the first few and because they don't occupy much opcode space. While this instruction is useful, it doesn't rise to the level of meriting a new instruction format with a wider immediate. All ISA design is an art, and reasonable folks can disagree, but any further improvements are in the noise quantitatively.

riscv / riscv-isa-manual

Load-Immediate Instructions #1009