riscv / riscv-bitmanip

Working draft of the proposed RISC-V Bitmanipulation extension
https://jira.riscv.org/browse/RVG-122

gmp (GNU Multiprecision Library) issues #166

Open jim-wilson opened 2 years ago

jim-wilson commented 2 years ago

Not sure if there is a better place to discuss this, but bitmanip seemed like a reasonable place to start.

The GMP author Torbjorn Granlund posted a message in September pointing out that RISC-V gmp code is slower than our competitors' due to the lack of some interesting instructions. A double-word add is 7 instructions on RISC-V, but only 2 for ARM and x86, because they have add-with-carry instructions. https://gmplib.org/list-archives/gmp-devel/2021-September/006013.html

I've worked with Torbjorn in the past, so I exchanged some email with him. I pointed him at V, K, and B instructions that solve some of his problems. V incidentally has add with carry, but using vector instructions inside GMP is probably inconvenient, and we still need extra instructions: one to produce the carry out, and one to move the carry out to v0. But at least this is an improvement over the scalar case, as there are fewer dependencies between the instructions.
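To see the asymmetry concretely, a plain 128-bit add written with the compiler's __int128 extension (just an illustration, not code from Torbjorn's mail) compiles to roughly add + adc on x86-64 and AArch64, but needs an extra sltu-style carry recovery on RV64:

    typedef unsigned __int128 u128;  /* GCC/Clang extension on 64-bit targets */

    /* Two instructions where an add-with-carry exists; on RV64 the
       low-word carry must be recomputed with a compare before the
       high-word add. */
    u128 add128(u128 a, u128 b)
    {
        return a + b;
    }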

Another related issue here is that a compare-negated instruction like the ARM cmn would help. See for instance https://gmplib.org/repo/gmp/file/tip/mpn/riscv/64/aors_n.as The inputs are in a4 and a6, the carry-in in t6, and each sequence is followed by "add t6,t2,t3" to produce the carry-out. In the add case, the core instructions are

    add  t0, a4, a6
    sltu t2, t0, a4
    add  t4, t0, t6
    sltu t3, t4, t0

The second and third instructions both depend on the first instruction but can be run in parallel. The fourth instruction depends on the third instruction. So this is a minimum of 3 cycles. But in the sub case, we have

    sub  t0, a4, a6
    sltu t2, a4, a6
    sub  t4, t0, t6
    sltu t3, t0, t6

The first two instructions can be run in parallel. The third and fourth depend on the first one and can be run in parallel. So this is a minimum of 2 cycles. This gives us the odd situation that subtract is faster than add. A compare-negative instruction like sltnu would fix this, and then the add sequence would look like the sub sequence, except using sltnu instead of sltu.
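In C terms, one limb of the carry-propagating add looks like this (a sketch of what GMP's generic code boils down to, with made-up names; the comments map each statement to the instructions listed above):

    #include <stdint.h>

    static inline uint64_t add_limb(uint64_t a, uint64_t b,
                                    uint64_t cin, uint64_t *cout)
    {
        uint64_t s1 = a + b;     /* add  t0, a4, a6 */
        uint64_t c1 = s1 < a;    /* sltu t2, t0, a4 */
        uint64_t s2 = s1 + cin;  /* add  t4, t0, t6 */
        uint64_t c2 = s2 < s1;   /* sltu t3, t4, t0 */
        *cout = c1 + c2;         /* add  t6, t2, t3 */
        return s2;
    }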

add-with-carry would be even better, but without a carry flag that is a three-input instruction producing two outputs, which doesn't work. We could have separate instructions to produce the sum and the carry-out, like V does, in which case we could do this with two three-input, one-output instructions that can be run in parallel, but there is still the encoding space issue. With add and sub we would need four 4-operand instructions, which would take a lot of the 32-bit encoding space.
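As a rough model of that split (illustrative helper names, not proposed mnemonics), the two operations would be something like:

    #include <stdint.h>

    /* Sum half: a + b + cin modulo 2^64. */
    static inline uint64_t addc_sum(uint64_t a, uint64_t b, uint64_t cin)
    {
        return a + b + cin;
    }

    /* Carry half: carry out of a + b + cin, with cin in {0,1}.  Both
       helpers read only the original inputs, so the two instructions
       could issue in parallel, as the V pair does. */
    static inline uint64_t addc_carry(uint64_t a, uint64_t b, uint64_t cin)
    {
        uint64_t s = a + b;
        return (uint64_t)((s < a) | ((s + cin) < s));
    }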

tege17 commented 2 years ago

Nice summary, Jim!

The real workhorse loop for almost any bignum application is the formation of sums R += Ab, where R and A are n-word variables and b is a single limb (or perhaps 2 limbs if the single-limb variant just cannot be made fast). Many CPUs today run that at close to n cycles. Apple's M1 is the fastest at 1.25n, while the latest generation of PC processors does it in around 1.5n to 1.6n.

I have been involved in advising CPU designers for many years, and have not always been impressed by the instructions they have added to improve the above operation. They might add ever so powerful multiply primitives to their "vector" insn set, but any further operations need to pass through a narrow path to another register set, and the accumulation is hindered by a long recurrent dependency chain.

I cannot stress this enough: To improve the critical applications that rely on bignum arithmetic, one needs to understand the very few basic but perhaps intricate loops that are used. If only one loop is to be improved, do R += Ab, preferably for single-limb b. If you want to stay competitive at the high end, it needs to run at < 2 cycles per limb.

Necessary but not sufficient requirements:

  1. 1-cycle multiplication throughput for 64b * 64b -> 128b.
  2. 1-cycle latency from carry-in to carry-out for the accumulation. Note that R += Ab means we add 3 n-limb operands, i.e., we have to maintain two semi-independent carry chains (sketched below).
  3. If the instruction set is weak, a really wide-issue CPU.
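For concreteness, the loop I mean is the one GMP calls mpn_addmul_1; a minimal portable sketch (not GMP's actual code) looks like this:

    #include <stdint.h>

    /* r[0..n-1] += a[0..n-1] * b; returns the final carry limb.
       The compiler lowers the 128-bit accumulation to mul/mulh plus
       adds, which is where the two carry chains show up. */
    static uint64_t addmul_1(uint64_t *r, const uint64_t *a, long n, uint64_t b)
    {
        uint64_t carry = 0;
        for (long i = 0; i < n; i++) {
            unsigned __int128 t = (unsigned __int128)a[i] * b + r[i] + carry;
            r[i]  = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        return carry;
    }

Whether this can be kept below 2 cycles per limb comes down to how cheaply the carry recurrences can be scheduled, which is exactly where the missing instructions hurt.
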
jim-wilson commented 2 years ago

A comment from the libera #riscv IRC channel.

<Sofia> jimwilson: As I've said a few times here regarding adc. We really don't need it. If you're doing high performance math, you want to defer communication between limbs using a reduced radix representation. This cuts the dependency across limbs, so they may be computed in parallel by higher end cores. This could mean your 2-4 independent adds or multiplies all run in the single cycle. You do this same
<Sofia> optimization in the x86 world (where adc is simply too slow with its dependency). Ex.  https://www.chosenplaintext.ca/articles/radix-2-51-trick.html 
<Sofia> Once you've (analytically) done enough additions or multiplies that doing yet another would overflow in the worst case, you can apply the carries as needed. However this worst-case may not be your average case, so you might guard the communication with a fast-path check if your future adds or muls could overflow the current state. Further deferring in the average or best case. Only paying for the
<Sofia> carries if you added enough large numbers (worst case)

That article assumes 4-wide issue for simple ALU instructions. This may not work as well on a machine that can only issue 1 or 2 simple ALU instructions per cycle.
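For readers not familiar with the trick, it looks roughly like this in C (constants as in the article for a 255-bit value in radix 2^51; this is just a sketch, not code from the article):

    #include <stdint.h>

    #define LIMBS      5
    #define RADIX_BITS 51
    #define LIMB_MASK  ((UINT64_C(1) << RADIX_BITS) - 1)

    /* Limb-wise add with no carry propagation: the five adds are
       independent, so a wide core can issue them in parallel.  The
       13 spare bits per limb absorb the deferred carries. */
    static void add_redundant(uint64_t r[LIMBS],
                              const uint64_t a[LIMBS], const uint64_t b[LIMBS])
    {
        for (int i = 0; i < LIMBS; i++)
            r[i] = a[i] + b[i];
    }

    /* Deferred carry propagation, done only when enough additions have
       accumulated that the spare bits could overflow; returns the carry
       out of the top limb for the caller to fold back in. */
    static uint64_t normalize(uint64_t r[LIMBS])
    {
        uint64_t carry = 0;
        for (int i = 0; i < LIMBS; i++) {
            uint64_t t = r[i] + carry;
            r[i]  = t & LIMB_MASK;
            carry = t >> RADIX_BITS;
        }
        return carry;
    }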

tege17 commented 2 years ago

Not much of what Sofia says makes much sense, unfortunately.

Incorrect statement: "adc is simply too slow" [on x86 to be used]. It is not slow; it is widely used for bignum arithmetic. In fact, modern x86 cores do renaming on the carry bit. The latency is 1 cycle, and one can run 2-3 chains in parallel with renaming.

Note also that neither Jim nor I have suggested that RISC-V should have any adc-like instruction. That is a straw man argument.

Using a lower limb base than 2^64 surely works. If you try encoding a bignum add using (say) 2^60 as a base, you will notice that the critical path is still 2 instructions (srl+add). We already have a 2-instruction critical path for bignum subtraction with base 2^64, while it is 3 instructions for bignum add. See Jim's original post here.
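Spelled out, one limb of a base-2^60 carry-propagating add looks like this (illustrative constants, not GMP code); the carry-in to carry-out recurrence is the add followed by the srl, so still 2 instructions:

    #include <stdint.h>

    #define BASE_BITS 60
    #define BASE_MASK ((UINT64_C(1) << BASE_BITS) - 1)

    /* r = a + b in base 2^60, n limbs; returns the final carry.
       a[i] + b[i] can be hoisted off the recurrence, but the
       carry -> add -> srl chain remains. */
    static uint64_t add_n_base60(uint64_t *r, const uint64_t *a,
                                 const uint64_t *b, long n)
    {
        uint64_t carry = 0;
        for (long i = 0; i < n; i++) {
            uint64_t s = a[i] + b[i] + carry;
            r[i]  = s & BASE_MASK;
            carry = s >> BASE_BITS;
        }
        return carry;
    }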

I realise that Sofia mainly talks about multiplication when suggesting a smaller limb base. Sure, there one can accumulate and do very few carry propagations. Yes, that works. We have done work on that in GMP (some 15 years ago, see the GMP "nails" feature). It is not free, and modern architectures/implementations have not needed it. The extra cost is significant, as it makes multiplication instructions less effective. It also tends to mess badly with branch prediction, which anybody who has done real work (as opposed to merely theorised) in the area knows.

Sofia's "fast path" check sounds like an awful idea. Data-dependent branches is a major source of side channel leakage.

sofia-snow commented 2 years ago

When I said adc is slow, I was only contrasting the dependent-chain add, adc, adc, adc (4 cycles) with the independent add, add, add, add, add (1-5 cycles; core implementation dependent) as demonstrated in the linked article. The reduced radix form needs one more addition, but eliminates the communication.

Obviously there is a cost in translating between the representations, and normalization is inevitable. However, those are relatively cold functions. If you save 1-3 cycles per operation, and you do many operations, the savings quickly add up and outweigh the costs.

I would not blindly use a reduced radix form. I'd compare the costs for the use case. Such analysis is probably out of scope for GMP, but it is not for my own optimizing-compiler interests (motivated by cryptography and graphics).

You say the critical path involves a shift; that sounds like carrying? The purpose of the reduced radix form is reducing the communication. The hot code (the many arithmetic operations) consists only of the respective instructions. Adding 3 limbs is only 3 independent add instructions.

The "fast path" check can be performed sufficiently early, say before the last several adds, such that its value is known before the branch instruction is even decoded.

I have never used GMP and did not realize it was designed for real-world cryptography, beyond prototypes and unsafe references. Now I see cryptography is "the main target". :-)

In that case, I agree the data dependency would be ill-placed if the use involved secrets. However, any non-sensitive uses should still be able to benefit from the shortcut.

I'll admit branch predictor details are a weak spot personally.

If I'm missing anything, I'd appreciate corrections.

jim-wilson commented 2 years ago

Mitch Alsup made an interesting suggestion on the isa-dev mailing list when the carry bit came up in the discussion. His suggestion is to add an instruction which decorates several following instructions with extra opcode bits. So the add with carry sequence becomes CARRY R16,{{I}{IO}{IO}{O}} ADD R12,R4,R8 // carry Out only ADD R13,R5,R9 // Carry In and Out ADD R14,R6,R10 // Carry In and Out ADD R15,R7,R11 // Carry In only And likewise for multiply, divide, and shift. This solution is used in Mitch's "My 66000" ISA. https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/gg7ZW0Z__5c/m/SMBrYOEMCwAJ