mit-plv / fiat-crypto

Cryptographic Primitive Code Generation by Fiat
http://adam.chlipala.net/papers/FiatCryptoSP19/FiatCryptoSP19.pdf
Other
705 stars 147 forks source link

Jasmin and alternate Montgomery strategies #783

Open spitters opened 4 years ago

spitters commented 4 years ago

Bedrock2 seems to share a lot of common goals with Jasmin. Jasmin also has an RISC-V branch Have you considered how they could interact?

JasonGross commented 4 years ago

I personally have not. My guess is that adding a Jasmin backend would be comparably hard to adding a bedrock2 backend. Our bedrock2 backend lives in src/Bedrock and was written pretty much entirely by @jadephilipoom, who might have useful things to say. As far as making bedrock2 and Jasmin interact directly, that's probably better asked on the bedrock2 repo, to @andres-erbsen and/or @samuelgruetter .

andres-erbsen commented 4 years ago

The main reason for connecting to bedrock2 is that other projects in our group are connecting to bedrock2, which we intentionally designed as "common denominator" language. Other existing developments include non-trivial heap data structures with specifications in separation logic and drivers for hardware peripherals using memory-mapped i/o.

I still think that a fiat-crypto backend to output Jasmin would be cool (with or without proofs), but IIUC Jasmin requires program variables to be annotated as reg/stack at the source level, which fiat-crypto currently doesn't do.

dfaranha commented 4 years ago

Jasmin indeed requires reg/stack annotations in the source, and the current compiler is unfortunately not able to spill in case there are too many reg variables.

samuelgruetter commented 4 years ago

Tbh the current bedrock2 compiler does not do spilling either, does not offer local variables on the stack, and the register allocation just replaces string names by register names, so if you generate Jasmin source where you annotate all variables with "register", you probably get a better compiler than the bedrock2 compiler as it's now...

dfaranha commented 4 years ago

I have been running some performance experiments in finite field arithmetic between my own Assembly code from RELIC, verified Jasmin and code generated by Fiat-crypto and the differences are pretty interesting.

Timings for Montgomery field multiplication modulo the BLS12-381 prime measured in my Cortex i7-8650U using GCC 10.1.0 below:

The SOS and CIOS methods refer to terminology used in this classic paper. The same ratios were roughly preserved when running a more complex computation, such as a whole pairing evaluation.

As somewhat expected, Jasmin stays as a middle-ground between convenience and efficiency, being easier to write than direct ASM but a bit slower. Inspecting the disassembled code shows that ASM and Jasmin make fairly efficient use of registers in straight-line code. Fiat-crypto is able to use mulx instructions, which is pretty nice, but I see a lot of housekeeping instructions to manage spills and load/write from memory. I partly attribute that to the CIOS method which constantly switches state from the multiplication to the reduction part.

While a Jasmin backend might provide a performance improvement, it is quite possible that a similar speedup can be obtained by trying different approaches for Montgomery multiplication.

andres-erbsen commented 4 years ago

Thanks for the benchmarks and analysis! We have indeed been attributing the slowness of our Montgomery multiplication code to inefficient spilling by C compilers. The differences you describe are compelling enough that I'm thinking I should really write a template that uses Comba+SOS and see how it fares. No clear timeline, though, there are other things I need to finish first.

dfaranha commented 4 years ago

Completely understandable, I meant my comment more as a mental note. We might give it a try at some point! :)

spitters commented 4 years ago

@samuelgruetter I see there's a bit of progress on spilling in bedrock2. Is there a convenient way to use it from fiat-crypto so that we can use it to time against hand written code in jasmin/relic?

jadephilipoom commented 4 years ago

I think it would be a pretty unfair comparison at the moment regardless, because we haven't put time into optimizing the bedrock2 code at all yet. There's obvious low-hanging fruit (like a very silly way of handling carries) that we haven't fixed, so it definitely won't compare well to code that someone has spent a bunch of time hand-optimizing. The whole bedrock2 back end is relatively new, and so far we have focused more on getting some output and getting the proofs working than tweaking that output to be fast. (Because of the nature of our pipeline, where optimizations are proven correct once-and-for-all, these changes won't change the proofs, so it makes sense to get the proofs working first and then add the optimizations.)

samuelgruetter commented 4 years ago

I agree with @jadephilipoom, and yes we're planning to run fiat-crypto code on bedrock2, but it still requires some work in the compiler and some integration work, it's not the case yet that there's a "convenient way to use it".

spitters commented 4 years ago

Thanks @samuelgruetter . Coming back to the original question, with your work, would it now be feasible to write a jasmin backend for bedrock2?

samuelgruetter commented 4 years ago

"Jasmin backend for bedrock2" doesn't make too much sense to me, because they both live in a similar place, so I'm assuming you meant "Jasmin backend for fiat-crypto". This would require deciding which variables to put on the stack vs into registers, so if you go directly from fiat-crypto to Jasmin, you'd have to implement this. Or you could wait for the not-yet-implemented spilling of the bedrock2 compiler, but this would probably introduce several inefficiencies, because we don't currently optimize its performance.

JasonGross commented 3 years ago

At some point we might also want to try the optimizations to CIOS described in https://hackmd.io/@zkteam/modular_multiplication as yielding a 10%--15% speedup, h/t @armfazh

spitters commented 2 years ago

@samuelgruetter I was wondering about the status of this issue. It seems you now have spilling https://github.com/mit-plv/bedrock2/commit/19fe67d23eafc5c7f7a676284c13694402803bf0 and are even working on new spilling https://github.com/mit-plv/bedrock2/commit/170a56324c52bd9d1be623dd51b8fa4fb6215e01 Does that make any progress on this issue?

samuelgruetter commented 2 years ago

The goal of the bedrock2 compiler is still not to spend any time on optimizations, but only to be a platform just good enough to enable Integration Verification projects consisting of end-to-end verified software/hardware stacks (potentially also including interaction with unverified, but clearly specified parts). The reason I'm working on spilling these days is just because I want to be able to call bedrock2-generated RISC-V code from unverified gcc-generated code, so the bedrock2 compiler needs to accept function arguments in the same argument registers as C compilers use, so these argument registers are now "special" and thus I have to adapt the spilling phase.

I'm not sure I understand your goals (maybe you could elaborate a bit?), but if you want to consume code generated by the bedrock2 compiler's very basic register allocator phase, it would be in this FlatImp language, where the abstract type varname has been instantiated to Z, and varnames between 0 and 32 stand for RISC-V registers, while varnames >= 32 stand for spill slots. Using this information, I guess this could be turned into Jasmin source code with annotations on which variables to spill? Alternatively, you could also import the code from after the (not very smart) spilling phase, which would be in the same FlatImp language, but this time with all varnames between 0 and 32.

spitters commented 2 years ago

Thanks for the link to the nice paper!

Our application is not set in stone, and to some extend it is a conceptual question: How can we bridge the gap between (fast) fiat-code and even faster jasmin code; see the timings by Diego above. Jasmin is high-level assembly and there is work on making it portable to other platforms than x86, like Risc-V. I read your answer as: the current spilling implementation will not solve that issue.

BTW: we have some progress on integrating bedrock2 with fiat for BLS, and will report on it soon.