lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0
2.56k stars 762 forks source link

Brainstorming future OTBN ISA extensions #10750

Open mcy opened 2 years ago

mcy commented 2 years ago

@moidx, @jadephilipoom, and I have been discussing future extensions we'd like to see in the OTBN ISA in future revisions. This issue is intended to record those in a central place, but they are in no way plan-of-record or intended to make it into the timeline for the current revision.

(Previously: https://github.com/lowRISC/opentitan/issues/5058).

An incomplete list of things we would like to see:

rswarbrick commented 2 years ago

These look fun! When thinking seriously about these, we'd need to work out why we're adding them. A couple of quick thoughts:

  1. Vectorized instructions Specifying these is always hard because there are so many different things you can do. Compare x86 SSE/AVX and the RISC-V vector extension, for example. We'd probably need some firm use cases to convince ourselves we were building the right thing.
  2. Multi-word BN instructions This is probably possible but running in a single cycle would probably have a big area impact and the resulting processor might be rather "lopsided" (able to do one thing really well, but a bit rubbish at other stuff given its size). There would also be awkwardness on the register file (where you suddenly need either lots more read and write ports or to re-shape it to work in 1024-bit words).
  3. Multi-bit carry Eminently doable! The only difficulty is probably encoding space, but we could cheat and do something like use registers n and n | 1 (like you were suggesting for the multi-word instructions)
  4. SHA-2 instructions Definitely do-able. But the permutation networks and such will probably be reasonably large, so we'd need a good use for them, I think.

@GregAC

jadephilipoom commented 2 years ago

Multi-word add

It could also still be useful for unsaturated limbs to have something that doesn't need a carry between adds -- e.g. I have 4 limbs but none of them are storing the full 256 bits, so an instruction could add the 4 pairs of corresponding limbs but have no need to carry between them (that would be done later as a multi-bit carry).

@felixmiller, any instructions you would like to see?

mcy commented 2 years ago

Rupert: a lot of these have come up in discussions with Jade about things that would simplify or improve the performance of our assembly. I think she and Felix can provide concrete use examples to help pare down what's worth spending area and encoding budget on.

moidx commented 2 years ago

Re. SHA2 instructions we should take SCA considerations into account. This requires careful consideration at the interface level, and it is possible we may want to opt for a micro code implementation instead of following the RISC-V specification.

moidx commented 2 years ago

Marking as P2 as this is not currently part of the current iteration plan.

cdgori commented 2 years ago

Re. SHA2 instructions we should take SCA considerations into account. This requires careful consideration at the interface level, and it is possible we may want to opt for a micro code implementation instead of following the RISC-V specification.

Yea, we should be very careful here with SHA2 instructions. We should discuss the exact use case too.

felixmiller commented 2 years ago

Two things we were discussing in the past:

Carry-less multiplication Nothing that we desperately need right now, but we probably want to support this in the (very) long term to enable polynomial arithmetic.

Composable (multi-limb) pseudo modular addm subm The addmand subminstructions (pseudo modular add/sub) come in very handy for any 256-bit elliptic curve implementations. For any multi-limb implementation this blows up to 3 steps: (1) Arithmetic operation (sub/add), (2) (speculative) modular correction, (3) final select. All this resulting in 1 machine instruction per step and limb (e.g. 6 instructions for P-384). Nice would be a composable multi-limb addm/subm that would take one instruction per limb. The big issue here is probably the register to store the (then larger) modulus.