lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

Questions about LFSR's for providing high bandwidth moderate-security randomized data. #1920

Closed martin-lueker closed 2 years ago

martin-lueker commented 4 years ago

In looking at the comments around #1845, I have some general system-wide questions that I'd been hoping to discuss with people at the security meeting, but we ran out of time. @cdgori and I have had a few conversations about this, but this might be another forum to get more opinions.

In the dialog around #1845 it seems like many different IPs are using internal LFSRs, and hoping that it all converges.

Please tell me your thoughts.

tjaychen commented 4 years ago

Hey Martin, just to comment on the first point. My opinion is indeed that, where appropriate (i.e., the bandwidth needs are met and the latency is acceptable), we should request entropy from the CSRNG directly. I think this will have to be looked at on a case-by-case basis.

As an example, there is a case in key manager where I would prefer to grab entropy from CSRNG, mostly because I can tolerate some latency, and I would prefer not to replicate an LFSR that is constantly counting N times over. The other case where I imagined this would be useful (although we'd need to get Felix's / Philipp's opinion) is a "get_rand" operation from the asymmetric crypto accelerators (used for blinding). As is, those would pull from an LFSR, but I think it makes more sense to pull from CSRNG.

Lastly, even for blocks using an LFSR, my expectation is that they would still poll the CSRNG for periodic re-seeding input. My general thinking is that this would allow the block-level LFSRs to be a bit narrower and a bit more random in their sequences.

Also, I agree with Eunchan: having A LOT of LFSRs in the system gets VERY expensive in terms of power (I have seen this firsthand). So my thinking is that at some point we should go over all the places that use an LFSR and make sure each one makes sense. If, for example, the entropy is used once in a while (e.g. to clear all internal state once a transaction finishes), IMO that should come from CSRNG. If new entropy is needed every cycle (e.g. S-box blinding: the Canright method calls for 16b of entropy per S-box byte, so 256b per cycle for the 16 state bytes), using CSRNG directly is borderline, since entropy stalls would stall the pipeline.

It would be great if Chris / others can chime in.

msfschaffner commented 4 years ago

I agree with @tjaychen that we should go through the design and look at all instances on a case-by-case basis. Whether or not you can actually save area or power by replacing a local LFSR with bits coming from CSRNG really depends on how the local LFSR is being employed, and what the security requirements are.

Just to provide an example here that is along the lines of what Tim mentioned RE re-seeding of local LFSRs:

We are using a 32bit LFSR in the alert handler ping mechanism where a random peripheral ID and a random waiting period (8bit - 24bit) are drawn every once in a while (10'000s of cycles apart). This does not require high bandwidth at all, but when the ID and waiting periods are drawn, they are drawn within a few cycles, and the number of draws can vary.

https://github.com/lowRISC/opentitan/blob/2b1f219f7174b90cf2b75151b78adf404f1c84c1/hw/ip/alert_handler/rtl/alert_handler_ping_timer.sv#L70-L86

My current implementation already assumes that it can draw 1 bit of entropy from CSRNG which gets XOR'ed into the state every time the LFSR is activated in order to offset the sequence. We are using "only" 1 bit for reseeding here since we were not too concerned about security for this particular use case. The purpose of this mechanism is just to make ping timing difficult to predict and is not part of a cryptographic data path.

Since this LFSR is not constantly spinning, its area and power consumption should be comparable to a solution where 1bit entropy from the CSRNG is buffered up in a local shift register, with the difference that the LFSR allows me to get around backpressure issues if I have to draw 2-3 values right one after the other.

RE the point on sharing LFSRs among multiple modules, I do not think that this will buy us much if the LFSRs are not constantly spinning, and there are not too many of them (say <10 instances). I rather have the feeling that we would then add unnecessary complexity, since more intermodule connections and arbitration / back pressure handling would be required... after all, one of the main advantages that I see in having a local (exclusive) LFSR instance is that it allows you to quickly generate a PRNG sequence on demand, and in a cheap way.
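
The reseeding scheme described above (one CSRNG bit XOR'ed into the state on every activation) can be sketched in Python as a behavioral model. This is only an illustration, not the RTL: the class name, the seed, the feedback mask (taps 32, 22, 2, 1, a commonly tabulated maximal-length choice), the choice to fold the entropy bit into the LSB, and the lockup handling are all assumptions.

```python
class ReseededLfsr:
    """Behavioral model of a 32-bit Galois LFSR offset by one entropy bit per activation."""

    # Assumed feedback mask for taps (32, 22, 2, 1); the real design may differ.
    POLY = 0x80200003
    MASK = 0xFFFFFFFF

    def __init__(self, seed: int = 0xDEADBEEF):
        if (seed & self.MASK) == 0:
            raise ValueError("an LFSR must not start in the all-zero state")
        self.seed = seed & self.MASK
        self.state = self.seed

    def step(self, entropy_bit: int = 0) -> int:
        # XOR the external entropy bit into the LSB to offset the sequence.
        s = self.state ^ (entropy_bit & 1)
        lsb = s & 1
        s >>= 1
        if lsb:
            s ^= self.POLY
        # Assumed lockup protection: fall back to the seed if the state hits zero.
        self.state = s if s != 0 else self.seed
        return self.state


def draw(lfsr: ReseededLfsr, nbits: int, entropy_bit: int) -> int:
    """Activate the LFSR once and return the low nbits of the new state."""
    return lfsr.step(entropy_bit) & ((1 << nbits) - 1)


# Back-to-back draws need no handshake with the entropy source.
lfsr = ReseededLfsr()
peripheral_id = draw(lfsr, 5, entropy_bit=1)   # hypothetical 5-bit ID field
wait_cycles = draw(lfsr, 16, entropy_bit=0)    # wait period, 16 bits chosen for illustration
```

The point of the model is the last two lines: several values can be drawn in consecutive cycles from local state, while the single entropy bit only needs to be valid whenever it happens to be sampled.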

tjaychen commented 4 years ago

Martin / Michael, do you guys think we should start going through this exercise soon? I think we certainly will need to anyways to understand how we should carve out the CSRNG outputs for example.

Let's say each crank gets us 256b (i'm making this up), we'll need to divvy up that number between the 1b entropy updates (to blocks such as alert handler), parallel load (to blocks like key manager), and perhaps other blocks that want to just use this as an LFSR seed. I have some thoughts, but would like to hear your input.

As Michael said though, the parallel load interfaces inevitably will involve some kind of handshaking. For the 1b interface, we likewise should think about if we need a valid bit, or if the downstream LFSRs are happy to just constantly churn on the same entropy bit (I'm guessing no probably...).

vogelpi commented 4 years ago

Thanks @martin-lueker for starting this discussion.

Based on our previous discussions, I was aware during the implementation of #1845 that the LFSR used for clearing the AES state might eventually be kicked out to use the CSRNG output directly. For this reason, I integrated it in a somewhat latency-tolerant way: The AES-internal FSMs can wait on the CSRNG for a couple of cycles (8-12 depending on key length or number of rounds) without stalling.

But as @tjaychen mentioned, this won't work for the S-Box masking where randomness is needed in every cycle. Maybe as long as a local LFSR is available, it can be re-used for this as well. Alternatively, a new LFSR is added just for this. Or, a second high-bandwidth CSRNG port needs to be added. It's not yet clear to me what we will need. Maybe there will be separate low- and high-rate CSRNG.

As for the forward/backward, do you mean that it should not be possible to find out future or past values based on the current value? If yes, I think that is not needed for our current use cases of the LFSR such as clearing registers and randomizing the ping time in the alert handler. (To get some more variation we also combine the LFSR with cipher S-Boxes or permutations in some of the cases.) But also here, I don't think this will be sufficient for the purpose of masking.

msfschaffner commented 4 years ago

@martin-lueker should we maybe start a sheet / doc to keep track of the modules that need entropy (annotated with required security strength, bitwidth and update rate)? That could help us go through the individual cases and make a more informed decision...

martin-lueker commented 4 years ago

Short comment: I think that is a great idea.

martin-lueker commented 4 years ago

Given the preference for getting entropy directly from a CSRNG, I think we should also consider making a new pipelined AES primitive that is extremely focused on bandwidth and less on obfuscation. I haven't looked at the quality of the RTL, but there are at least some Apache v2 examples of high-bandwidth AES blocks. Example: https://opencores.org/projects/tiny_aes We could either restyle it as is (and cite the original) or make our own. Thoughts? Is there an OT policy for adopting external open source blocks?

vogelpi commented 4 years ago

Thanks @martin-lueker for the update on the CSRNG yesterday. That was interesting. It made me think a bit about a bandwidth-optimized AES primitive. I had a quick look at the core you linked. What we should use depends a bit on the requirements.

At the moment, I would suggest to build a high-bandwidth cipher core from the components we have in our own AES module. The reasons are:

Implementing a high-bandwidth core based on our source code is not too much work. There are a couple of questions regarding pipelining and latency, as well as the initial latency after changing the key. The most efficient option (resources + design effort) would be to let the key expand module generate the round keys iteratively and store them in registers. This will take 14 cycles after a key switch. After that, we get 1 output block per cycle from the parallel cipher data path. Otherwise we need to unroll the key expand module as well, which means more design and verification effort. Maybe this discussion is a bit off-topic here and would be better held separately.

martin-lueker commented 4 years ago

Hi @vogelpi, sorry for the delay.
This all sounds great. Anything we can do to make a dramatically faster AES would be excellent in my view. This would be very beneficial, not just for CSRNG, but also for future products targeting a different balance between security and performance. I'm super excited to hear you say that it would not be much work. It's also nice to hear that the interface would remain the same.
@rasmus-madsen: Do you think the VIP could be used as is?

rasmus-madsen commented 4 years ago

@martin-lueker If the interface is the same, I don't see anything hindering the use of the current DV. We should be able to use the full environment as is; some adaptation of the sequences might be needed.

One thought regarding the RTL design: currently it takes a minimum of 4 clock cycles to provide one block of data (assuming a TL-UL burst of 4 beats).

So without a different input method, the best throughput we can achieve is 1/4 output block per cycle, not taking init cycles into account.
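
The I/O bound above can be written down as a rough throughput model. This is a back-of-the-envelope sketch: the function names are made up for illustration, the 32 bits per beat follow from 4 beats per 128-bit block, the 14-cycle key setup is the figure quoted earlier in this thread, and pipeline fill latency is ignored.

```python
BLOCK_BITS = 128  # AES block size in bits


def beats_per_block(bus_bits_per_beat: int) -> int:
    """Bus beats needed to deliver one 128-bit input block (ceiling division)."""
    return -(-BLOCK_BITS // bus_bits_per_beat)


def blocks_per_cycle(bus_bits_per_beat: int, core_blocks_per_cycle: float = 1.0) -> float:
    """Sustained throughput is capped by the slower of the bus and the cipher core."""
    return min(core_blocks_per_cycle, 1.0 / beats_per_block(bus_bits_per_beat))


def cycles_for(n_blocks: int, bus_bits_per_beat: int, key_setup_cycles: int = 14) -> int:
    """One-off key expansion plus input-limited streaming (pipeline latency ignored)."""
    return key_setup_cycles + n_blocks * beats_per_block(bus_bits_per_beat)


bus_limited = blocks_per_cycle(32)     # 4 beats per block over a 32-bit bus
direct_feed = blocks_per_cycle(128)    # a full block delivered every cycle
```

With a 32-bit bus the unrolled core would idle three cycles out of four, which is why the input method matters more than the core for this use case.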

vogelpi commented 4 years ago

Good point @rasmus-madsen on the I/O bandwidth limitation.

At the moment, I don't think there is the need for a higher-bandwidth, full-fledged AES module in the project. The main purpose of the high-bandwidth cipher core would be for CSRNG. In my view, re-using the DV framework of the medium performance AES module we currently have will help us to get some initial verification for this high-bandwidth cipher core. To thoroughly verify it, a lot more work will be needed.

tjaychen commented 4 years ago

i think maybe we don't have to rush for a high bandwidth, fully unrolled AES for CSRNG yet...i feel like even if we do, we're unlikely to make this fast enough to support the REALLY high performance cases (AES masking). It probably makes more sense to complete the entropy exercise first and then decide..

tjaychen commented 3 years ago

This is something we'll need to come back to. There have been some rumblings about not using something like the LFSR, but rather something with more of an unbiased distribution (i.e., a lightweight cipher).

msfschaffner commented 3 years ago

ok I see. do you mean we could consider using either PRESENT or PRINCE instead of AES to create a lightweight PRNG that is still based on a block-cipher?

tjaychen commented 3 years ago

yeah that's right. But we don't have to jump on that yet. It's not super clear yet if that's the direction we need to progress in.

vogelpi commented 3 years ago

If I got this right, it's not about replacing AES inside CSRNG but instead about using PRESENT/PRINCE instead of LFSRs + permutations for producing the pseudo-random data required e.g. for masking the AES accelerator (not the AES core inside the CSRNG). We need a lot of randomness there:

  • with Canright masking it's at least 360 bits every clock cycle
  • with domain-oriented masking it's at least 560 bits every 5 clock cycles.

Analyzing the distribution of the masking PRNG inside AES is something that I have had on my todo list for a long time now. Hopefully, I can look into that in Jan.

tjaychen commented 3 years ago

Yes, that's right. From what I gathered, for masking purposes at least, this gives a better distribution of random numbers vs. LFSR + permutation. But I've not dug up any literature to support this; it's more or less just from conversations in passing.

msfschaffner commented 3 years ago

Ok, I think for that many random bits, a good starting point could be a shallow (e.g. 3 round) PRINCE cipher. This yields 64bit per instance, hence it needs to be replicated accordingly.
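
As a quick sanity check on "replicated accordingly", the instance count can be estimated from the masking figures quoted earlier in this thread (360 bits per cycle for Canright masking, 560 bits per 5 cycles for domain-oriented masking). The sketch assumes each instance delivers a fresh 64-bit block every cycle, which is an assumption about the shallow unrolled implementation; the function name is made up for illustration.

```python
PRINCE_BLOCK_BITS = 64  # output width of one PRINCE instance


def instances_needed(bits: int, cycles: int = 1) -> int:
    """Instances required if `bits` may be generated evenly over `cycles` cycles."""
    per_instance = cycles * PRINCE_BLOCK_BITS
    return -(-bits // per_instance)  # ceiling division


canright = instances_needed(360)        # 360 bits needed every cycle
dom_spread = instances_needed(560, 5)   # 560 bits, generation spread over 5 cycles
dom_burst = instances_needed(560)       # 560 bits all needed in a single cycle
```

Under these assumptions Canright masking needs 6 instances, while domain-oriented masking needs between 2 and 9 depending on whether the random bits can be buffered across the 5 cycles.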

RE the feedback construction, I am not sure what guarantees you get from a (shallow) cipher that feeds on itself (i.e., whether the generated PRNG sequence is guaranteed to cover the complete set of states in the 64bit space, or not). If that is of a concern, we could also feed the output of a 64bit LFSR into a shallow block cipher instead. Due to the bijectiveness of the cipher, the PRNG sequence is then guaranteed to cover all states eventually (except for the encrypted version of the all-zero state).
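
The coverage argument can be checked exhaustively on a scaled-down model. The sketch below uses a 16-bit maximal-length Galois LFSR (mask 0xB400) and a toy invertible mixing function as a stand-in for the shallow cipher; the mixing function is hypothetical and has nothing to do with PRINCE beyond being a bijection.

```python
MASK16 = 0xFFFF


def lfsr16_step(state: int) -> int:
    """One step of a 16-bit Galois LFSR with feedback mask 0xB400 (period 65535)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state


def toy_bijection(x: int) -> int:
    """Invertible 16-bit mixing function standing in for a shallow block cipher."""
    x = (x ^ 0x5A5A) & MASK16   # XOR with a constant is invertible
    x = (x * 0x9E37) & MASK16   # multiplying by an odd constant mod 2^16 is invertible
    x ^= x >> 7                 # xorshift is invertible
    return x


def covered_outputs() -> set:
    """Collect the outputs seen over one full LFSR period."""
    seen = set()
    state = 1
    for _ in range(MASK16):     # 65535 steps visit every non-zero LFSR state once
        seen.add(toy_bijection(state))
        state = lfsr16_step(state)
    return seen


outputs = covered_outputs()
missing = set(range(1 << 16)) - outputs
```

Here `missing` contains exactly one value, `toy_bijection(0)`, which is the small-scale analogue of "all states except the encrypted version of the all-zero state".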

mwbranstad commented 3 years ago

This issue is over a year old. I think many of the concerns have been worked out and implemented already. I propose we close this issue and open new ones to address any remaining specific issues.

cdgori commented 3 years ago

@mwbranstad at this point, I don't think this should be gating/blocking on CSRNG but rather on consumers of CSRNG entropy?

Agreed that a lot of things have been resolved but we do need to figure out for at least: alert handler, AES (with masking), KMAC (with masking), and OTBN (URND) if we will use LFSRs, some shallow PRINCE structure, etc.

Would be good for everyone to have a quick think-through and see if I've missed any other consumers. ( @tjaychen / @eunchan / @msfschaffner / @imphil for visibility)

[Small edit RND -> URND by Philipp.]

msfschaffner commented 3 years ago

I went through these designs and did a slightly more formal accounting below, with a short description of what each mechanism does.

| Block | Function | High Statistical Fidelity | State Visible To SW | PRNG Mechanism | Accepted & Resolved | Comments | RTL |
|---|---|---|---|---|---|---|---|
| ALERT_HANDLER | ping timer | No | No | LFSR + permutation | yes | | link |
| OTP_CTRL | background check timer | No | No | LFSR + permutation | yes | | link |
| SRAM_CTRL | SRAM initialization | No | Yes* | LFSR + permutation | yes | Although the state is SW visible, this is not intended as a secure wipe feature. | link |
| IBEX | random instructions | No | No | LFSR + permutation | yes | | link |
| KMAC | masking | Yes | No | LFSR + permutation + 4bit Sbox layer | yes | | link |
| KMAC | clearing | Yes | No | LFSR + permutation + 4bit Sbox layer | yes | Implementation pending https://github.com/lowRISC/opentitan/issues/9041 | N/A |
| AES | masking | Yes | No | LFSR + permutation + 4bit Sbox layer | yes | | link |
| AES | clearing | Yes | No | LFSR + permutation + 4bit Sbox layer | yes | | link |
| KEYMGR | clearing | Yes | No | LFSR + permutation + 4bit Sbox layer | yes | | link |
| OTBN | URND | Yes | Yes | xoshiro256** | yes | | link |

Items up for discussion:

What do people think?

(Btw, please correct me if I missed an LFSR or got the accounting / mechanism description wrong above)

mwbranstad commented 3 years ago

@msfschaffner Keeping the LFSR in ES. It was intended for a faster means of generating entropy with the primary goal of software use.

msfschaffner commented 3 years ago

But do I understand correctly that the entropy generated by that LFSR is still being post processed afterwards in CSRNG? It seems that that one does not really fit into this accounting here, since it is not an entropy consumer with the need of a fast local entropy expansion mechanism...

mwbranstad commented 3 years ago

Correct understanding, the output will ultimately feed into the CSRNG module.

cdgori commented 3 years ago

Questions thinking about other potential LFSR users:

1- Is there no wipe/clear in KeyMgr? That sort of surprises me.

2- Should we be implementing secure clear of the 1600b state in KMAC (as AES does)? There's not an obvious "key register" to clear for KMAC but basically the whole hash state is "dirty" if actually doing a KMAC op.

3- For completeness: in theory, if we support sideload to HMAC then we probably need some secure clear of state as well.

4- Also for completeness: does Ibex need an LFSR for dummy instruction insertion? (or other countermeasures)

Feel free to have a look at an OT block diagram and see if you can think of any other users - I could not come up with anything else beyond these 4.

tjaychen commented 3 years ago

  1. keymgr's lfsr usage is here
  2. For any case where the prng output is used for masking purposes we should just require the permutation (so add kmac to the list).
  3. ibex's lfsr is here

msfschaffner commented 3 years ago

Thanks, I have added the key manager and Ibex LFSRs and added a secure wipe feature for KMAC to the list (that one is marked as not implemented yet).

As for Chris' 3rd point, is HMAC sideload confirmed?

Open items:

tjaychen commented 3 years ago

i don't think it's really necessary for ibex to require the permutation. Since they are just using this to pick the random instructions. Similarly for keymgr, it's not really meant to be high quality or masking, so probably okay to not have the sbox layer.

I think it makes sense to have the sbox4 layer for KMAC.

I'm tempted to say the KMAC clear is okay as is (it clears to 0, but we would probably need @eunchan's help to confirm). The state space is... 1600b x 2... that's enormous. Even though we're transitioning back to a known value, I find it really hard to believe that that many bits toggling would yield something useful. We can confirm this with the lab.

vogelpi commented 3 years ago

FYI: @vrozic has been investigating the masking PRNG inside AES. If I remember correctly (please correct me if I am wrong Vladimir), the main outcome was:

For any PRNG we should check if there is a way for software to observe the value. If this is the case, we need to construct the PRNGs in a different way such that it's not possible to infer previous states or predict future states. Vladimir had some ideas here.

msfschaffner commented 3 years ago

Does it maybe make sense to go over this table in one of our security meetings in order to decide on the mechanisms and resolve this? @felixmiller @moidx

tjaychen commented 3 years ago

yeah either use the security syncs or the security working groups. I don't feel like any of the decisions here are very controversial, so we can probably move quickly.

msfschaffner commented 3 years ago

I have updated the table above according to today's discussions.

we have the following action items on the design side:

in terms of open items TBD in an upcoming silicon meeting, we have:

msfschaffner commented 3 years ago

Ok the KMAC and KEYMGR have been taken care of.

@vogelpi @tomroberts-lowrisc @imphil @vrozic any updates regarding Ibex, AES, OTBN?

tomeroberts commented 3 years ago

The updates are done in the Ibex repo. I'll tick this off once the latest version of Ibex has been vendored in.

msfschaffner commented 3 years ago

Great, thanks!

tjaychen commented 3 years ago

Just a note in case it wasn't clear to everyone: prim_lfsr now has the ability to configure all of these options, so you won't have to add additional logic outside to support it.

msfschaffner commented 3 years ago

OK this issue has been updated. The only outstanding point now is that question about KMAC key clearing. CC @eunchan @tjaychen

vogelpi commented 2 years ago

The proposed changes have now also been implemented for AES (aligned permutation before S-Box layer, span second permutation across all parallel LFSRs).
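
For readers unfamiliar with the "LFSR + permutation + 4bit Sbox layer" construction, a scaled-down sketch of such an output stage follows. It is an illustration only: the 16-bit width and the bit permutation are hypothetical, and the PRESENT cipher S-box is used simply as a well-known 4-bit S-box, without claiming that it is the one used in the RTL.

```python
WIDTH = 16

# PRESENT cipher 4-bit S-box, used here as an example of a 4-bit S-box.
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

# Hypothetical fixed bit permutation: output bit i takes input bit PERM[i].
# 5 is coprime with 16, so this visits every bit position exactly once.
PERM = [(5 * i + 3) % WIDTH for i in range(WIDTH)]


def permute(x: int) -> int:
    """Rearrange the bits of x according to PERM."""
    return sum(((x >> PERM[i]) & 1) << i for i in range(WIDTH))


def sbox_layer(x: int) -> int:
    """Apply the 4-bit S-box independently to every nibble of x."""
    out = 0
    for nibble in range(WIDTH // 4):
        out |= SBOX4[(x >> (4 * nibble)) & 0xF] << (4 * nibble)
    return out


def prng_output(lfsr_state: int) -> int:
    """Output stage: permutation first, then the S-box layer."""
    return sbox_layer(permute(lfsr_state))
```

Both steps are bijections, so the output stage does not reduce the number of distinct values the LFSR produces; it only makes the relation between neighbouring output bits non-linear.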

msfschaffner commented 2 years ago

Thanks @vogelpi. Looks like the only two outstanding points here are the question about the KMAC clearing mechanism and the new OTBN PRNG implementation.

mwbranstad commented 2 years ago

This issue has been open for 20 months, but has continuously been updated - impressive! Since this thread is so long now, can we close this issue and break out any remaining issues into new specific ones?

msfschaffner commented 2 years ago

Yes indeed, I believe all items here have been addressed & the remaining action item regarding KMAC clearing has been spun out into a separate issue (https://github.com/lowRISC/opentitan/issues/9041).