[AES] Questions regarding AES implementation, and documentation suggestions

silvestrst commented 4 years ago

I am working on DIF AES. After reading through the OpenTitan AES documentation, I have several questions, and some suggestions where the documentation could potentially be made more descriptive.

What is the exact behaviour of the aes.STATUS.IDLE flag?
- If there is unread data in the aes.DATA_OUTn registers, will the flag be set?
- Does the behaviour of the flag differ between automatic and manual operation mode?
- Does IDLE also cover the STALL condition?
I think this flag could be annotated better to explain exact behaviour.
Is my understanding that the S-box table values are constant for AES correct? If it is, what does the specification mean by "20 S-Boxes are used" in "SubBytes / S-Box" section? I guess that is because SubBytes operation is done per byte, and in order to do the whole input data block at once in parallel, you would need to dup combinational logic blocks for every byte (16) + 4 for key expansion? If this is correct maybe it is worth explaining it in one short sentence?
"Support for AES-192 can be removed to save area, and is enabled/disabled using a compile-time Verilog parameter" - the documentation only mentions 192-bit AES, does this mean that OpenTitan is always compiled with the AES-256 support? If this is the case, it would be useful to explain why the registers and logic that is required for AES-256 is not sufficient to support AES-192?
For 128-bit AES only 4 KEY registers are used (similar for 192-bit), is it enough to load the rest of the registers with 0 (every KEY register is required to be written to, even if is not used), or should the unused registers be loaded with a random number? I see that the existing OpenTitan AES driver just fills unused registers with 0, but just wanted to double check if this is a correct behaviour.
I understand that all the below information is described in this AES documentation, however it would be easier for the developer if we also has simple annotations inline with the register and respective register bits.
- aes.STATUS.STALL, would be useful to annotate that it is only relevant in terms of automatic operation mode, and has no effect during the manual operation mode.
Datapath Architecture and Operation section is very informative, it nicely describes how the peripheral works. However, I think it would be easier to read if instead of having a single list of actions and difference between modes described in italics - to split out this list in three:
- ECB
- CBC
- CTR
It might be possible to have a little intro, where we explain that describe that ECB is the simplest mode where every output block is only dependent on the key and an input block. That it has a set of fundamental operations that are also used by the CBC and CTR with some additional actions and extra registers.

We can then describe the full list of operations for ECB in detail, and for CBC and CTR only describe the additional functionality and deviations from ECB.
Is aes.TRIGGER register only relevant when in manual operation mode, and also for the device de-initialisation. Are there other use-cases when you would want to clear these registers?
Decryption process details for CBC are not explained in the same details as encryption. In particular, to decipher the first block of ciphertext, we require the initial IV value. To decipher subsequent cipher text blocks, we need initial IV value + cipher text block 1, and so on...
- For the decrypt operations, after the last round, is the input value (cipher text block) added to the IV registers automatically, similar how it is done for encrypt operations, but only with input value?
- Unlikely use-case, but we would want to perform encryption on n blocks in CBC mode, and then straight after to decrypt these n blocks. Will the IV value have to be reloaded, because it has been modified during the CBC encryption? I guess a more precise question is, the original IV value is overwritten in the process of encryption/decryption?
Item 7. of Datapath Architecture and Operation is difficult to read, and would be nice if could be explained in a less compact and concise way: "Depending on the cipher mode, the output of the final round is potentially XORed with either the value in the IV registers (CBC decryption) or the value stored in the previous input data register (CTR mode), before being forwarded to the output register in the CSRs. If running in CBC mode, the IV registers are updated with the output data (encryption) or the value stored in the previous input data register (decryption)."
For CTR mode, in AES modes recommendation document, Appendix B: Generation of Counter Blocks: "The specification of the CTR mode requires a unique counter block for each plaintext block that is ever encrypted under a given key, across all messages." ... "There are two aspects to satisfying the uniqueness requirement. First, an incrementing function for generating the counter blocks from any initial counter block can ensure that counter blocks do not repeat within a given message. Second, the initial counter blocks, T1, must be chosen to ensure that counters are unique across all messages that are encrypted under the given key."
- It then talks about, how counter can be implemented in Appendix B.1 The Standard Incrementing Function: "In general, given the initial counter block for a message, the successive counter blocks are derived by applying an incrementing function. As in the above specifications of the modes, n is the number of blocks in the given plaintext message, and b is the number of bits in the block. The standard incrementing function can apply either to an entire block or to a part of a block." In my opinion it would be useful to document the behaviour of the Incrementing Function in this AES HW, as that also affects the format of the initial counter value and how it is calculated (providing I have read the specification correctly).
- I guess this question is a continuation of the previous bullet point, but I don't quite understand the following: "Internally, the counter module uses one 16-bit counter, meaning it requires 8 clock cycles to increment the 128-bit counter value stored in the IV register."
Does "De-Initialization" described in the DIF document need to be performed before device re-configuration, or only at the very end when the system does not need this peripheral anymore?
AES modes recommendation document, Appendix C states: "For the CBC and CFB modes, the IVs must be unpredictable. In particular, for any given plaintext, it must not be possible to predict the IV that will be associated to the plaintext in advance of the generation of the IV." I might be mistaken, but doesn't this mean that in future it would involve getting this from a random number generator? The Cipher mode specification mentions two methods, and If I understand correctly, and these either involve sideloading the IV or extending the AES to generate an appropriate IV? If my understanding is correct, I think it should be documented.
Do I read the documentation correctly - in MANUAL_OPERATION mode, regardless of the status of aes.DATA_IN, aes.DATA_OUT and aes.IV (even if all the registers have not been written or read) setting aes.TRIGGER.START will trigger the encrypt/decrypt operation? However, every aes.KEY MUST be written at least once regardless of the operation mode (automatic or manual)? Reference: _"By setting the MANUALOPERATION bit in CTRL to 1, the AES unit can be operated in manual mode. In manual mode, the AES unit starts encryption/decryption whenever the START bit in TRIGGER is set to 1, irrespective of the status of the IV and input data registers."
aes.STATUS.INPUT_READY, does the AES only load the input data into the internal state, ONLY when it has finished current encrypt or decrypt operation on the previous block of data, and the aes.STATUS.OUTPUT_VALID condition becomes true?

cdgori commented 4 years ago

I can only really address 2 of these (and sort of a 3rd):

Is my understanding that the S-box table values are constant for AES correct? If it is, what does the specification mean by "20 S-Boxes are used" in "SubBytes / S-Box" section? I guess that is because SubBytes operation is done per byte, and in order to do the whole input data block at once in parallel, you would need to dup combinational logic blocks for every byte (16) + 4 for key expansion? If this is correct maybe it is worth explaining it in one short sentance?

The answer you arrived at yourself is correct. 16 S-Boxes are for full parallel SubBytes, 4 are for key expansion. In the very top section ("Features"), the documentation indicates a target of "medium performance (16 parallel S-Boxes, ~1 cycle per round)" - lower performance would use 8 (or 4, or even just 1) S-Box and serialize the SubBytes operation (increasing the cycles per round), while higher performance would have up to 16*10 rounds = 160 (or 16*14 rounds = 224) S-Boxes to enable higher SubBytes throughput, e.g. for high-speed streaming. The S-Boxes are typically the biggest area consumers for an AES implementation, so the choice of how many to have is pretty fundamental.

The documentation (here) is trying to document some of the hardware design decisions / trade-offs and so perhaps might not be as useful to someone implementing the software, but would be very helpful for a crypto hardware reviewer to understand what she is looking at.

"Support for AES-192 can be removed to save area, and is enabled/disabled using a compile-time Verilog parameter" - the documentation only mentions 192-bit AES, does this mean that OpenTitan is always compiled with the AES-256 support? If this is the case, it would be useful to explain why the registers and logic that is required for AES-256 is not sufficient to support AES-192?

AES-192 (at the hardware level) is understood to be quite different compared to the other two. AES-256 architecturally looks like a "big brother" of AES-128, but AES-192 is like a "cousin" or "step-sister" (the analogy is admittedly not perfect) - basically the AES-128 and 256 can be implemented to share the majority of their datapath and the 192 requires more special cases. Personally I think it is OK for our documentation to assume familiarity with things like this, but we may decide we want to provide longer explanations.

Overall (and based on these comments/questions), perhaps we do need to restructure the documentation slightly to separate the users-guide and hardware-design-decisions type of material? The DIF-writer (@silvestrst) probably needs to understand both though.

Also, I should comment here:

Datapath Architecture and Operation section is very informative, it nicely describes how the peripheral works. However, I think it would be easier to read if instead of having a single list of actions and difference between modes described in italics - to split out this list in three: ...

Pirmin and I (along with others) talked about the structure of this section a fair bit. I believe we also noted that if other modes get added (GCM, XTS, etc), we will need to think about how to structure the documentation to avoid a lot of text duplication but also present the operation clearly. This is a pretty hard thing to balance, so comments/suggestions are appreciated here.

silvestrst commented 4 years ago

@cdgori thank you for your explanation.

To avoid any confusion, as I now see that the last sentence might be confusing. I didn't mean to condense this information in a single sentence. I meant expanding the explanation slightly to explicitly say that the S-boxes logic is duplicated for parallelism. I don't deal with RTL often, so at first failed to "connect the dots", eventually though arrived to the right conclusion as you have confirmed.

I find this section very useful for the overall understanding how AES peripheral works, and actually think that the comparison between low/medium/high performance that you gave in your answer useful for better understanding. Maybe it is worth adding to the documentation as well?

The answer you arrived at yourself is correct. 16 S-Boxes are for full parallel SubBytes, 4 are for key expansion. In the very top section ("Features"), the documentation indicates a target of "medium performance (16 parallel S-Boxes, ~1 cycle per round)" - lower performance would use 8 (or 4, or even just 1) S-Box and serialize the SubBytes operation (increasing the cycles per round), while higher performance would have up to 1610 rounds = 160 (or 1614 rounds = 224) S-Boxes to enable higher SubBytes throughput, e.g. for high-speed streaming. The S-Boxes are typically the biggest area consumers for an AES implementation, so the choice of how many to have is pretty fundamental.

Is my understanding that the S-box table values are constant for AES correct? If it is, what does the specification mean by "20 S-Boxes are used" in "SubBytes / S-Box" section? I guess that is because SubBytes operation is done per byte, and in order to do the whole input data block at once in parallel, you would need to dup combinational logic blocks for every byte (16) + 4 for key expansion? If this is correct maybe it is worth explaining it in one short sentence?

silvestrst commented 4 years ago

I apologise for this "essay" like size of this issue.

cdgori commented 4 years ago

I apologise for this "essay" like size of this issue.

I wouldn't apologize, I'm equally guilty of doing such things. :)

But, it actually does prompt me to think, in general should we favor splitting such an issue up into multiple issues so they can be resolved in parallel threads? I'm not sure of best practice here, since I think a lot of the topics you identified are documentation clarifications and should probably be bundled together in one pass through the docs.

vogelpi commented 4 years ago

Thanks @silvestrst for opening up the issue and @cdgori for your replies.

I am still catching up on various things but hopefully will be able to get back to this thread later during the week. Ideally, we can resolve the simple points directly here and I'll probably open up parallel threads for some of the tougher ones.

imphil commented 4 years ago

in general should we favor splitting such an issue up into multiple issues so they can be resolved in parallel threads?

General advise: Yes. Don't worry about the number of issues. One issue == one issue, as small as it might be. And it gives you increased gratification by closing more issues :smile:

vogelpi commented 4 years ago

Hi @silvestrst , I went through all the questions and pushed a PR to clarify some of the points. In the following I am answering all questions that can be easily answered. For some, I will create a new issue.

vogelpi commented 4 years ago

What is the exact behaviour of the aes.STATUS.IDLE flag? If there is unread data in the aes.DATA_OUTn registers, will the flag be set?

The behavior of this flag is unrelated to any of the registers. This flag is set unless either:

the cipher core is running,
the module is busy with clearing registers,
the module is busy with reseeding the PRNG.

Does the behaviour of the flag differ between automatic and manual operation mode?

No.

Does IDLE also cover the STALL condition?

Yes, if the cipher core is stalled (meaning the cipher core is running but cannot finish the current operation because previous output values have not been consumed by software), IDLE will be 0.

I think this flag could be annotated better to explain exact behaviour.

I agree. I will update the hjson + doc, see the linked PR.

vogelpi commented 4 years ago

Is my understanding that the S-box table values are constant for AES correct? If it is, what does the specification mean by "20 S-Boxes are used" in "SubBytes / S-Box" section? I guess that is because SubBytes operation is done per byte, and in order to do the whole input data block at once in parallel, you would need to dup combinational logic blocks for every byte (16) + 4 for key expansion? If this is correct maybe it is worth explaining it in one short sentence?

I tried to clarify this in the doc.

vogelpi commented 4 years ago

"Support for AES-192 can be removed to save area, and is enabled/disabled using a compile-time Verilog parameter" - the documentation only mentions 192-bit AES, does this mean that OpenTitan is always compiled with the AES-256 support? If this is the case, it would be useful to explain why the registers and logic that is required for AES-256 is not sufficient to support AES-192?

As @cdgori mentioned, AES-128 and AES-256 are much more similar. AES-192 requires a lot of additional muxing in the KeyExpand step (because 192 is not a power of two, you the key expanding is less regular and you need more multiplexers with more inputs). This is briefly explained in the KeyExpand section: "Key expanding is the only operation in the AES unit for which the functionality depends on the selected key length. Having a KEM that supports 128-bit key expansion, support for the 256-bit mode can be added at low overhead. In contrast, the 192-bit mode requires much larger muxes. Support for this mode is thus optional and can be enabled/disabled via a design-time parameter."

vogelpi commented 4 years ago

For 128-bit AES only 4 KEY registers are used (similar for 192-bit), is it enough to load the rest of the registers with 0 (every KEY register is required to be written to, even if is not used), or should the unused registers be loaded with a random number? I see that the existing OpenTitan AES driver just fills unused registers with 0, but just wanted to double check if this is a correct behaviour.

From a functionality point of view, there is no difference in writing 0 or a random value. But I think from a security point of view, a random number would be better. I will update the spec.

vogelpi commented 4 years ago

I understand that all the below information is described in this AES documentation, however it would be easier for the developer if we also has simple annotations inline with the register and respective register bits. aes.STATUS.STALL, would be useful to annotate that it is only relevant in terms of automatic operation mode, and has no effect during the manual operation mode.

I agree and updated will update the spec.

vogelpi commented 4 years ago

Is aes.TRIGGER register only relevant when in manual operation mode, and also for the device de-initialisation. Are there other use-cases when you would want to clear these registers?

The aes.TRIGGER.START bit is relevant only when in manual operation mode. All the other triggers are relevant for device de-initialization. aes.TRIGGER.PRNG_RESEED may also become relevant during normal operation. Even while a encryption/decryption is running, software may request a PRNG reseed to "refresh" the randomness. At the moment, this is not yet very relevant as the cipher core does not employ masking (i.e. where most randomness is going to be used). However, as the security hardening works continues, this might change and I will document this accordingly in the future.

vogelpi commented 4 years ago

Decryption process details for CBC are not explained in the same details as encryption. In particular, to decipher the first block of ciphertext, we require the initial IV value. To decipher subsequent cipher text blocks, we need initial IV value + cipher text block 1, and so on... For the decrypt operations, after the last round, is the input value (cipher text block) added to the IV registers automatically, similar how it is done for encrypt operations, but only with input value?

The input value is not "added" to the IV registers, but stored in the IV registers after the last round. This is already mentioned in the spec (see Datapath Architecture and Operation): "If running in CBC mode, the IV registers are updated with the output data (encryption) or the value stored in the previous input data register (decryption)."

Unlikely use-case, but we would want to perform encryption on n blocks in CBC mode, and then straight after to decrypt these n blocks. Will the IV value have to be reloaded, because it has been modified during the CBC encryption? I guess a more precise question is, the original IV value is overwritten in the process of encryption/decryption?

That's a good point. You are correct, the original IV value is overwritten in the process of encryption/decryption (this is already stated in the spec). Consequently, whenever a new message is started, the corresponding IV value has to be written by software. I will update the spec accordingly to clarify this.

vogelpi commented 4 years ago

For CTR mode, in AES modes recommendation document, Appendix B: Generation of Counter Blocks: "The specification of the CTR mode requires a unique counter block for each plaintext block that is ever encrypted under a given key, across all messages." ... "There are two aspects to satisfying the uniqueness requirement. First, an incrementing function for generating the counter blocks from any initial counter block can ensure that counter blocks do not repeat within a given message. Second, the initial counter blocks, T1, must be chosen to ensure that counters are unique across all messages that are encrypted under the given key."

It then talks about, how counter can be implemented in Appendix B.1 The Standard Incrementing Function: "In general, given the initial counter block for a message, the successive counter blocks are derived by applying an incrementing function. As in the above specifications of the modes, n is the number of blocks in the given plaintext message, and b is the number of bits in the block. The standard incrementing function can apply either to an entire block or to a part of a block."

In my opinion it would be useful to document the behaviour of the Incrementing Function in this AES HW, as that also affects the format of the initial counter value and how it is calculated (providing I have read the specification correctly).

Thanks for raising this @silvestrst. I need to update the spec to say that our counter follows the standard incrementing function with a fixed parameter of m=128 (and add a reference to the doc). This is an important missing detail. However, I am against further documenting the standard incrementing function in our spec as it means duplication. This is best looked up in the referenced document which by the way is not AES specific but holds recommendations for any block ciphers.

I guess this question is a continuation of the previous bullet point, but I don't quite understand the following: "Internally, the counter module uses one 16-bit counter, meaning it requires 8 clock cycles to increment the 128-bit counter value stored in the IV register."

This is an implementation detail. Since we need a new counter value every 12 cycles at most (AES-128, 16 for AES-256) there is no point in doing the increment in a single cycle. Instead, we can implement the counter iteratively which allows for some small area savings.

vogelpi commented 4 years ago

Does "De-Initialization" described in the DIF document need to be performed before device re-configuration, or only at the very end when the system does not need this peripheral anymore?

The de-initialization should be performed whenever the AES unit is handed over to a different process/service. The reason is that it must be ensured that internal registers are cleared. Otherwise, the new user could try to infer the key or data of the previous user (even though most registers are write-only). I think this might be something initiated by the OS/driver.

vogelpi commented 4 years ago

AES modes recommendation document, Appendix C states: "For the CBC and CFB modes, the IVs must be unpredictable. In particular, for any given plaintext, it must not be possible to predict the IV that will be associated to the plaintext in advance of the generation of the IV." I might be mistaken, but doesn't this mean that in future it would involve getting this from a random number generator? The Cipher mode specification mentions two methods, and If I understand correctly, and these either involve sideloading the IV or extending the AES to generate an appropriate IV? If my understanding is correct, I think it should be documented.

My understanding is that even if the IV is going to be generated using a PRNG, this won't be done by the AES HW unit itself as the IV needs to be communicated to the receiver via a secure channel. I think this is a higher-level protocol thing that both HW and DIF don't need to care about.

vogelpi commented 4 years ago

Do I read the documentation correctly - in MANUAL_OPERATION mode, regardless of the status of aes.DATA_IN, aes.DATA_OUT and aes.IV (even if all the registers have not been written or read) setting aes.TRIGGER.START will trigger the encrypt/decrypt operation? However, every aes.KEY MUST be written at least once regardless of the operation mode (automatic or manual)? Reference: "By setting the MANUAL_OPERATION bit in CTRL to 1, the AES unit can be operated in manual mode. In manual mode, the AES unit starts encryption/decryption whenever the START bit in TRIGGER is set to 1, irrespective of the status of the IV and input data registers."

This is a tricky one. It seems reasonable to me that software should always first write the config, the key and then IV + data. However, the hardware only checks if the key is written to determine if the decryption key needs to be derived (whenever the key changes and we are doing decryption, this needs to be done). But if software does not write the key, and provide everything else the module starts either way - independent of MANUAL_OPERATION and whether we are doing encryption or decryption. I think we want that the key regs are written at least once after a clear/reset before starting in any mode. But I need to think about how to properly implement this. I added this to my list in the tracking issue.

vogelpi commented 4 years ago

aes.STATUS.INPUT_READY, does the AES only load the input data into the internal state, ONLY when it has finished current encrypt or decrypt operation on the previous block of data, and the aes.STATUS.OUTPUT_VALID condition becomes true?

In the current implementation, the AES unit (every line = 1 clock cycle):

Finishes the current encrypt or decrypt operation.
aes.STATUS.OUTPUT_VALID reads 1. If the next input data is available, the next encrypt/decrypt is started by loading the input data into the internal state.
aes.STATUS.INPUT_READY reads 1.

I'll clarify this in the spec (see Block Operation).

vogelpi commented 4 years ago

and 9. are related and will go into a new issue. I will open it later today.

silvestrst commented 4 years ago

@vogelpi thank you very much for the detailed answers, that helps a lot! Glad some of my comments were useful

cdgori commented 4 years ago

AES modes recommendation document, Appendix C states: "For the CBC and CFB modes, the IVs must be unpredictable. In particular, for any given plaintext, it must not be possible to predict the IV that will be associated to the plaintext in advance of the generation of the IV." I might be mistaken, but doesn't this mean that in future it would involve getting this from a random number generator? The Cipher mode specification mentions two methods, and If I understand correctly, and these either involve sideloading the IV or extending the AES to generate an appropriate IV? If my understanding is correct, I think it should be documented.

My understanding is that even if the IV is going to be generated using a PRNG, this won't be done by the AES HW unit itself as the IV needs to be communicated to the receiver via a secure channel. I think this is a higher-level protocol thing that both HW and DIF don't need to care about.

I believe it's correct to say this that IV establishment is a protocol issue.

vogelpi commented 3 years ago

I am closing this issue. All questions have been answered and the feedback integrated in the hardware, software, documentation. But feel free to re-open if you don't agree.

Niyasmonp commented 10 months ago

Hi, has anyone tested the Opentitan's aes module? when I'm tested with the standard test vectors given by NIST, it is not giving the correct output cipher data. I used aes_wrap.sv for testing. Did I miss anything or any of you got correct result?. Please help, it is urgent

vogelpi commented 10 months ago

Hi @Niyasmonp , aes_wrap.sv is just meant for FI experiments, nevertheless it should produce correct output. Maybe something is wrong with the byte ordering in your setup? Anyway, there is a UVM-based DV environment which also feature a test sequence running NIST vectors. This is run in the nightly regression and tests pass. I am 100% sure everything is fine here. You can run the test locally using

util/dvsim/dvsim.py hw/ip/aes/dv/aes_masked_sim_cfg.hjson -t xcelium -i aes_nist_vectors --purge

Where the -t argument selects the simulator. Please use -t vcs for Synopsys VCS.

Niyasmonp commented 10 months ago

I tried checking each module's outputs, for example SubBytes, Shift Rows, MixColumns and Key Expand. 'SubBytes' module's output is correct. But one thing I noticed is in 'ShiftRows', when I changed its indexing inside the module, its output also comes right. For example, there is a line

// Row 0 is left untouched assign data_o[0] = data_i[0];

but here since it is in little-endian format, Row 0 should be data_i[3] right? When I changed this indexing inside this module, its output is matching with the expected output. MixCoulumn and KeyExpand I still have to check.

Is there anything I'm missing in terms of the byte ordering u mentioned? What I input is the standard NIST vectors?

Niyasmonp commented 10 months ago

I tried with the big-endian format by swapping the byte order, and now every module is giving expected output except 'MixColumns' ! Can someone shed a light on why I'm not getting expected output for 'MixColumns'? Is there still anything that I'm missing in terms of byte ordering?

Niyasmonp commented 10 months ago

Sorted, Thanks!

lowRISC / opentitan

[AES] Questions regarding AES implementation, and documentation suggestions #2382