
[MSG_FIFO] TL-UL interposing port for Hash #1801

Closed: eunchan closed this 2 years ago

eunchan commented 4 years ago

This is just an idea that came up during the discussion in #1800. It probably makes no sense at all :)

I have a plan to split the MSG_FIFO out into a common module shared by HMAC/SHA2 and KMAC/SHA3. So one idea came up: what if the MSG_FIFO had a TL-UL host port that could send the consumed payload on to another module?

The scenario is: at boot-up, or whenever the software verifies a payload, the data most likely needs to be consumed somewhere else as well. In the current design the software has to feed the data into the MAC and also copy it to its actual destination (for instance, read from eFlash into SRAM, then copy from SRAM into the MAC to check the signature).

But if the MSG_FIFO had a TL-UL host port, any data fed into it could be written out to a designated area after the MAC module consumes it. In that case the software only needs to copy from non-volatile memory into the MSG_FIFO, which could also be done through a DMA module.

eFlash -- {DMA reads then writes to} --> MAC --> {MSG_FIFO writes to} --> SRAM
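
A minimal software-side sketch of the two flows described above. The addresses, macro names, and functions here are made up for illustration only; they do not reflect the real OpenTitan memory map or driver API.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative addresses only; the real OpenTitan memory map differs. */
#define HMAC_MSG_FIFO ((volatile uint32_t *)0x41110800u)
#define EFLASH_SRC    ((const uint32_t *)0x20000000u)
#define SRAM_DST      ((uint32_t *)0x10000000u)

/* Today: software touches every word twice, once to feed the MAC and
 * once to place the payload at its destination. */
void hash_and_copy_today(size_t nwords) {
    for (size_t i = 0; i < nwords; i++) {
        uint32_t w = EFLASH_SRC[i];
        *HMAC_MSG_FIFO = w;  /* feed the digest engine         */
        SRAM_DST[i] = w;     /* and copy to the working buffer */
    }
}

/* Proposed: software (or a DMA) only fills the MSG_FIFO; the FIFO's
 * TL-UL host port writes each consumed word out to SRAM_DST itself. */
void hash_and_copy_proposed(size_t nwords) {
    for (size_t i = 0; i < nwords; i++) {
        *HMAC_MSG_FIFO = EFLASH_SRC[i];
    }
    /* hardware drains the hashed payload to SRAM_DST */
}
```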

CC: @tjaychen @cdgori @moidx

tjaychen commented 4 years ago

@eunchan regarding the MSG_FIFO sharing, could we wait until we've had the cipher sharing conversation with software? We should be sure that this kind of sharing is okay before implementing it.

Just to confirm, are you saying we would add the TL-UL port for writing the signature out to a designated location? I don't think adding a TL-UL port for that would be worth it, since the signature is usually pretty small (I might be eating my words for Keccak).

I feel like if we were to add a host port, its main advantage would be grabbing the data, since in that case you would only need one TL-UL transaction instead of the two a DMA needs, no?

So for MAC operations, IMO there are two main types: the first is probably hashing something that has come over one of the IOs, and the other is checking something in storage to confirm it is valid.

For the first case, I don't think a TL-UL port would have a ton of advantages, since the IO speed is so much slower (even in quad SPI it would take 128 SPI clocks to get 512b).

For the second case, I could see some benefit for giant payloads. A very efficient memcpy / DMA, I assume, would burst read maybe 8 words of data back to back, write them all out, and burst read again (I imagine this would be in assembly if it were memcpy). So it feels like this would be a 2x bus bandwidth improvement if we allowed the MAC to read directly.

So I guess this all depends a bit on how fast the operation is going vs. how fast the data injection is. The 80-cycle number you quoted before, was that assuming some kind of fast memcpy? I am actually not sure if Ibex is capable of doing back-to-back data reads even if instructed to in assembly, but if it could, would 8 reads basically be ~9 cycles, and 8 writes be about the same?
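
A rough transaction count behind the 2x estimate above, assuming one TL-UL transaction per 32b word (the no-bursting assumption is mine, not something established in this thread):

```c
/* Per-512b-block bus transactions, assuming one TL-UL transaction
 * per 32b word (no bursting assumed on TL-UL): */
enum {
  WORDS_PER_BLOCK  = 512 / 32,             /* 16 words                  */
  CORE_FEEDS_MAC   = 2 * WORDS_PER_BLOCK,  /* 16 reads + 16 writes = 32 */
  MAC_READS_DIRECT = 1 * WORDS_PER_BLOCK   /* 16 reads only        = 16 */
};
/* Halving the transactions per block is where the "2x bus bandwidth"
 * estimate comes from. */
```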

eunchan commented 4 years ago

On Fri, Mar 20, 2020 at 01:36:14PM -0700, tjaychen wrote:

> @eunchan regarding the MSG_FIFO sharing, could we wait until we've had the cipher sharing conversation with software? We should be sure that this kind of sharing is okay before implementing it.

I don't expect us to implement a separate MSG_FIFO module before the bronze delivery.

> Just to confirm, are you saying we would add the TL-UL port for writing the signature out to a designated location? I don't think adding a TL-UL port for that would be worth it, since the signature is usually pretty small (I might be eating my words for Keccak).

No. The intention is to have a TL-UL port for the message data.

> I feel like if we were to add a host port, its main advantage would be grabbing the data, since in that case you would only need one TL-UL transaction instead of the two a DMA needs, no?

The MSG_FIFO could behave as a DMA too. Yes: read the data, run the hash, then write the data to the designated location. (data := message payload, not the signature)

> So for MAC operations, IMO there are two main types: the first is probably hashing something that has come over one of the IOs, and the other is checking something in storage to confirm it is valid.

> For the first case, I don't think a TL-UL port would have a ton of advantages, since the IO speed is so much slower (even in quad SPI it would take 128 SPI clocks to get 512b).

I personally thought the data coming from the IO needs to be stored somewhere else anyway. So in the SPI case, the software reads the payload from the SPI buffer and writes it to the MSG_FIFO, then the MSG_FIFO writes the message data to a location in SRAM (or somewhere else). Is this a valid scenario?

> For the second case, I could see some benefit for giant payloads. A very efficient memcpy / DMA, I assume, would burst read maybe 8 words of data back to back, write them all out, and burst read again (I imagine this would be in assembly if it were memcpy). So it feels like this would be a 2x bus bandwidth improvement if we allowed the MAC to read directly.

> So I guess this all depends a bit on how fast the operation is going vs. how fast the data injection is. The 80-cycle number you quoted before, was that assuming some kind of fast memcpy? I am actually not sure if Ibex is capable of doing back-to-back data reads even if instructed to in assembly, but if it could, would 8 reads basically be ~9 cycles, and 8 writes be about the same?

No. The 80-cycle number is based on the current Ibex, which has no STM/LDM. So for 512b, per 32b word, Ibex has to read from a memory location (2 cycles), compute the address (1 cycle), then write to the MSG_FIFO (2 cycles). Overall it takes 5 cycles x 16 words => 80 cycles.
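
A minimal C sketch of that feed loop; the per-step cycle annotations are the 2 + 1 + 2 estimate above, not measured numbers, and the function name is illustrative only.

```c
#include <stdint.h>

/* One 512b block, fed word by word; cycle comments follow the
 * 2 + 1 + 2 estimate above and are not measured numbers. */
void feed_block(volatile uint32_t *msg_fifo, const uint32_t *src) {
    for (int i = 0; i < 16; i++) {   /* 16 x 32b words = 512b          */
        uint32_t w = src[i];         /* load word        : ~2 cycles   */
                                     /* address increment: ~1 cycle    */
        *msg_fifo = w;               /* store to MSG_FIFO: ~2 cycles   */
    }                                /* ~5 cycles x 16 words = ~80     */
}
```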


tjaychen commented 4 years ago

Okay, I read this more closely; I think I misunderstood your intent. I actually think this is a pretty neat idea, although it feels like it should be part of a DMA, where one read can actually be written to two places (the MAC and another SRAM location).

On whether we need this, I think we should look at it from the use cases.

For SPI host payload hashing, for example, I think the SPI host is likely to implement some kind of ping-pong. So if it's something like 256B buffers on each side, at quad speed it would take 512 cycles to receive all the data. If we don't have what you're describing, it would take two rounds of reads/writes to get the data to another SRAM location and also to the MAC module. That I think would be 64 * 5 * 2 cycles at least (probably more due to overhead), so it feels like reducing the number of transactions here makes sense. SPI device probably has a similar case, although the data takes MUCH longer to come in, so it may not matter.
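
The rough arithmetic behind those numbers, assuming quad SPI (4 bits per clock) and the ~5-cycles-per-word copy loop estimated earlier in the thread:

```c
/* One 256B ping-pong buffer at quad SPI, copied twice through the core
 * at the ~5-cycles-per-word estimate above: */
enum {
  BUF_BYTES       = 256,
  RX_SPI_CLOCKS   = BUF_BYTES * 8 / 4,  /* 512 SPI clocks to receive it   */
  BUF_WORDS       = BUF_BYTES / 4,      /* 64 x 32b words                 */
  TWO_PASS_CYCLES = BUF_WORDS * 5 * 2   /* copy to SRAM, then feed MAC:
                                           640 core cycles before overhead */
};
```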

The flash verification case probably doesn't make a ton of sense since we'll still execute out of flash, so there's no reason to also move it to another location.

The general data storage in flash I also think MIGHT not make sense, because anything critical stored there will usually be wrapped up, so it may need to be decrypted first before being fed to a MAC and thus could already be in SRAM (this isn't always the case, of course).

What do you think? I still think this is a good idea regardless of whether it's on the MSG_FIFO or a DMA engine, but it may not be super critical to have immediately.

If it's on the MSG_FIFO, you save another write. Do we have any scenarios elsewhere in the system where data is processed en masse but not transformed? It may be good to call a quick sync with a few software people to see if this makes sense.

sjgitty commented 4 years ago

Any updates or thoughts on this? Recall that at the moment we have no known high-throughput use cases, so optimizing for performance might not be the right tradeoff. (Read: I think we likely won't need the DMA for this version, though it is good to be thinking about it.) Area is one thing we will likely want to optimize for, but it could be that we throw all the pieces in for bronze, take stock, and then look for optimizations.


cdgori commented 4 years ago

What effective "permissions" (or other attributes) does the MSG_FIFO write have in @eunchan's diagram? Hopefully the same permissions as the initial DMA requester? (We haven't really talked about DMA permissions much yet either, though.)

Could that MSG_FIFO write ever accidentally write somewhere the original requester wasn't supposed to be able to? Could someone use this to either overwrite a secret (causing a Hamming weight side-channel leak) or escalate permissions of some other action?

Basically any time the hardware does something "automatically" that is side-effecting I start to get very nervous, even if it is potentially a good performance optimization.

eunchan commented 4 years ago

Yeah, I worried about that part too and expected a comment like this. But at the least, we need a DMA engine to meet the throughput requirement. For the HMAC/SHA-2 case it barely meets 80MB/s at a 100MHz full core clock frequency, not with the jittery clock, and only when the core is doing nothing but pushing the message into the MSG_FIFO. If the core has to read the data from somewhere else and store it internally, the effective throughput would be much lower.
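
For reference, a sketch of where the ~80MB/s at 100MHz figure comes from, using the 80-cycles-per-512b-block software feed estimated earlier in the thread:

```c
/* SHA-2 block size and the software feed cost estimated earlier: */
enum {
  BLOCK_BYTES = 512 / 8,  /* 64 bytes per 512b block        */
  FEED_CYCLES = 80        /* core cycles to push one block  */
};
/* 64 B per 80 cycles = 0.8 B/cycle; at a 100 MHz core clock that is
 * 0.8 * 100e6 = 80 MB/s, and only while the core does nothing else. */
```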


tjaychen commented 4 years ago

So I think if we were to have this function, both the DMA and the HMAC would be fronted by an IOPMP. My expectation is that when someone sets up this function, they would also need to re-configure the IOPMP each time to restrict both the DMA and the HMAC to only the locations they should access. This still only guards against logical errors; physical glitching would still be a problem.

Eunchan, regarding your point on throughput: I guess the thing I'm confused about is this. If we did not have this feature, do you agree that the most likely data flow would be

read from mem location A -> write to mem location B
read from mem location B -> write to HMAC

whereas with your feature suggestion it would be

read from mem location A -> write to HMAC -> write to mem location B?

So we are mainly saving a read, correct? If we treat the first A -> B as just latency overhead, wouldn't the throughput of the two cases be pretty similar? Both have to read from a location and try to fill the message FIFO at the same rate, right? (I still think this is a good argument for being able to support very tightly packed transactions on the processor.)
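
A quick per-word transaction count for the two flows above, counting each TL-UL access as one transaction (a sketch, not a measured number):

```c
/* Bus transactions per 32b word for the two flows, counting each
 * TL-UL access as one transaction: */
enum {
  WITHOUT_FIFO_PORT = 2 + 2,  /* read A, write B; read B, write HMAC */
  WITH_FIFO_PORT    = 1 + 2   /* read A, write HMAC; FIFO writes B   */
};
/* Only one read is saved, so the rate at which the message FIFO fills,
 * and hence the hash throughput, is about the same in both cases. */
```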

eunchan commented 4 years ago


You are right. If the software has already put the message together, the throughput is identical.

eunchan commented 2 years ago

Let me close this issue. Given the current state of the IP, it is not worth investigating this option anymore.

tjaychen commented 2 years ago

I actually thought it was a cool idea, worthy of examination for a future release :)