lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

[top_earlgrey] Check Vivado reports after EDN and CSRNG were merged #4451

Closed imphil closed 3 years ago

imphil commented 3 years ago

By enabling the EDN and CSRNG components in top_earlgrey in https://github.com/lowRISC/opentitan/pull/4211, the FPGA bitstream build time increased by ~50 percent (after, before), likely due to Vivado trying harder to meet timing. Since this build step is already the critical path in our CI pipeline, the overall CI latency increased by roughly 10 minutes.

Overall slack is still positive at 0.013 ns. We need to take a closer look at where this increase in synthesis/P&R time is coming from, and confirm whether it's intended or whether there are easy ways to improve it.

# before without EDN/CSRNG:
wait_on_run: Time (s): cpu = 00:06:54 ; elapsed = 00:16:42 . Memory (MB): peak = 1426.355 ; gain = 0.000 ; free physical = 7028 ; free virtual = 72861

# after with EDN/CSRNG:
wait_on_run: Time (s): cpu = 00:34:17 ; elapsed = 00:25:10 . Memory (MB): peak = 1373.812 ; gain = 0.000 ; free physical = 5582 ; free virtual = 71402
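The ~50 percent figure can be checked directly from these two log lines. A quick sketch of parsing Vivado's `wait_on_run` output (`parse_wait_on_run` is just an illustrative helper, not part of any OpenTitan tooling):

```python
import re

def parse_wait_on_run(line):
    """Extract CPU and elapsed time (in seconds) from a Vivado wait_on_run log line."""
    m = re.search(r"cpu = (\d+):(\d+):(\d+) ; elapsed = (\d+):(\d+):(\d+)", line)
    if m is None:
        return None
    h1, m1, s1, h2, m2, s2 = (int(g) for g in m.groups())
    return {"cpu_s": h1 * 3600 + m1 * 60 + s1,
            "elapsed_s": h2 * 3600 + m2 * 60 + s2}

before = "wait_on_run: Time (s): cpu = 00:06:54 ; elapsed = 00:16:42 ."
after = "wait_on_run: Time (s): cpu = 00:34:17 ; elapsed = 00:25:10 ."
ratio = parse_wait_on_run(after)["elapsed_s"] / parse_wait_on_run(before)["elapsed_s"]
print(round(ratio, 2))  # -> 1.51, i.e. roughly +50% wall-clock build time
```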

@mwbranstad can you attempt to do an FPGA synthesis and look at the (timing) report(s)? Let me know if you need help with that or if you're stuck and someone else should have a look.

mwbranstad commented 3 years ago

I will do some DC synthesis runs first (under my direct control) and look for any long paths, assuming that timing might be the issue. The data paths between the blocks typically launch off a flop, and inputs are usually flopped before use, so standard practice there. It could be that just adding more logic slows it down, but I will take a look at the timing first.

mwbranstad commented 3 years ago

Results for CSRNG currently show, for an aggressive run (500 MHz clock in a 40 nm tech node), that the maximum number of logic levels is 15. I'd like feedback on whether 15 levels is reasonable or too deep.

mwbranstad commented 3 years ago

@imphil not looking good for the FPGA build. The WD flow always takes extra work to get running. In the interest of time, can someone provide Vivado logs for analysis?

mwbranstad commented 3 years ago

@imphil did this issue go away with more licenses? I see that the D2 checklist requires FPGA timing to pass. Does the CI do this automatically, or does more need to be done?

mateenfaisal commented 3 years ago

@imphil does @mwbranstad have to debug this issue? The slack is still positive. Also, the overall CI latency increase of 10 minutes does not sound drastic. But perhaps I haven't understood the consequences of this 10-minute increase in overall CI latency.

vogelpi commented 3 years ago

It's true that the slack is still positive. However, the fact that the inclusion of CSRNG + EDN led to a 50% increase in build time makes me think the critical path also grew quite substantially. While this might not be a real problem right now, it could become one in the future. Maybe a much simpler change, like hooking up another device to CSRNG + EDN, will tip the design over the edge, and it will then be hard for that person to figure out the exact issue.

@mwbranstad The 15 logic levels on ASIC are not problematic but on FPGA the story is different. I've had a quick look at the timing summary in Vivado and the 9 most critical paths on the main clock are all inside CSRNG, EDN or the entropy source.

[image: csrng_critical_paths]

The problem is that there are a couple of very wide muxes in the design, some of them in series, and the FPGA tool can't optimize/combine them. For example, the most critical path goes from

u_csrng_core/u_csrng_ctr_drbg_upd/u_prim_fifo_sync_bencreg/gen_normal_fifo.storage_reg

into

u_csrng_core/u_csrng_block_encrypt/u_aes_cipher_core/key_full_q

This is the path of the key. Starting from the destination inside u_aes_cipher_core, there is a 3-input mux. Going up, there is a 2-input mux inside u_csrng_core. Then there is another 2-input mux inside u_prim_fifo_sync_bencreg. This doesn't sound like a lot, but all these muxes are 256 bits wide, which means a lot of routing on the FPGA. Now, the important point here is that the first two muxes are necessary and functionally important. However, the third mux inside u_prim_fifo_sync_bencreg doesn't seem necessary. If you provide the parameter OutputZeroIfEmpty = 1'b0 (currently the default of 1 is used), this mux goes away. I don't know if you really need that behavior, or if this parameter maybe got added / the default changed later in the primitive. But it looks like a very cheap way to save resources as well as logic and routing delay on the FPGA. The same is true for most of the other critical paths: they go through a prim_fifo_sync with the default OutputZeroIfEmpty parameter set to 1'b1.

Could you please check all instances of this FIFO type in your design and reason about whether the OutputZeroIfEmpty parameter could be set to 1'b0, at least on FPGA?
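For readers unfamiliar with the parameter, here is a toy behavioral model (plain Python, not the actual SystemVerilog primitive) of what OutputZeroIfEmpty does to the read-data output, assuming the FIFO behaves as described above:

```python
class PrimFifoSyncModel:
    """Toy one-entry model of a synchronous FIFO's read-data output.

    With output_zero_if_empty=True, an extra output mux forces rdata to
    zero whenever the FIFO is empty (for a 256-bit FIFO that is a 256-bit
    wide mux on the read path). With False, the raw storage word at the
    read pointer is exposed directly, so that mux disappears -- but stale
    data remains visible after a pop, which is the security trade-off.
    """
    def __init__(self, output_zero_if_empty=True):
        self.output_zero_if_empty = output_zero_if_empty
        self.storage = 0
        self.empty = True

    def push(self, word):
        self.storage = word
        self.empty = False

    def pop(self):
        self.empty = True  # storage is not cleared, only marked empty

    @property
    def rdata(self):
        if self.output_zero_if_empty and self.empty:
            return 0  # the extra output mux modelled here
        return self.storage

fifo = PrimFifoSyncModel(output_zero_if_empty=False)
fifo.push(0xDEADBEEF)
fifo.pop()
print(hex(fifo.rdata))  # -> 0xdeadbeef: previous data still readable when empty
```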

tjaychen commented 3 years ago

hey Pirmin, if you have the reports handy, would you mind adding some of them so we can have a quick look? did anyone also get a chance to compare the utilization before and after on FPGA? I wonder if EDN / CSRNG just pushed us towards the edge, and that's part of the problem too.

mwbranstad commented 3 years ago

@vogelpi thanks for doing this analysis! It was on my list, but I can't generate these reports, so it's hard to debug. OutputZeroIfEmpty = 1'b0 should work for all uses of this FIFO and should not have any ill effects; this is an easy switch to pull. In fact, we should consider making this the default, as that is how most FIFOs of this type typically work. I am going to try to "re-create" this in DC, at least to find the long paths.

vogelpi commented 3 years ago

You're welcome @mwbranstad . I tried to generate the reports on Friday but without success, as I was looking at the setup time; the hold time is more critical. Here are the reports:

@mwbranstad Regarding setting the OutputZeroIfEmpty parameter to 0: we shouldn't set it to 0 by default, as this could be used to extract previous data, which may be very bad in some contexts. This is also the reason why we added the parameter and set it to 1 by default; for more information, check out #2315. Instead, we should judge on a per-case basis what the most sensible solution is. I think for CSRNG, setting it to 0 is preferred. My impression is that it wouldn't be possible to extract a secret out of CSRNG that way (downstream, CSRNG output is used as randomness anyway), and it could help us save quite a bit of logic.

@tjaychen I didn't regenerate the utilization reports from before adding CSRNG/EDN, so I can't really say; altogether they are about the same size as one of the bigger blocks (OTBN). But the increase in synthesis/implementation time reported by @imphil suggests that the addition of CSRNG/EDN indeed makes a big difference.

vogelpi commented 3 years ago

Update: @mwbranstad did PR #5341 to remove the unnecessary output muxes and do some timing optimizations inside CSRNG. This had quite an impact: -7.2 kGE on ASIC (-6.5%), and -1.2 kLUT on FPGA (-18%!). That's nice work! At the same time, I merged some FI hardening changes in AES that increased the area; without those, Mark's savings would be even higher.

In terms of synthesis/implementation time, things didn't really improve though. However, I don't think this can be expected anymore, as the FPGA is pretty full now. In other words, with the current config the long build time is not a result of CSRNG/EDN alone. Looking at the most critical hold time paths, we see that they are now more evenly split across the design. Among the 10 most critical paths, there are paths inside CSRNG, ES, OTBN, PLIC and HMAC:

[image: fpga_new]

This is much more reasonable and I would say we can close this issue.

imphil commented 3 years ago

Yeah, the synthesis time was just an indication that something in the design got worse, it's not a problem in itself.

Thanks for digging into this @vogelpi and thanks for the fix @mwbranstad. It shows that we need to carefully monitor our metrics and make sure we understand the reason when they change. Reducing the ASIC design by 7 kGE isn't a bad win.

cdgori commented 3 years ago

> Yeah, the synthesis time was just an indication that something in the design got worse, it's not a problem in itself.
>
> Thanks for digging into this @vogelpi and thanks for the fix @mwbranstad. It shows that we need to carefully monitor our metrics and make sure we understand the reason when they change. Reducing the ASIC design by 7 kGE isn't a bad win.

@imphil - I had a conversation with @msfschaffner / @tjaychen more than a year back now about tracking the day-over-day synthesis results in table or graphical form (we were speaking about DC, but Vivado would also be useful/applicable), for exactly the reasons you state.

I just don't know of any existing framework that would make this easy - in all my previous cases we had to cook up some scripts to compile the relevant data and publish to a web page. :(

imphil commented 3 years ago

> I just don't know of any existing framework that would make this easy

Good timing! I thought about the same thing yesterday. I have been looking for something of that kind for a while now and never found the right keywords, when the solution is rather simple: just use standard service-monitoring metrics platforms, like Grafana or (more basic) Graphite (https://graphiteapp.org/). I have used both in my past sysadmin life; I just never made the connection that they are actually well suited for pushing synthesis/performance measurement metrics and drawing them over time.
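For context, Graphite's Carbon daemon accepts samples over a trivial plaintext TCP protocol: one `metric value timestamp` line per sample, on port 2003 by default. That makes pushing build metrics from CI almost a one-liner. A minimal sketch; the metric name and host are made up for illustration:

```python
import socket
import time

def graphite_line(metric, value, timestamp=None):
    """Format one sample in Graphite's plaintext protocol:
    '<metric.path> <value> <unix_timestamp>\\n'."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{metric} {value} {ts}\n"

def push_metric(metric, value, host="localhost", port=2003):
    """Send one sample to a Carbon receiver (plaintext listener, default port 2003)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(graphite_line(metric, value).encode("ascii"))

# e.g. after a CI run (hypothetical metric naming scheme):
line = graphite_line("opentitan.fpga.earlgrey.build_elapsed_s", 1510, timestamp=1613760000)
print(line, end="")
```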

I did a quick prototype yesterday with Graphite and it could do the trick. If nobody else beats me to it, I'll continue looking into it in free minutes.

cdgori commented 3 years ago

Well, look at that. That's pretty much what we want :)

I had used some round-robin type graphing tools as well in my sysadmin/cluster-monitoring days, I didn't make the connection either. Great idea.

msfschaffner commented 3 years ago

That's a cool idea @imphil! We've had that graphing feature on our wishlist for dvsim reports for some time now, just never got around to implementing it.