kaist-cp / cs492-uarch


HazardFlow pipelined CPU structure with regard to lab2 (branch predictor) #25

Open JongyCysec opened 1 day ago

JongyCysec commented 1 day ago

I've tried to sketch the structure of the first three CPU pipeline stages (fetch, decode, exe) as implemented in the HazardFlow design, with regard to lab2 (branch predictor).

I'm not sure whether all the comments and types in the sketch are valid. I hope it helps other students, and it would be very helpful if someone pointed out errors or modified/complemented the structure.

[Image: sketch of the fetch, decode, and exe pipeline stages]

I'm leaving the URL here for modification/complementation.

https://app.diagrams.net/#G1-Ozeu2-2i138ylAvP2cd_VND89gwboCG#%7B%22pageId%22%3A%22C0vsTpqooSxowkVAgLYY%22%7D

minseongg commented 1 day ago

First of all, you did a great job! Thank you for your hard work.

One point I would like to mention is that the branch predictor update signal is not shown in the diagram. Since fsm_map updates its state based on the ingress payload and the current state, the branch predictor update signal should be passed in the ingress payload of M4. To do this, the payload signals between M1 and M4 should contain HOption<BpUpdate> in their type (e.g., (u32, HOption<BpUpdate>) instead of u32).
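A minimal plain-Rust sketch of this idea, not actual HazardFlow code: `Option` stands in for `HOption`, and the fields of `BpUpdate`, the `Bht` state, and `m4_step` are hypothetical names chosen for illustration. It models the function handed to an fsm_map-like stage as a pure mapping from (ingress payload, current state) to (egress payload, next state).

```rust
/// Hypothetical update record produced by the exe stage.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BpUpdate {
    pub pc: u32,
    pub taken: bool,
}

/// Hypothetical predictor state: four 1-bit "last outcome" entries.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bht {
    pub taken: [bool; 4],
}

/// Maps a word-aligned pc to a table index (assumed indexing scheme).
fn index(pc: u32) -> usize {
    ((pc >> 2) & 0b11) as usize
}

/// Models the transition function of an fsm_map-like stage:
/// (ingress payload, state) -> (egress payload, next state).
/// The update rides along in the payload as Option<BpUpdate>.
pub fn m4_step(payload: (u32, Option<BpUpdate>), state: Bht) -> ((u32, bool), Bht) {
    let (pc, update) = payload;
    // Predict using the state as it is at the start of this cycle.
    let prediction = state.taken[index(pc)];
    // Apply the update, if any, to produce the next state.
    let mut next = state;
    if let Some(u) = update {
        next.taken[index(u.pc)] = u.taken;
    }
    ((pc, prediction), next)
}
```

Because the update arrives as part of the payload, the stage needs no access to its egress resolver in order to change its state.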

We have improved the description related to this in Section 2.4. Please check the updated description for more details!

JongyCysec commented 1 day ago

@minseongg

Thank you for your sincere feedback on the structure above.

Question 1. About Branch Predictor Update Signal

Misunderstanding.

Regarding your feedback on the branch predictor update signal: I had misunderstood its route. At first, I thought that M4 would handle BpUpdate (from exe M0) through its egress resolver, of type DecR. (Assume that we add a BpUpdate member to the DecR and ExeR structures.) So I did not draw the flow of BpUpdate from M1 to M4.

However, M4 is an fsm_map, so it only processes the ingress payload and simply passes the egress resolver through to the ingress resolver. Therefore, M4 cannot handle BpUpdate unless it is routed to M0, sourced as payload, and delivered to M4 again.

Scenario 1.

I think it would be better if M4 could handle the update resolver signal directly instead of passing it to M0 and getting it back as payload. So I considered modifying the existing M4 (fsm_map) to handle both the ingress payload (pre-decoding and generating the branch prediction result) and the egress resolver (the update signal from exe).

In other words, can we change M4 (fsm_map) into a combination of two sub-modules, fsm_map and map_resolver_*? Then fsm_map would pre-decode the instruction and generate the branch prediction, and map_resolver_* would update the branch predictor based on the update signal from the exe stage.

At this point, I realized that such an implementation has two problems.

  1. There is no map_resolver_* combinator whose argument f takes a state (S), in contrast to the map combinator.
  2. The shared state would have to be maintained separately in the two modules. (So a state transfer from map_resolver to fsm_map would be required.)

Regarding problem 2, we could make the map_resolver module send its state to the fsm_map module so that fsm_map uses the up-to-date branch predictor state. However, there is no naked_fsm_map_resolver_* combinator that sends its state back as the ingress resolver, the way naked_fsm_filter_map does. Furthermore, the overhead of transferring the state (the branch predictor) between the two modules is not expected to be negligible.

Scenario 2.

Since managing two equivalent copies of the state (the branch predictor) is a problem, the alternative would be to use the general combinator fsm, both to generate the branch prediction result from the ingress payload and to update the branch predictor using the update signal (egress resolver) from the exe stage. The code would then be less readable.
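To make Scenario 2 concrete, here is a rough plain-Rust model of such a combined stage. It is only a sketch and does not use the HazardFlow API: `combined_step`, `Bht`, the fields of `BpUpdate`, and the 1-bit table are all hypothetical, and `Option` stands in for `HOption`. A single transition function sees the ingress payload, the egress resolver, and the state at once.

```rust
/// Hypothetical update record sent back by the exe stage.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BpUpdate {
    pub pc: u32,
    pub taken: bool,
}

/// Hypothetical predictor state: four 1-bit "last outcome" entries.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bht {
    pub taken: [bool; 4],
}

/// Maps a word-aligned pc to a table index (assumed indexing scheme).
fn index(pc: u32) -> usize {
    ((pc >> 2) & 0b11) as usize
}

/// One cycle of a hypothetical combined stage.
/// Inputs:  ingress payload (pc, if valid), egress resolver (update, if any), state.
/// Outputs: egress payload (pc and prediction), ingress resolver, next state.
pub fn combined_step(
    ingress: Option<u32>,
    resolver: Option<BpUpdate>,
    state: Bht,
) -> (Option<(u32, bool)>, Option<BpUpdate>, Bht) {
    // Forward path: predict with the state as of the start of the cycle.
    let egress = ingress.map(|pc| (pc, state.taken[index(pc)]));
    // Backward path: fold the resolver's update into the next state.
    let mut next = state;
    if let Some(u) = resolver {
        next.taken[index(u.pc)] = u.taken;
    }
    // The resolver is still forwarded upstream unchanged.
    (egress, resolver, next)
}
```

Even in this toy form, the forward and backward paths are mixed inside one function, which is the readability cost mentioned above.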

Conclusion

Hence, I've concluded that the existing design, where the update signal from the exe stage arrives at M0 (fetch), is sourced as payload, and is handled in M4, is the best option, although the update signal has to travel a longer path (exe->M4->M0->M4) to be processed.

Question 2. About the reg_fwd module in each stage.

In each stage (fetch, decode, exe), the reg_fwd combinator is used at the start, and I wonder what its main purpose or function is. (It is said that it delays sending out its state by one or more cycles. If so, could we remove some reg_fwd modules to reduce the number of cycles?)
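The mental model behind this question can be sketched in plain Rust as follows. It assumes reg_fwd behaves like a one-entry pipeline register and ignores back-pressure entirely; `PipeReg` and `tick` are hypothetical names, not HazardFlow items.

```rust
/// Hypothetical one-entry pipeline register:
/// whatever enters in one cycle leaves in the next cycle.
#[derive(Debug)]
pub struct PipeReg<T> {
    slot: Option<T>,
}

impl<T> PipeReg<T> {
    /// Starts empty, like a register reset to "invalid".
    pub fn new() -> Self {
        PipeReg { slot: None }
    }

    /// Advances one clock cycle: stores `input` and returns what was
    /// stored during the previous cycle.
    pub fn tick(&mut self, input: Option<T>) -> Option<T> {
        std::mem::replace(&mut self.slot, input)
    }
}
```

Under this model, each register adds one cycle of latency between stages, and in exchange the logic on either side of it no longer has to settle within the same cycle.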