rsd-devel / rsd

RSD: RISC-V Out-of-Order Superscalar Processor

Any plans for supporting the compressed extension? #10

Closed hyf6661669 closed 4 years ago

hyf6661669 commented 4 years ago

RVC is supposed to lessen the pressure on the icache, so it would be amazing to support it.

shioyadan commented 4 years ago

We have considered implementing the compressed extension, but we think it is relatively difficult. In particular, with this extension, 4-byte instructions can cross cache line boundaries, and we currently do not have a good way to handle this problem. Do you have any ideas regarding this issue?

hyf6661669 commented 4 years ago

Hi @shioyadan. Actually, I'm designing my own superscalar RISC-V core which supports RVC, and I find it really challenging to design the frontend. I have learned some ideas about frontend design from ariane and riscv-boom and want to share them with you.

Although RVC is supposed to reduce code size, it also leads to the unaligned icache access problem. When a 32-bit instruction crosses a cache line, we have to wait another cycle to get instr[31:16] from the next cache line, which also leads to a performance reduction.

What's more, if we want to support 16-bit instructions, fetching 64-bit data from the icache means that we might get up to 4 instructions. This makes branch prediction more difficult, because you might need a 4-read-port PHT/BHT or something similar to predict every branch instruction in the 64-bit data. However, since branch instructions only make up about 1/5~1/6 of all instructions, it is unlikely that there are even 2 branch instructions in 64 bits of fetched data. That means a 4-read-port branch prediction unit (BPU) would be very wasteful most of the time. So it is a tradeoff between BPU accuracy and FPGA resource utilization/operating frequency.

If we keep RSD's 2-way decode/dispatch, the frontend might need some modifications:

1. After getting data from the icache, do instruction alignment to find the number and the positions of the instructions in the 64-bit fetched data (a sketch of this step is given after this list).
2. Use a small branch decoder to do predecoding. The branch decoder just needs to check whether each instruction is a conditional branch/jal/jalr and whether it behaves as a CALL/RET. For conditional branches/jal/RET, we also need to check the correctness of the predicted targets (by calculating pc + offset, or from the content of the return address stack).
3. Use a 4-port-input/2-port-output instruction FIFO between the predecode stage and the decode stage.
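
For step 1, here is a minimal SystemVerilog sketch of the alignment idea, assuming a 64-bit fetch block and a flag for a 32-bit instruction whose lower half arrived at the end of the previous block; all names here are hypothetical:

// Hypothetical alignment sketch for a 64-bit fetch block.
// "leftoverValid" means the previous block ended with the lower
// 16 bits of a 32-bit instruction (the cross-boundary case).
logic [63:0] fetchData;
logic        leftoverValid;
logic [15:0] parcel    [4];
logic        instStart [4];   // parcel i begins a new instruction
logic        pending;         // currently inside a 32-bit instruction?

always_comb begin
    pending = leftoverValid;
    for (int i = 0; i < 4; i++) begin
        parcel[i] = fetchData[16*i +: 16];
        if (pending) begin
            instStart[i] = 1'b0;   // upper half of a 32-bit instruction
            pending      = 1'b0;
        end else begin
            instStart[i] = 1'b1;
            // RVC rule: low 2 bits == 2'b11 means a 32-bit instruction;
            // anything else is a 16-bit (compressed) instruction.
            pending = (parcel[i][1:0] == 2'b11);
        end
    end
    // If "pending" is still set here, instr[31:16] is in the next fetch
    // block, so the fetch unit must stall and stitch the halves together.
end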

As a result, the frontend pipeline stages for RSD may look like:

1. NextPC
2. Instruction Fetch
3. Align
4. Predecode
5. Instruction Decode
6. Register Renaming
7. Dispatch

Here we add another stage, which means another cycle of branch misprediction penalty. That's one of the reasons why supporting RVC is challenging for the frontend design.

shioyadan commented 4 years ago

Thanks for your detailed comments. I thought again about implementing the compressed extension, and I now think it can be implemented with:

  1. branch prediction that assumes all instructions are 16 bits wide, and
  2. front-end decoupling using a fetch queue (like the FIFO you mentioned).

In your proposal, branch prediction is performed according to the result of predecoding, but I think it can be implemented without predecoding by storing the type of each branch in the BTB. As described later, the targets of multiple instructions can be obtained efficiently from the BTB by using a multi-banked RAM. Therefore, assuming that all instructions are 16 bits wide, the position of a branch in a fetch group can be determined by obtaining multiple results from the BTB. (A PC in the middle of a 32-bit instruction also accesses the BTB, but it never hits.)

You assume a multi-port memory, but in general, multiple accesses to consecutive addresses can be performed by a multi-bank memory without any bank conflicts. RSD uses such mechanisms in its branch prediction (PHT and BTB) and some other parts. I think the following sources will be helpful for understanding it.

https://github.com/rsd-devel/rsd/blob/master/Processor/Src/FetchUnit/Gshare.sv
https://github.com/rsd-devel/rsd/blob/master/Processor/Src/Primitives/RAM.sv
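
As a rough illustration of the banking idea, here is a hypothetical sketch under the assumption of four banks selected by addr[2:1]; it is not RSD's actual RAM.sv structure:

// Hypothetical sketch: four consecutive halfword addresses
// (pc+0, +2, +4, +6) always map to four different banks, so one
// single-read-port RAM per bank provides 4 reads per cycle.
// The request-to-bank routing is written behaviorally here.
module BankedRead #(
    parameter int ADDR_WIDTH = 12,
    parameter int DATA_WIDTH = 32
)(
    input  logic                  clk,
    input  logic [ADDR_WIDTH-1:0] readAddr [4],  // consecutive halfword addresses
    output logic [DATA_WIDTH-1:0] readData [4]
);
    // One RAM per bank; addr[2:1] selects the bank and the
    // remaining upper bits index within the bank.
    logic [DATA_WIDTH-1:0] bank [4][2**(ADDR_WIDTH-3)];

    always_ff @(posedge clk) begin
        for (int i = 0; i < 4; i++) begin
            readData[i] <= bank[readAddr[i][2:1]][readAddr[i][ADDR_WIDTH-1:3]];
        end
    end
endmodule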

shioyadan commented 4 years ago

Here is pseudocode that implements my idea:

// BTB and PHT are logically 4-read multi-port RAMs, 
// but they can be implemented with 4-bank multi-banked RAMs.
BTB.readAddr[0] = predictedPC + 0;
BTB.readAddr[1] = predictedPC + 2;  // +16 bits
BTB.readAddr[2] = predictedPC + 4;  // +32 bits
BTB.readAddr[3] = predictedPC + 6;  // +48 bits

PHT.readAddr[0] = predictedPC + 0;
PHT.readAddr[1] = predictedPC + 2;  // +16 bits
PHT.readAddr[2] = predictedPC + 4;  // +32 bits
PHT.readAddr[3] = predictedPC + 6;  // +48 bits

---- 1 cycle later ----

// Default: fall through to the next sequential fetch block (4 x 16 bits).
// Note that there is no need to determine instruction boundaries in
// consecutive sequences in the branch prediction.
predictedPC = curPC + 4*2;
for (i = 0; i < 4; i++) begin
    // The first predicted-taken branch in the block redirects fetch;
    // untaken branches just fall through to the sequential predictedPC.
    if (BTB.out[i].hit && BTB.out[i].isBr && PHT.out[i] > 1) begin
        predictedPC = BTB.out[i].target;
        break;
    end
end

hyf6661669 commented 4 years ago

I have already studied some of RSD's source code, and I know that we can use a multi-bank RAM to implement the BTB/PHT. But we have to admit that a 4-bank RAM still costs more FPGA resources and a lower operating frequency compared to a 2-bank RAM. And as I said before, branch instructions don't occur that often, so it does not seem very cost-effective to use a 4-bank RAM for branch prediction. What if we want to increase the fetch width? For example, we need to fetch 128-bit data from the icache for a 4-way decode OoO core. Does that mean we need an 8-bank RAM? My opinion is that we can't just keep increasing the number of banks; there must be constraints somewhere. I have looked at the design in this paper:

I don't know why I can't upload the image; please have a look at Figure 5.1 in this paper:
"A Superscalar Out-of-Order x86 Soft Processor for FPGA" by Henry Ting-Hei Wong

Maybe you can use an early BTB between the NP and IF stages for block-level prediction. Block-level prediction could work like this:

In block-level prediction, we use "fetch_pc" to access the BTB. Consider 4 compressed instructions in a cache line:

0x5000: c.jalr
0x5002: c.andi
0x5004: c.add
0x5006: c.jal

If you start fetching from 0x5000, then the next fetch_pc must be the target of "c.jalr". The later instructions will also be fetched, but they are not useful. So if fetch_pc == 0x5000, its corresponding BTB entry only needs to record information about "c.jalr". If the later 3 instructions ever need to be executed, there must be a branch instruction elsewhere whose target is 0x5002, so the BTB entry for 0x5002 only needs to record information about "c.jal" (although the PC of "c.jal" is actually 0x5006).
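
As a rough sketch, a block-level BTB entry in this scheme might only need fields like these (a hypothetical layout, not from any existing design):

// Hypothetical block-level BTB entry: indexed by the block's fetch_pc,
// it records only the branch that redirects fetch out of this block,
// not every branch inside the block.
typedef struct packed {
    logic        valid;
    logic [19:0] tag;     // upper fetch_pc bits (width arbitrary here)
    logic [1:0]  brPos;   // 16-bit slot of the redirecting branch
    logic        isCall;  // pushes the return address stack
    logic        isRet;   // pops the return address stack
    logic [31:0] target;  // predicted target of that branch
} BlockBtbEntry;

In the example above, the entry for fetch_pc == 0x5000 would hold brPos == 0 and the "c.jalr" target, while the entry for fetch_pc == 0x5002 would hold brPos == 2, the slot of "c.jal" within that block.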

After instruction alignment and predecoding are finished, we know which instructions in the 64-bit fetched data are branches and what their corresponding PCs are. Then we use those PCs and the global history register to access the PHT. In this situation we might only need a 2-read-port PHT/BTB, because having 3 or more branch instructions in 64 bits of fetched data is really rare. If that does happen, we could predict only the 1st/2nd branch instructions in one cycle and the 3rd/4th in the next cycle. This method reduces the hardware requirement at the cost of some performance, so it is always very hard to find a cost-effective design.

shioyadan commented 4 years ago

Certainly, as you mentioned, multi-banking requires additional costs. In particular, for more advanced branch predictors such as TAGE, the additional cost would be much greater. As a result, it does not scale, especially when increasing the fetch width.

I completely forgot, but I have implemented the block-level branch prediction you mentioned in a cycle-accurate simulator. It is indeed a cost-effective method.

The mechanism I implemented is slightly different from the one you described. My implementation performs PHT lookup in the NP stage. That is, both the BTB and PHT are looked up/updated using the head PC of a fetch group, not the PC of an actual branch instruction.

For example, in your example code, the BTB and PHT entries for "c.jal" would be looked up and updated with PC=0x5002, the head PC of its fetch group, rather than with 0x5006.

I have confirmed by simulation that the decrease in branch prediction accuracy is very small with the mechanism I described. This allows for fast branch prediction and keeps the whole prediction within a single stage.

hyf6661669 commented 4 years ago

@shioyadan In my understanding, your method means that if the fetch width is 64 bits, we could use pc[31:3] to access the PHT/BTB, and every BTB/PHT entry should record information about the 4 possible instructions in a 64-bit fetch group. Am I right?

shioyadan commented 4 years ago

No, that's not right.

In my method, the BTB and PHT are looked up using PC[31:1] regardless of the fetch width. Each BTB/PHT entry records information about a single branch instruction only.

In this method, fetch group boundaries are determined so that each fetch group contains at most a single branch, and the last instruction in each fetch group is a branch instruction (or the group contains no branch at all). If the direction of a branch is predicted as untaken, instruction fetch still stops at the branch instruction, even if the number of fetched instructions has not reached the fetch width.

BTB and PHT are looked up using the PC of the first instruction of each fetch group. Since each fetch group contains only one branch, this is (basically) logically equivalent to looking up using the PC of the branch instruction in the fetch group. In addition, the position of a branch in each fetch group is stored in BTB.

For example, assume the following code.

    0x5000: bne TARGET1 # fetch group 1 consists of "bne"
    0x5002: add         # fetch group 2 consists of "add", "sub", and "beq".
    0x5004: sub
    0x5006: beq TARGET2
  1. In the first cycle, BTB and PHT are looked up using PC=0x5000.
  2. The BTB returns TARGET1 and the location of the branch in the fetch group. Here, the first instruction (0x5000) is a branch. So, even if the branch direction is predicted to be untaken, only this instruction is fetched in this cycle.
  3. In the next cycle, BTB and PHT are looked up using PC=0x5002.
  4. The BTB returns TARGET2 and the location of the branch, which is the third instruction (0x5006), so in this cycle three instructions (0x5002~0x5006) are fetched.
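
Putting this together, here is a minimal sketch of the next-PC selection under this scheme, assuming the BTB entry stores the branch position ("brPos", in 16-bit slots) within a fetch group; all names are hypothetical:

// Hypothetical next-PC selection for block-level prediction.
// One BTB/PHT lookup with the group's head PC predicts the whole
// group, because each group contains at most one branch.
if (BTB.out.hit) begin
    // The fetch group always ends at its branch, taken or not.
    fetchSlots  = BTB.out.brPos + 1;               // 16-bit slots consumed
    predictedPC = (PHT.out > 1)
                ? BTB.out.target                   // predicted taken
                : curPC + 2 * (BTB.out.brPos + 1); // fall through past the branch
end else begin
    // No known branch in this group: fetch up to the full width.
    fetchSlots  = FETCH_WIDTH;                     // e.g., 4 slots = 64 bits
    predictedPC = curPC + 2 * FETCH_WIDTH;
end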

I'm sorry, my explanation may be confusing. Also, some modification may be needed for the compressed extension.

I hope this helps.

hyf6661669 commented 4 years ago

@shioyadan OK, your explanation seems more reasonable, thanks!

shioyadan commented 4 years ago

I am glad to have a good discussion with you. If you get any new insights when implementing your processor, please let us know. Thanks!