sy2002 / QNICE-FPGA

QNICE-FPGA is a 16-bit computer system for recreational programming built as a fully-fledged System-on-a-Chip in portable VHDL.
http://qnice-fpga.com

ISA Change for Conditional Branches: When to change the operand (and when not to) #179

Closed MJoergen closed 4 years ago

MJoergen commented 4 years ago

The short program below leads to what I believe is incorrect assembly when compiled with optimizations (-O3).

The program is:

int a;

void f()
{  
   if (a < 2)
   {
      a=2;
   }
}

int main()
{  
   a = 1;
   f();
}  

The generated assembly for the function f() is:

.....
_f:
   move  #_a,R11
   cmp   2,@R11
   abra  @R13++,!v
   move  #_a,R12
   move  2,@R12
l4:
   move  @R13++,R15
.....

The problem as I see it is that the conditional return abra @R13++,!v updates the stack pointer, even if the branch is not taken.
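To make the failure mode concrete, here is the same code again with comments (the stack handling is my illustration; it assumes the caller has pushed a return address onto the stack via R13):

_f:
   move  #_a,R11
   cmp   2,@R11
   abra  @R13++,!v    ; conditional return: if V=0, pop the return address from the stack into the PC
   move  #_a,R12
   move  2,@R12
l4:
   move  @R13++,R15   ; unconditional return: pop the return address into the PC

If the @R13++ of the conditional abra is performed even when the branch is not taken (V=1), then R13 has already moved past the return address by the time l4 is reached, and the move at l4 loads a stale stack word into the PC.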

I discovered this problem while writing the tennis program. However, I have so far not been able to construct a short test program that actually crashes, because the compiler (in the example given here) actually inlines the function f() so the problem does not appear.

So before I contact Volker: is this really a compiler bug, or have I misunderstood something about the ISA?

bernd-ulmann commented 4 years ago

Oh, good question. Without having thought much about it (sorry that I disappeared so thoroughly - everything is going head over heels here due to the new lockdown, and "my" university has to switch to online teaching on Monday, etc.), the ABRA should not change R13 in case the condition is not met.

MJoergen commented 4 years ago

@bernd-ulmann But, isn't an instruction like

abra <label>, z

implemented using @R15++? In that case, shouldn't R15 be updated regardless of whether the branch is taken or not?

bernd-ulmann commented 4 years ago

That is nasty, indeed. In the case of a constant, R15 must be incremented since the constant would otherwise be read as an instruction if the branch/subroutine call is not taken...

The problem is that this should be true only in the special case of R15 being used to store/retrieve a constant. In all other cases a de-/increment would change the contents of the register even when the branch is not taken...

I just wonder how the emulator handles this. Did you try it in the emulator?

MJoergen commented 4 years ago

No, I have not tried it yet in the emulator, but I plan to expand the cpu_test.asm.

However, I looked in the emulator source, and found this comment:

      /* We must increment the PC in case of a constant destination address even if the branch is not taken! */
// NO, we must not since the PC has already been incremented during the fetch operation!
//      else if (source_mode == 0x2 && source_regaddr == 0xf) /* This is mode @R15++ */
//        write_register(PC, read_register(PC) + 1);

This could suggest that the behavior has indeed changed at some earlier time?

bernd-ulmann commented 4 years ago

To be honest, I don't think it has changed, it just has not been documented correctly by me. A branch/subroutine call with a constant destination address has to increment R15 even if the branch is not taken. In all other cases the operand must not be changed when the condition is not true.
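Spelled out with two sketched examples (SOME_LABEL and the !Z condition are just placeholders):

   ; constant destination: assembled as ABRA @R15++, !Z with the constant stored
   ; in the word directly following the instruction
   ABRA  SOME_LABEL, !Z    ; R15 must skip the constant word even if Z=1, otherwise
                           ; the constant would be fetched and executed as an instruction

   ; register operand as destination
   ABRA  @R13++, !Z        ; R13 must stay untouched if Z=1 (branch not taken)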

MJoergen commented 4 years ago

Ok. So I read your statement as saying that the compiler is doing the correct thing by making use of the instruction abra @R13++,!v. So there is no need to involve Volker in this discussion.

Instead, I've made the following TODO list:

So, @bernd-ulmann, will you take care of updating the documentation? Then I'll work on cpu_test.asm, the emulator, and the hardware.

sy2002 commented 4 years ago

@MJoergen I will take care of the CPU today.

@bernd-ulmann And I assume that this is not only about post-increment but also about pre-decrement?

bernd-ulmann commented 4 years ago

That is a good question. Something like ABRA @--R15, !V is not really defined behaviour, as it would jump into the address which contains the ABRA instruction itself. R15 is somewhat nasty, as it has a special meaning in the case of constants used in instructions, so IMHO only ABRA @R15++, ... should increment R15 in any case, as it would otherwise cause the processor to execute the constant following the ABRA instruction as an instruction in case the condition is not met. I would leave ABRA @--R15, ... as is, i.e. a pretty nonsensical instruction. :-)

sy2002 commented 4 years ago

@bernd-ulmann OK, thank you, Bernd. But for all other pre-decrements, such as ABRA @--R0, !V, the same logic as described above applies: only perform the pre-decrement when the condition is true (in this case: when V=0). Right?

@MJoergen Just in case you already have the new CPU test: Feel free to check it in, so that I have a testbed for the hardware ;-)

sy2002 commented 4 years ago

@MJoergen A bit off topic but still very interesting and relevant: I just wondered whether our VBCC backend is smart enough not to emit such an admittedly elegant abra @R13++,!v in case you use register bank switching within the function, because in that case you cannot just return (using abra @R13++,!v): you need to do a DECRB beforehand.
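A rough sketch (not actual compiler output) of what such a function looks like:

_g:
   incrb                  ; prologue: switch to a fresh register bank for the locals
   ...                    ; body works with R0..R7 of the new bank
   decrb                  ; epilogue: switch back to the caller's register bank
   move  @R13++,R15       ; return: pop the return address into the PC

A conditional abra @R13++,!v somewhere inside the body would skip the decrb and leave the wrong register bank active after returning to the caller.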

I was not able to provoke that behavior with your above-mentioned code snippet plus these settings: qvc tmp.c -k -O1 -speed -rw-threshold=0, because it neither needed any stack nor any register banks. The compiler seems to be too smart.

So this comment is just for the case that you'd like to explore/research this route a bit.

sy2002 commented 4 years ago

Coming back to my earlier comment https://github.com/sy2002/QNICE-FPGA/issues/179#issuecomment-719960233, where I said "today": lesson learned, never claim "today" :-) Obviously "today" did not work out.

I started to refactor the hardware CPU, since I did not want to add this ISA change on top of the meanwhile pretty convoluted CPU code base. I made some nice progress: the code is now much cleaner and it still passes cpu_test.asm - unfortunately, it does not pass int_test.asm any more. So it is still work in progress.

There are two new commits: one in develop that still passes all tests - this was just the harmless start of my refactoring and is not blocking the develop branch - and one in dev-cpu-refactor, which is where my current WIP stands. It will probably take until my next nerd Saturday before I can finish this.

MJoergen commented 4 years ago

The CPU test has been updated with a rudimentary test of ABRA combined with various addressing modes on R0. The emulator has been updated too. I'm working on adding more tests in cpu_test.asm, but the basic functionality can be tested now.

MJoergen commented 4 years ago

CPU test is now complete.

sy2002 commented 4 years ago

@MJoergen Talking about branch dev-cpu-refactor: I managed to finish my refactoring today. This is a prerequisite for implementing the new ISA behavior. The refactored code works fine: it passes cpu_test.asm (the old one, before you added the new ISA behavior) and int_test.asm. So far so good. Even though it works fine, it is not achieving timing closure, and I am a bit stunned because I do not really get why that is... I do not even know where to start. It seems so "arbitrary". Therefore I would be very glad if you could help me achieve timing closure, because you have so much more experience when it comes to that topic. We are talking about these two endpoints (see pictures):

(Two screenshots of the two failing timing endpoints were attached here.)

After that, I am very optimistic that I will be able to finalize the new ISA behavior very quickly, because the refactored design allows for a very elegant (yet to be implemented) rollback mechanism to cater for the new requirements specified by this issue.

MJoergen commented 4 years ago

Interesting. I'll look into this.

sy2002 commented 4 years ago

Looked into the implemented design of the Register File: it looks like the lower register window is not implemented as a RAM but using LUTs. No idea yet why this is.

EDIT: No, forget about that; it is implemented as RAM after all, but as a ton of "64 X 1 Quad Port Distributed RAM (RAM64M)" primitives. Will now check how this was done in the past (pre-refactoring).

MJoergen commented 4 years ago

Looked into the implemented design of the Register File: it looks like the lower register window is not implemented as a RAM but using LUTs. No idea yet why this is.

EDIT: No, forget about that; it is implemented as RAM after all, but as a ton of "64 X 1 Quad Port Distributed RAM (RAM64M)" primitives. Will now check how this was done in the past (pre-refactoring).

This is the same as in the earlier design. The reason is that the register_file uses asynchronous read, and the Block RAMs in the FPGA only support synchronous read.

MJoergen commented 4 years ago

In the register_file I'm confused that some parts (i.e. the standard registers) are updated on the falling clock edge, whereas other parts (i.e. the special registers) are updated on the rising clock edge.

I think this is what is causing the timing problem: At the falling clock edge, the standard register gets updated, causing a change in the output signal read_data1. This signal must travel through the ALU and then return to the register_file in order to update the SR at the next rising clock edge. So only half a clock cycle is available.

So my question is: Is using two different clock edges in the register_file deliberate? It is certainly counter-intuitive that some registers are updated before/after others.

MJoergen commented 4 years ago

Another thing (not related to timing): Is there really a need for having separate signals for the Upper registers (8-12) and the Special registers (13-15)? To me it seems it should be possible to just have a single signal called Upper registers with the range (8-15).

sy2002 commented 4 years ago

So my question is: Is using two different clock edges in the register_file deliberate? It is certainly counter-intuitive that some registers are updated before/after others.

Yes, this is deliberate: the whole design of the CPU is made this way: "everything" clocks in on the negative edge (including MMIO devices' registers, memory, ...). But when it comes to the program counter, status register and stack pointer, the state machine expects them to be updated on the positive edge.

I chose this design back in the day because it allowed some speed-ups here and there: when you can perform "work" on both edges of the clock, you sometimes need fewer clock cycles for certain things. (But I have not looked into the details for quite a while...)

This has always been like this, and I shied away from refactoring it because I felt it might be an even bigger refactoring task (with impacts that I cannot foresee right now) - including a slow-down of the MIPS throughput.

I think this is what is causing the timing problem:

Well, since this behavior was exactly the same in the past, I wonder why it was not a problem then, and whether there might be a way to mitigate it now without the need for this larger refactoring.

The only thing that actually changed was that I moved these three registers into the register file module - before, they were inside the CPU module. Maybe the implementation algorithm placed them physically closer to the CPU/ALU while they were still inside the CPU module? Are there maybe options to tell the tools to ignore the "modules" that are defined in VHDL while placing the physical elements?

Is there really a need for having separate signals for the Upper registers (8-12) and the Special registers (13-15)?

I thought this was necessary because registers 8-12 need to be updated on the falling edge and 13-15 on the rising edge.

MJoergen commented 4 years ago

Ok, I just made an experiment: I changed the one occurrence of rising_edge to falling_edge in the register_file and rebuilt. I was expecting timing closure, but the timing was actually much worse. So I guess you're right in the sense that we should leave the design as it is. And that includes keeping the Upper registers and the Special registers separate.

sy2002 commented 4 years ago

@MJoergen 🎆 🕺 ✌️ I did achieve timing closure (and even a bit better than it was before the refactoring) 😃

The solution was to replace "<" and ">" comparisons by checking for bit patterns. So instead of doing something like

if write_addr_i < 13 then

I am now checking for bit patterns that represent (for example) ">= 8 and < 13":

is_upper_register_wr   <= true when write_addr(3) = '1' and (write_addr(2) = '0' or (write_addr(1) = '0' and write_addr(0) = '0')) else false;
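For reference, the write_addr values 8 through 12 are 1000, 1001, 1010, 1011 and 1100 in binary: bit 3 is set for all of them, and either bit 2 is 0 (registers 8-11) or bits 1 and 0 are both 0 (register 12) - which is exactly what the expression above checks.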

I am doing that globally, and every time I need this inside a clocked process I do

if is_upper_register_wr then

I do fully understand why this is faster.

But I do not understand why such a highly sophisticated tool as Vivado does not do this kind of "easy optimization" automatically. If you look at how incredibly optimizing C compilers are nowadays (look at VBCC), it is hard to believe that VHDL synthesizers are not, and that you can gain so much by doing this kind of optimization by hand.

But it is a fact - this incident is the proof - so my lesson learned is that it is preferable to write optimized VHDL rather than readable VHDL.

Doing the above-mentioned trick (I applied it multiple times) brought me down to 9.741 ns for the worst path of the design, and then the following trick helped me to come down even further, to 9.576 ns:

Replace this:

sel_rbank_i  <= conv_integer(sel_rbank_mul) * 8;

by this:

sel_rbank_mul8(10 downto 0) <= sel_rbank(7 downto 0) & "000";
sel_rbank_i  <= conv_integer(sel_rbank_mul8);

In other words: telling the synthesizer to just "rewire" the signal: bits 7 downto 0 move up to bits 10 downto 3, and the lower three bits are filled with zeros.

P.S. I truly love your cpu_test.asm! ❤️ Running it after such an intense refactoring of the CPU gives me the good feeling that everything is still in order.

P.S.2: All this was just the prerequisite for the issue at hand, so the drawback of all of this is that I have not yet started adding the ISA change described here. This is my next step. But not today :-)

MJoergen commented 4 years ago

Wow, great work!

Vivado is a strange beast. Oftentimes it performs absolute magic with regard to optimizations, but occasionally it fails to deduce even the simplest things. It is a work of art to help Vivado produce efficient netlists!

I would still prefer to code for readability first, and only apply these magic optimizations when needed. As I said, most of the time these optimizations are not necessary.

All this trouble getting timing closure has made me start thinking about pipelining the CPU (Issue #94), but that will not be in this release!

bernd-ulmann commented 4 years ago

I just updated the programming card and added a short section stating that in the (common) case of @R15++ in an xBRA/xSUB instruction, R15 is incremented in either case. :-)

sy2002 commented 4 years ago

@bernd-ulmann What about this code:

SUB   R0, R1
ABRA @--R15, Z

According to the "treat R15 differently rule":

If Z=0 then the predecrement is done and we land at SUB R0, R1, and if Z=1, ditto?

Is the special rule that R15 is treated differently only true for post-incrementing R15, but not for predecrementing R15?

BTW: Whatever the answer is: this whole situation (the R15 exception) is the first situation where the QNICE ISA is actually not "nice" but rather convoluted :-)

sy2002 commented 4 years ago

More details about the performance increase can be found in MIPS.md.

I merged dev-cpu-refactor to develop.

Closing this issue.