Oh, good question. Without having thought much about it (sorry that I disappeared so thoroughly; everything is going head over heels here due to the new lockdown, and "my" university has to switch to online teaching on Monday, etc.), the ABRA should not change R13 in case the condition is not met.
@bernd-ulmann But isn't an instruction like abra <label>, z implemented using @R15++? In which case, R15 should be updated regardless of whether the branch is taken or not?
That is nasty, indeed. In the case of a constant, R15 must be incremented since the constant would otherwise be read as an instruction if the branch/subroutine call is not taken...
The problem is that this should be true only in the special case of R15 being used to store/retrieve a constant. In all other cases a de-/increment would change the contents of the register even when the branch is not taken...
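To make the hazard concrete, a minimal sketch (label invented for illustration): a branch to a constant destination occupies two words in memory, the opcode (encoded as ABRA @R15++) followed by the constant, so R15 has to skip the constant word even when the condition fails:

        ABRA    TARGET, Z       ; word 1: encoded as ABRA @R15++, Z
                                ; word 2: the constant TARGET
        MOVE    R1, R2          ; only reached as the next instruction if
                                ; R15 was incremented past the constant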
I just wonder how the emulator handles this. Did you try it in the emulator?
No, I have not tried it yet in the emulator, but I plan to expand cpu_test.asm.
However, I looked in the emulator source, and found this comment:
/* We must increment the PC in case of a constant destination address even if the branch is not taken! */
// NO, we must not since the PC has already been incremented during the fetch operation!
// else if (source_mode == 0x2 && source_regaddr == 0xf) /* This is mode @R15++ */
// write_register(PC, read_register(PC) + 1);
This could suggest that the behavior has indeed changed at some earlier time?
To be honest, I don't think it has changed; it just has not been documented correctly by me. A branch/subroutine call with a constant destination address has to increment R15 even if not taken. In all other cases no change must be made to the operand when the condition is not true.
Ok. So I read your statement as saying that the compiler is doing the correct thing by making use of the instruction abra @R13++,!v. So there is no need to involve Volker in this discussion.
Instead, I've made the following TODO list: update the documentation, update cpu_test.asm (see Issue #180), and update the emulator and the hardware. So, @bernd-ulmann, will you take care of updating the documentation? Then I'll work on cpu_test.asm and the emulator and hardware.
@MJoergen I will take care of the CPU today.
@bernd-ulmann And I assume that this is not only about post-increment but also about pre-decrement?
That is a good question. Something like ABRA @--R15, !V is not really defined behaviour, as it would jump into the address which contains the ABRA instruction itself. R15 is somewhat nasty as it has a special meaning when constants are used in instructions, so IMHO only ABRA @R15++, ... should increment R15 in any case, as it would otherwise cause the processor to execute the constant following the ABRA instruction as an instruction when the condition is not met. I would leave ABRA @--R15, ... as is, i.e. a pretty nonsensical instruction. :-)
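A hypothetical sketch (addresses invented) of why this is degenerate:

0x8000  MOVE    R0, R1
0x8001  ABRA    @--R15, !V      ; during execution R15 already points past
                                ; this instruction, to 0x8002; the
                                ; pre-decrement pulls it back to 0x8001,
                                ; the address holding this very ABRA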
@bernd-ulmann OK, thank you, Bernd. But for all other pre-decrements, such as ABRA @--R0, !V, the same logic as described above applies: only perform the pre-decrement when the condition is true (in this case: when V=0). Right?
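For example, a minimal sketch of the intended semantics (register choice arbitrary):

        SUB     R1, R2          ; may set or clear V
        ABRA    @--R0, !V       ; V=0: decrement R0 first, then branch to
                                ; the destination read via the new R0
                                ; V=1: no branch, and R0 stays unchanged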
@MJoergen Just in case you already have the new CPU test: Feel free to check it in, so that I have a testbed for the hardware ;-)
@MJoergen A bit off topic but still very interesting and relevant: I just wondered whether our VBCC backend is smart enough to not emit such an admittedly elegant abra @R13++,!v in case you use register bank switching within the function. Because in such a case you cannot just return (using abra @R13++,!v): You need to do a DECRB beforehand, as sketched below.
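Sketched (hypothetical function skeleton; the exact VBCC prologue/epilogue may look different):

f:      INCRB                   ; body runs in a fresh register bank
        ...
        ABRA    @R13++, !V      ; hazardous here: would return while the
                                ; bank is still switched
        DECRB                   ; restore the caller's register bank first
        MOVE    @R13++, R15     ; then pop the return address into the PC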
I was not able to provoke that behavior with your above-mentioned code snippet plus these settings: qvc tmp.c -k -O1 -speed -rw-threshold=0, because it neither needed any stack nor any register banks. The compiler seems to be too smart. So this comment is just in case you'd like to explore/research this route a bit.
Coming back to my https://github.com/sy2002/QNICE-FPGA/issues/179#issuecomment-719960233, where I said "today": Learning: Never claim "today" :-) Obviously "today" did not work out.
I started to refactor the hardware CPU, since I did not want to add this ISA change on top of the meanwhile pretty convoluted CPU code base. I made some nice progress: The code is now much cleaner and it still passes cpu_test.asm; unfortunately, it does not pass int_test.asm any more. So it is still work-in-progress.
There are two new commits: One in develop that still passes all tests; this was just the harmless start of my refactoring and is not blocking the develop branch. And then one in dev-cpu-refactor; this is where my current WIP stands. It will probably take until my next nerd Saturday before I can finish this.
The CPU test has been updated with a rudimentary test of ABRA combined with various addressing modes on R0. The emulator has been updated too. I'm working on adding more tests in cpu_test.asm, but the basic functionality can be tested now.
CPU test is now complete.
@MJoergen Talking about branch dev-cpu-refactor: I managed to finish my refactoring today. This is a prerequisite for implementing this new ISA behavior. The refactored code works very well: It passes cpu_test.asm (the old one, before you added the new ISA behavior) and int_test.asm. So far so good. Even though it works fine, it is not achieving timing closure, and I am a bit stunned, because I do not really get why that is... I do not even know where to start. It seems so "arbitrary". Therefore I would be very glad if you could help me achieve timing closure, because you have so much more experience when it comes to that topic. We are talking about these two endpoints (see pictures):
After that, I am now very optimistic that I will be able to finalize the new ISA behavior very quickly, because the refactored design allows for a very elegant (yet to be implemented) rollback mechanism to cater for the new requirements specified by this issue.
Interesting. I'll look into this.
Looked into the implemented design of the Register File: Looks like the lower register window is not implemented as a RAM but using LUTs. No idea yet why this is.
EDIT: No, forget about that; it seems to be implemented with RAM after all, just with a ton of "64 X 1 Quad Port Distributed RAM (RAM64M)" primitives. Will check now how this was done in the past (pre-refactoring).
This is the same as in the earlier design. The reason is that the register_file allows asynchronous read, and the Block RAMs in the FPGA only support synchronous read, so the synthesizer has to fall back to distributed (LUT-based) RAM, which can be read combinatorially.
In the register_file I'm confused that some parts (i.e. the standard registers) are updated on the falling clock edge, whereas other parts (i.e. the special registers) are updated on the rising clock edge.
I think this is what is causing the timing problem: At the falling clock edge, the standard register gets updated, causing a change in the output signal read_data1. This signal must travel through the ALU and then return to the register_file in order to update the SR at the next rising clock edge. So only half a clock cycle is available (with the 50 MHz clock, that is just 10 ns).
So my question is: Is using two different clock edges in the register_file deliberate? It is certainly counter-intuitive that some registers are updated before/after others.
Another thing (not related to timing): Is there really a need for having separate signals for the Upper registers (8-12) and the Special registers (13-15)? To me it seems it should be possible to just have a single signal called Upper registers with the range (8-15).
So my question is: Is using two different clock edges in the register_file deliberate? It is certainly counter-intuitive that some registers are updated before/after others.
Yes, this is deliberate: The whole design of the CPU is made this way: "everything" clocks in on the negative edge (including MMIO devices' registers, memory, ...). But when it comes to the program counter, status register and stack pointer, the state machine expects them to be updated on the positive edge.
I chose this design back in the day because it allowed some speed-ups here and there: when you can perform "work" on both edges of the clock, you sometimes need fewer clock cycles for certain things. (But I certainly have not looked into the details for a while...)
This is something that has always been like this, and I shied away from refactoring it, because I felt this might be an even bigger refactoring task (that might have impacts that I cannot foresee right now), including a slow-down of the MIPS throughput.
I think this is what is causing the timing problem:
Well, since this behavior was exactly the same in the past, I wonder why it was not a problem then, and whether there might be a way to mitigate it now without the need for this larger refactoring.
The only thing that actually changed was that I moved these 3 registers into the register file module; before, they were inside the CPU module. Maybe the implementation algorithm placed them physically closer to the CPU/ALU while they were still inside the CPU module? Are there maybe options to tell the tool to ignore the "modules" that are defined in VHDL while placing the physical elements?
Is there really a need for having separate signals for the Upper registers (8-12) and the Special registers (13-15)?
I thought this was necessary because registers 8-12 need to be updated on the falling edge and 13-15 on the rising edge.
Ok, I just made an experiment: Changing the one occurrence of rising_edge to falling_edge in the register_file, and rebuilding. I was expecting timing closure, but the timing was actually much worse. So I guess you're right in the sense that we should leave the design as it is. And that includes keeping Upper registers and Special registers separate.
@MJoergen 🎆 🕺 ✌️ I did achieve timing closure (and even a bit better than it was before the refactoring) 😃
The solution was to replace "<" and ">" comparisons by checking for bit patterns. So instead of doing something like

if write_addr_i < 13 then

I am now checking for bit patterns that represent (for example) ">= 8 and < 13":

-- true for addresses 8..12 ("1000" .. "1100")
is_upper_register_wr <= true when write_addr(3) = '1' and (write_addr(2) = '0' or (write_addr(1) = '0' and write_addr(0) = '0')) else false;

I am doing that globally, and every time I need this inside a clocked process I just do

if is_upper_register_wr then
I do fully understand why this is faster.
But I do not understand why such a highly sophisticated tool as Vivado is not doing this kind of "easy optimization" automatically. If you look at how incredibly optimizing, for example, C compilers are nowadays (look at VBCC), it is hard to believe that VHDL synthesizers are not, and that you can gain a lot by doing this kind of optimization by hand.
But it is a fact; this incident is the proof. So my learning is that it is preferable to code optimized VHDL rather than readable VHDL.
Doing the above-mentioned tricks (I did it multiple times) brought me down to 9.741 ns for the worst path of the design, and then the following trick helped me to get even further down to 9.576 ns:
Replace this:

sel_rbank_i <= conv_integer(sel_rbank_mul) * 8;

by this:

-- multiply by 8 as pure rewiring: a left shift by three bits
sel_rbank_mul8(10 downto 0) <= sel_rbank(7 downto 0) & "000";
sel_rbank_i <= conv_integer(sel_rbank_mul8);
In other words: Telling the synthesizer to just "rewire" the signal: bits 7 downto 0 go up to bits 10 downto 3, and the lower part is filled with zeros.
P.S. I truly love your cpu_test.asm! ❤️ Running it after such an intense refactoring of the CPU gives me the good feeling that everything is still in order.
P.S.2: All this was just the prerequisite for the issue at hand, so the drawback of all of this is that I have not yet started adding the ISA change described here. This is my next step. But not today :-)
Wow, great work!
Vivado is a strange beast. Oftentimes it performs absolute magic with regard to optimizations, but occasionally it fails to deduce even the simplest things. It is a work of art to help Vivado produce efficient netlists!
I would still prefer to code for readability first, and only apply these magic optimizations when needed. As I said, most of the time these optimizations are not necessary.
All this trouble getting timing closure has made me start thinking about pipelining the CPU (Issue #94), but that will not be in this release!
I just updated the programming card and added a short section stating that in the (common) case of @R15++ in an xBRA/xSUB instruction, R15 is incremented in either case. :-)
@bernd-ulmann What about this code:
SUB R0, R1
ABRA @--R15, Z
According to the "treat R15 differently" rule: If Z=0, then the pre-decrement is done and we land at SUB R0, R1; and if Z=1, ditto?
Is the special rule that R15 is treated differently only true for post-incrementing R15, but not for predecrementing R15?
BTW: Whatever the answer is: This whole situation (the R15 exception) is the first case where the QNICE ISA is actually not "nice" but rather convoluted :-)
cpu_test.asm (including a bunch of smoke tests around /qbin and soft- and hardware interrupts). More details about the performance increase can be found in MIPS.md.
I merged dev-cpu-refactor to develop.
Closing this issue.
The short program below leads to what I believe is incorrect assembly when compiled with optimizations (-O3).
The program is:
The generated assembly for the function f() is:
The problem as I see it is that the conditional return abra @R13++,!v updates the stack pointer even if the branch is not taken (sketched below).
I discovered this problem while writing the tennis program. However, I have so far not been able to construct a short test program that actually crashes, because the compiler (in the example given here) actually inlines the function f(), so the problem does not appear.
So before I contact Volker: is this really a compiler bug, or have I misunderstood something about the ISA?
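For illustration, a hypothetical sketch (invented labels) of the failure mode I would expect: every time the conditional return is not taken, R13 drifts by one word, so a later real return pops garbage:

f:      ...                     ; something that sets or clears V
        ABRA    @R13++, !V      ; conditional return: if this pops R13
                                ; even when V=1 (branch not taken), the
                                ; stack pointer now points one word past
                                ; the return address
        ...
        MOVE    @R13++, R15     ; the eventual real return then loads a
                                ; garbage word into the PC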