MJoergen opened this issue 4 years ago
That is a pretty involved design. :-) Quite cool!
Please don't get me wrong, but I would like to mention a classic floating point processor which has a pretty simple interface and has proven to be quite usable:
This would have the advantage of being very simple and of not being more or less a stand-alone processor. :-)
I like Michael's approach a lot, mainly for the following reason:
True parallel processing: The CPU can continue to run while the FPU (since it has its own RAM) executes its own programs.
And it does feel like OpenCL on a GPU 😄
But I have to admit that I have not yet looked at the PDF that Bernd suggested.
@MJoergen About VBCC: Assuming we had some kind of FPU, it would be cool if we found a solution, working together with Volker, so that you can seamlessly work with floats in C - "just like on any other machine". And if the programmer wants to use the turbo-boosted acceleration properties of your architecture - the "OpenCL" - then they need to write some specific code that utilizes a library that we would provide. So the programmer would have a choice: slightly slower, non-parallelized "standard floats in C" - or - turbo-boosted programs running in parallel to the CPU. Both would be possible...
We could use the remaining unused QNICE opcode as some kind of "breakout" instruction, i.e. something to tell the QNICE CPU to do nothing while another device may read further words from memory and do the actual instruction decoding. So we could integrate an FPU pretty seamlessly into the overall system architecture.
An external device such as an FPP would wait for a signal line from QNICE indicating that QNICE has detected such a reserved instruction. It would then increment the address from which this instruction was read (I can latch the address lines when the signal mentioned above is sent) and take over the bus to start reading/decoding instructions.
We could employ a signalling scheme similar to that used in interrupt handling, i.e. two lines: one to hand control over to the external device and one to hand it back to the CPU.
Thus we would not have true parallel processing but we could extend QNICE by a multitude of external devices using the reserved instruction opcode. We could also use the remaining 12 bits in this "trigger instruction" to denote which external device should take over control.
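Just to make this concrete, here is a minimal C sketch of how such a trigger word could be encoded. The opcode value and the field layout are assumptions, not something that exists in QNICE today:
/* Hypothetical encoding of the "trigger instruction": the unused 4-bit
   QNICE opcode in the upper bits, a 12-bit device selector in the rest.
   TRIGGER_OPCODE is only a placeholder value. */
#define TRIGGER_OPCODE 0xF
#define TRIGGER_WORD(device) ((unsigned int)((TRIGGER_OPCODE << 12) | ((device) & 0x0FFF)))
/* Example: TRIGGER_WORD(1) would hand control over to external device number 1. */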
What do you think?
My intention with my initial proposal was to make something that is simple to implement in hardware (and emulator) AND it requires no changes to the CPU nor to the compiler. On the down side, it requires specially crafted "co-processor assembly code" in order to make use of it. I see no way to get the compiler to support this co-processor. In theory we could write a separate (simple) compiler for this co-processor that accepts programs written in a limited subset of C. Initially such a compiler won't be necessary, but it could be nice to have as a long-term goal.
Bernd's suggestion where the external co-processor takes over the system bus does remove the need for a separate co-processor memory. Instead, the co-processor can access main memory directly. I'm assuming here that the CPU is outputting its Program Counter to the Address Bus so that the co-processor can latch this value and read its instructions from that point in the program. And I'm assuming that the co-processor "returns" a new PC back to the CPU, so it knows where to resume execution. I.e. the "special co-processor instruction" is followed directly by machine code readable by the co-processor, and after that machine code there follows "ordinary" QNICE instructions again.
Bernd's suggestion does give the option of having the C-compiler generate code for the co-processor. And until the C-compiler has been updated, we could live with writing "inline assembly" for the co-processor.
Does the current C-compiler support inline assembly using QNICE instructions?
The difference between the two architectures is quite small, as I see it. The internal part of the co-processor will be the same; only the interface to the CPU is different.
I think Bernd's design is more flexible. I don't think the fact that it loses the ability to do parallel processing is much of a problem. Often, the main CPU won't have anything useful to do anyway before the result is returned.
I see no way to get the compiler to support this co-processor.
Well, I think on the contrary that this should not be too hard. As far as I remember (it has been a while), when you write
float c = a * b;
VBCC emits function calls such as (pseudo code)
push parameter_A
push parameter_B
ABRA __fmul
<result in R8/R9 will be stored in c>
So "the only thing" we need to do is to provide the implementation to __fmul
the same way as we did that for example here, when we use the EAE in C for multiplications:
https://github.com/sy2002/QNICE-FPGA/blob/master/c/qnice/vclib/machines/qnice/libsrc/_lmul.s
BTW: This is what I called "the homework" in the email to Volker, on which @MJoergen was CC'd.
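Just to sketch the idea at C level (this is only a conceptual sketch; the real helper would be QNICE assembly in the spirit of _lmul.s, and none of the FPP registers or addresses below exist yet):
/* Hypothetical memory-mapped FPP registers -- placeholder addresses only. */
#define FPP_OP_A_LO (*(volatile unsigned int *) 0xFF80)
#define FPP_OP_A_HI (*(volatile unsigned int *) 0xFF81)
#define FPP_OP_B_LO (*(volatile unsigned int *) 0xFF82)
#define FPP_OP_B_HI (*(volatile unsigned int *) 0xFF83)
#define FPP_CMD     (*(volatile unsigned int *) 0xFF84)
#define FPP_RES_LO  (*(volatile unsigned int *) 0xFF85)
#define FPP_RES_HI  (*(volatile unsigned int *) 0xFF86)
#define FPP_CMD_MUL 0x0001                   /* placeholder command code */
typedef union { float f; unsigned int w[2]; } fp32;   /* int is 16 bit on QNICE */
float __fmul(float a, float b)               /* name taken from the pseudo code above */
{
    fp32 x, y, r;
    x.f = a; y.f = b;
    FPP_OP_A_LO = x.w[0]; FPP_OP_A_HI = x.w[1];
    FPP_OP_B_LO = y.w[0]; FPP_OP_B_HI = y.w[1];
    FPP_CMD = FPP_CMD_MUL;   /* a real design would poll a status bit here */
    r.w[0] = FPP_RES_LO; r.w[1] = FPP_RES_HI;
    return r.f;
}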
I just thought of something. The proposal from Bernd implies that the co-processor needs TWO clock cycles to access each floating point value. The reason is simply that the memory system in QNICE is 16-bit, so it takes two reads (or two writes) to transfer a single 32-bit value. So each floating point instruction will spend 6 clock cycles just moving data back and forth (reading two operands and writing one result) PLUS whatever is needed for the actual calculation.
In contrast, the proposal mentioned at the start of this issue gives the co-processor its own 32-bit memory, and therefore it can transfer an operand/result in just one cycle. So this approach will save 3 clock cycles for each floating point instruction. That was actually the motivation for the initial idea: To avoid spending too much time moving data back and forth.
One additional note. The co-processor could be stack-based. Sort-of emulating the RPN notation from the HP calculators. We could also borrow some ideas from this project: https://github.com/AcheronVM/acheronvm. I especially like the "sliding window" part.
Using our "spare" instruction not necessarily forces the FPP to use 2 cycles to access any FP number, this instruction does not invalidate your idea of a FPP with many internal registers/storage. A FP number stack is a nice :-) idea and we would be in good company with Intel etc. :-)
Why not combine both ideas? My idea of using the spare opcode and your FPP with many internal registers. We could simplify the instruction even further, like this:
The spare opcode still has 12 operand bits which are unused. We could say that we just feed those to the FPP which would give it a 12 bit instruction set which should be plenty given a stack architecture.
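For illustration only (all mnemonics and field widths here are invented), the 12 bits could be split into an operation field and a small operand field for a stack machine:
/* Hypothetical 12-bit FPP instruction: 4-bit operation + 8-bit operand. */
enum fpp_op { FPP_NOP = 0x0, FPP_PUSH = 0x1, FPP_POP = 0x2,
              FPP_ADD = 0x3, FPP_SUB = 0x4, FPP_MUL = 0x5, FPP_DIV = 0x6 };
#define FPP_INSTR(op, operand) ((unsigned int)(((op) << 8) | ((operand) & 0xFF)))
/* That gives 16 possible operations and an 8-bit operand, e.g. an index
   into the internal register stack -- plenty for a stack architecture. */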
What do you think?
Interesting. I will think about it the next few days until our meeting next week.
And just to make our meeting really interesting: I do like Michael's original idea a lot because it resembles how today's high-performance computers use GPUs as FPUs with their own language like OpenCL. :-) And by being like that, it introduces true parallel execution inside the co-processor.
Just my two cents worth: What I fear is that the proposed FPP will show "creeping featurism". It will start simple and then grow and outgrow the underlying QNICE processor, and in the end we might see that what we created is a 32-bit floating point processor which has turned into a stand-alone machine.
Do we want this? Thinking of Mirko's remark that he loves seeing QNICE becoming a great retro game machine, I doubt it. That's why I initially proposed a very simplistic approach like the FPP used in early PDP-11 systems. If we aim for utmost performance then QNICE is not the right underlying architecture - we might then go down the RISC-V route or something like that, but definitely not work on a 16-bit processor.
The QNICE architecture has, thanks to you, Mirko and Michael, matured from a pipe dream of mine into a real machine with a real and very impressive software stack. Nevertheless, the system is still simple enough that a good student can actually understand every bit of it and its associated software. The more highly complex features we add, the more cluttered the overall architecture will become, and we will create something that might be very cool performance-wise but no longer cool for educational or recreational purposes.
The problem is that from your perspective there is nothing too complicated in a computer. :-) But from the perspective of a younger person, QNICE already is on the verge of being comprehensible.
Not that I want to kill any dreams here, but I, personally, would opt for a simple FPP which just executes simple instructions on an internal stack of registers (eight or 16 should be more than sufficient). Even if we spend several cycles for transferring values it would be still much faster than a software implementation.
Speaking of which, we should also think about a software library for floating point operations, what do you think?
Have a great weekend! I am now off to today's lecture (hardware systems, by the way :-) ). :-)
Do we want this? Thinking of Mirko's remark that he loves seeing QNICE becoming a great retro game machine I doubt it.
Yeah, good argument, you're right and I am convinced: this is meant to be a recreational retro toy and not a high-performance computing platform ;-) So let's discuss a still good, but simple, approach on Tuesday evening that satisfies this philosophy.
Speaking of which, we should also think about a software library for floating point operations, what do you think?
Absolutely. We do need that in Monitor and we do need it inside VBCC (see my https://github.com/sy2002/QNICE-FPGA/issues/147#issuecomment-698917516)
Thank you Bernd for pointing out what I have missed: the goal of keeping the complexity low. Feature creep is a real "threat" to a project like this, and adding an FPP would greatly increase the complexity.
The initial aim of this issue was to have the ability to perform FP calculations. This can easily be achieved with a software-only solution.
So it seems to me a much better solution to enable FP support in the compiler and the library. That will give the added accuracy needed in the sprite ball demo program. When that is implemented, we can evaluate the performance and then discuss what - if anything - to do about it.
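To illustrate what the software-only path buys us (a rough sketch; the field names mirror the proposal below, and the helper names such as __fmul come from the VBCC pseudo code earlier in this thread):
/* With compiler/library FP support, the demo could simply use float: */
typedef struct { float pos_x, pos_y, vel_x, vel_y, radius; } t_ball;
void move_ball(t_ball *b)
{
    /* each float operation compiles into a soft-float library call,
       e.g. the additions below into something like __fadd */
    b->pos_x = b->pos_x + b->vel_x;
    b->pos_y = b->pos_y + b->vel_y;
}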
Motivation
The motivation for this issue is the demo program c/test_programs/demo_sprite_balls.c. In this program a large number of balls move about and collide, and calculating the new velocities using integer-only arithmetic leads to round-off errors and/or overflow.

Proposal
I propose a new I/O device with the following register map
The CSR register has the following interpretation
This I/O device acts as a co-processor with an internal virtual memory consisting of 8192 dwords (a dword is two 16-bit words, i.e. a 32-bit value). The CPU can access this virtual memory by using addresses 00 - 02 in the above register map.
Each 32-bit dword can contain either a floating point value (in the standard IEEE-754 format), or a special 32-bit co-processor instruction, see below.
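Since the register map table is not reproduced here, the following C sketch only illustrates the idea; the register names, addresses and order are assumptions:
/* Hypothetical view of registers 00-02 of the proposed I/O device. */
#define FPP_ADDR    (*(volatile unsigned int *) 0xFF90)  /* 00: virtual memory address (0..8191) */
#define FPP_DATA_LO (*(volatile unsigned int *) 0xFF91)  /* 01: low word of the selected dword   */
#define FPP_DATA_HI (*(volatile unsigned int *) 0xFF92)  /* 02: high word of the selected dword  */
typedef union { float f; unsigned int w[2]; } dword_t;   /* int is 16 bit on QNICE */
void fpp_write_dword(unsigned int vaddr, float value)
{
    dword_t d;
    d.f = value;               /* IEEE-754 bit pattern of the value    */
    FPP_ADDR    = vaddr;       /* select a dword in the virtual memory */
    FPP_DATA_LO = d.w[0];
    FPP_DATA_HI = d.w[1];
}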
The intention is that the co-processor has enough storage to contain all the floating point values needed by a running program. For instance, the demo program c/test_programs/demo_sprite_balls.c mentioned above would use 5 floating point values (pos_x, pos_y, vel_x, vel_y, radius) for each ball. With 50 balls this is a total of 250 dwords. This leaves plenty of room for the co-processor instructions.

Usage
The demo program c/test_programs/demo_sprite_balls.c could be modified to use this new co-processor. In this modified version the CPU will initially write the initial values to the co-processor, and then write a sequence of co-processor instructions (i.e. a "program") to the co-processor.

During normal operation the CPU will send a single GO command via the CSR register. This sets the co-processor's Program Counter (PC), sets the co-processor's BUSY flag, and the co-processor starts executing instructions. After a short while, the co-processor stops execution and clears the BUSY flag. The CPU can now read the result (e.g. the new sprite coordinates) from the register map.

An important point here is that there is no need to copy data back and forth between the CPU and the co-processor during normal operation. All the data needed by the co-processor lives entirely inside the co-processor. Going back to the demo program c/test_programs/demo_sprite_balls.c, the function update() will be replaced entirely by a single write to the co-processor CSR register. Furthermore, the data structure t_ball balls[NUM_SPRITES] will be completely removed.
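A minimal C sketch of that flow, assuming a hypothetical CSR address and placeholder bit positions for GO and BUSY:
#define FPP_CSR  (*(volatile unsigned int *) 0xFF93)  /* hypothetical CSR address  */
#define CSR_GO   0x0001                               /* placeholder bit positions */
#define CSR_BUSY 0x0002
void fpp_run(void)
{
    FPP_CSR = CSR_GO;              /* set the PC, set BUSY, start execution */
    while (FPP_CSR & CSR_BUSY)     /* wait for the co-processor to finish   */
        ;
    /* results (e.g. the new sprite coordinates) can now be read back
       through the register map */
}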
Co-processor instruction

Everything in the co-processor is 32-bit wide, including the instruction. The instruction format is as follows:
This allows for 64 different opcodes, and two 13-bit operand addresses.
The co-processor does not contain any internal registers, so all operands are loaded/stored directly in the virtual memory.
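For illustration, a possible packing of that format in C (6 + 13 + 13 = 32 bits; the exact field order is an assumption):
/* opcode in bits 31..26, first operand address in bits 25..13,
   second operand address in bits 12..0 -- field order assumed */
#define FPP_INSN(opcode, addr_a, addr_b)                  \
    ( ((unsigned long)((opcode) & 0x3F)   << 26) |        \
      ((unsigned long)((addr_a) & 0x1FFF) << 13) |        \
       (unsigned long)((addr_b) & 0x1FFF) )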
The following opcodes are required in the basic implementation:
The Error bit in the CSR register is set when a floating point error occurs. This could be one of the following:
Resource considerations
This feature requires no changes to the compiler, but will probably need a simple assembler. Using IEEE-754 floating point format makes testing this feature easier.
TBD: To be really useful, this co-processor should implement some possibility of conditional execution.