Open MJoergen opened 10 months ago
We might also want to include additional functions, e.g. square root. This can be done in around 7 clock cycles too. See this link for an implementation of the square root function: https://github.com/MJoergen/math/tree/master/fast_sqrt
Here is an implementation of the division algorithm: https://github.com/MJoergen/nexys4ddr/blob/master/offloader/Episodes/ep10_-_Fact/math/divmod.vhd
Note, all these optimized algorithms are expected to run at a much faster clock rate (> 200 MHz). This is possible because the bit shifting logic is very simple. As long as the fast clock is an integer multiple of the CPU clock, then there should be no Clock Domain Crossing issues.
@MJoergen Sounds like a good idea. We could do this in the "dev-V1.61" branch of QNICE, which is the "MiSTer2MEGA65" branch when time is ripe.
Instead of implementing a CPU_WAIT_FOR_DATA, we might also consider to use the busy
flag in the EAE's CSR register, which was meant to signal to the Monitor's math library that the calculation is still runnning:
https://github.com/sy2002/QNICE-FPGA/blob/dev-V1.61/vhdl/EAE.vhd#L108C27-L108C27
The math library already has code that waits for the calculation to finish as shown here, for example:
https://github.com/sy2002/QNICE-FPGA/blob/master/monitor/math_library.asm#L19
We then would need to ensure, that our MiSTer2MEGA65 version of monitor.asm
has NOT defined EAE_NO_WAIT
:
https://github.com/sy2002/QNICE-FPGA/blob/master/monitor/monitor.asm#L4
So by going down this route, the refactoring would be rather seamless to the QNICE Monitor code.
Interesting alternative. I had indeed forgotten about that "busy" status :-) However, I still prefer the CPU_WAIT_FOR_DATA
approach.
My reasoning is as follows: The "check for busy" code in math_library.asm
takes three CPU instructions (MOVE
, AND
, RBRA
), for a total of circa 7 (?) clock cycles in each iteration. That approach seems optimal for "long" waits, i.e. if the hardware is MANY clock cycles in completing. If, however, we reach the target of 7 clock cycles for a single calculation then just stalling the CPU for this many clock cycles is "simpler". If nothing else, it saves three instructions in the firmware.
Plus, the stalling will only ever occur during the read from the EAE. Notice line 24: https://github.com/sy2002/QNICE-FPGA/blob/master/monitor/math_library.asm#L24C1-L24C1 This is a "useful" instruction taking place simultaneously with the calculation. Therefore, the subsequent read from EAE in the next line will stall for only circa 4 clock cycles.
The refactoring will ultimately reduce the complex logic tree generated.
The idea is to use simpler (and slightly slower) algorithms, but reduce complexity (and maybe even synthesis time).
Currently, the EAE uses approx 1339 LUTs and 410 slices. This can probably be reduced by a factor of 5 or more.
Currently, a calculation takes 3 clock cycles. After refactoring I estimate a calculation to take around 7 clock cycles.
To accommodate the slower calculation, a
CPU_WAIT_FOR_DATA
signal should be added.