Suggestion: Refactor the EAE (hardware multiplier and divider)

sy2002 / MiSTer2MEGA65

Framework to simplify porting MiSTer (and other) cores to the MEGA65

GNU General Public License v3.0

35 stars 9 forks source link

Suggestion: Refactor the EAE (hardware multiplier and divider) #39

Open MJoergen opened 10 months ago

MJoergen commented 10 months ago

The refactoring will ultimately reduce the complex logic tree generated.

The idea is to use simpler (and slightly slower) algorithms, but reduce complexity (and maybe even synthesis time).

Currently, the EAE uses approx 1339 LUTs and 410 slices. This can probably be reduced by a factor of 5 or more.

Currently, a calculation takes 3 clock cycles. After refactoring I estimate a calculation to take around 7 clock cycles.

To accommodate the slower calculation, a CPU_WAIT_FOR_DATA signal should be added.

MJoergen commented 10 months ago

We might also want to include additional functions, e.g. square root. This can be done in around 7 clock cycles too. See this link for an implementation of the square root function: https://github.com/MJoergen/math/tree/master/fast_sqrt

MJoergen commented 10 months ago

Here is an implementation of the division algorithm: https://github.com/MJoergen/nexys4ddr/blob/master/offloader/Episodes/ep10_-_Fact/math/divmod.vhd

Note, all these optimized algorithms are expected to run at a much faster clock rate (> 200 MHz). This is possible because the bit shifting logic is very simple. As long as the fast clock is an integer multiple of the CPU clock, then there should be no Clock Domain Crossing issues.

sy2002 commented 10 months ago

@MJoergen Sounds like a good idea. We could do this in the "dev-V1.61" branch of QNICE, which is the "MiSTer2MEGA65" branch when time is ripe.

Instead of implementing a CPU_WAIT_FOR_DATA, we might also consider to use the busy flag in the EAE's CSR register, which was meant to signal to the Monitor's math library that the calculation is still runnning:

https://github.com/sy2002/QNICE-FPGA/blob/dev-V1.61/vhdl/EAE.vhd#L108C27-L108C27

The math library already has code that waits for the calculation to finish as shown here, for example:

https://github.com/sy2002/QNICE-FPGA/blob/master/monitor/math_library.asm#L19

We then would need to ensure, that our MiSTer2MEGA65 version of monitor.asm has NOT defined EAE_NO_WAIT:

https://github.com/sy2002/QNICE-FPGA/blob/master/monitor/monitor.asm#L4

So by going down this route, the refactoring would be rather seamless to the QNICE Monitor code.

MJoergen commented 10 months ago

Interesting alternative. I had indeed forgotten about that "busy" status :-) However, I still prefer the CPU_WAIT_FOR_DATA approach.

My reasoning is as follows: The "check for busy" code in math_library.asm takes three CPU instructions (MOVE, AND, RBRA), for a total of circa 7 (?) clock cycles in each iteration. That approach seems optimal for "long" waits, i.e. if the hardware is MANY clock cycles in completing. If, however, we reach the target of 7 clock cycles for a single calculation then just stalling the CPU for this many clock cycles is "simpler". If nothing else, it saves three instructions in the firmware.

Plus, the stalling will only ever occur during the read from the EAE. Notice line 24: https://github.com/sy2002/QNICE-FPGA/blob/master/monitor/math_library.asm#L24C1-L24C1 This is a "useful" instruction taking place simultaneously with the calculation. Therefore, the subsequent read from EAE in the next line will stall for only circa 4 clock cycles.