ijor / fx68k

FX68K 68000 cycle accurate SystemVerilog core
GNU General Public License v3.0
135 stars 31 forks source link

Core performance and size improvement #6

Open fredrequin opened 3 years ago

fredrequin commented 3 years ago

No more LINT warnings under Verilator. New core size is less than 3200 LEs. Fmax on a Stratix 1S40 is 130 - 135 MHz

sorgelig commented 3 years ago

using FPGA specific BRAM is good for completed project is fine, but it's not good for generic core supposed to be used in wide variety of FPGAs. I suggest to replace it with inferred BRAM.

fredrequin commented 3 years ago

Inferring true dual ported BRAM with byte enables is nearly impossible with Quartus II. That's why I had to use vendor specific BRAM. The generic code is still there and used by Verilator simulation.

sorgelig commented 3 years ago

it's possible, especially in SV:

module dpram #(ADDRWIDTH=9)
(
    input                 clock,

    input [ADDRWIDTH-2:0] address_a,
    input          [15:0] data_a, 
    input                 wren_a,
    output     reg [15:0] q_a,

    input [ADDRWIDTH-1:0] address_b,
    input           [7:0] data_b, 
    input                 wren_b,
    output     reg  [7:0] q_b
);

logic [1:0][7:0] ram[0:(1<<(ADDRWIDTH-1))-1];

always@(posedge clock) begin
    if(wren_a) begin
        ram[address_a] <= data_a;
        q_a <= data_a;
    end
    else begin
        q_a <= ram[address_a];
    end
end

always@(posedge clock) begin
    if(wren_b) begin
        ram[address_b[ADDRWIDTH-1:1]][address_b[0]] <= data_b;
        q_b <= data_b;
    end
    else begin
        q_b <= ram[address_b[ADDRWIDTH-1:1]][address_b[0]];
    end
end

endmodule

it's not exactly with byte enable, but pretty much similar and easy to modify. Quartus also provides templates (edit->insert template) with similar module.

fredrequin commented 3 years ago

Been there, done that. With clock enable / byte enable, it does not work : Info: RAM logic "fx68k_soc:U_fx68k_soc|fx68k:U_fx68k|excUnit:excUnit|fx68kRegs:U_fx68kRegs|dpram:U_dpram_B|ram" is uninferred due to asynchronous read logic Info: RAM logic "fx68k_soc:U_fx68k_soc|fx68k:U_fx68k|excUnit:excUnit|fx68kRegs:U_fx68kRegs|dpram:U_dpram_W|ram" is uninferred due to asynchronous read logic Info: RAM logic "fx68k_soc:U_fx68k_soc|fx68k:U_fx68k|excUnit:excUnit|fx68kRegs:U_fx68kRegs|dpram:U_dpram_L|ram" is uninferred due to asynchronous read logic I can even trigger a crash in Quartus II :-) Internal Error: Sub-system: OPT, File: /quartus/synth/opt/opt_ram_resource_aware_st.cpp, Line: 8236 this->m_read_a_during_write_b_behavior == CDB_RAM_READ_OLD_DATA Stack Trace: 0xeee6d: opt_sea_of_mux + 0x4a7cd (SYNTH_OPT) 0xedc84: opt_sea_of_mux + 0x495e4 (SYNTH_OPT) 0xecbad: opt_sea_of_mux + 0x4850d (SYNTH_OPT) 0xec623: opt_sea_of_mux + 0x47f83 (SYNTH_OPT) 0x1ebc06: SMP_ROOT::operator= + 0x3b16 (SYNTH_OPT) 0x1ee9e4: SMP_ROOT::operator= + 0x68f4 (SYNTH_OPT) 0x1ea7d7: SMP_ROOT::operator= + 0x26e7 (SYNTH_OPT) 0x36ba: RTL_ROOT::RTL_ROOT + 0x26ba (SYNTH_OPT) 0x4050: RTL_ROOT::process_sgate_netlist + 0x4e0 (SYNTH_OPT) 0x1369fa: SGN_HSSI_CONVERT::operator= + 0x67c2a (synth_sgn) 0x137e73: SGN_HSSI_CONVERT::operator= + 0x690a3 (synth_sgn) 0xa8403: sgn_qic_helper + 0x9c843 (synth_sgn) 0xab635: sgn_qic_helper + 0x9fa75 (synth_sgn) 0xac040: sgn_qic_helper + 0xa0480 (synth_sgn) 0xae52: sgn_full + 0xc2 (synth_sgn)

 0xfd5b: qexe_get_tcl_sub_option + 0xefb (comp_qexe)
0x133a7: qexe_process_cmdline_arguments + 0x5b7 (comp_qexe)
0x134a1: qexe_standard_main + 0x71 (comp_qexe)

 0x1a18: msg_exe_fini + 0xf8 (CCL_MSG)
 0x19bc: msg_exe_fini + 0x9c (CCL_MSG)
 0xa290: operator new[] + 0x130 (ccl_mem)
0x344d4: msg_exe_main + 0x74 (CCL_MSG)

 0x13d1: BaseThreadInitThunk + 0x21 (KERNEL32)
0x154f3: RtlUserThreadStart + 0x33 (ntdll)

End-trace

Quartus II 64-Bit Version 9.1 Build 222 10/21/2009 SJ Full Version

jotego commented 3 years ago

Thank you so much for improving this core once more. I have used it in several arcade games already and I really appreciate its quality. About this update:

I agree with @sorgelig. Abandoning generic code for vendor-specific instances seriously limits the reusability of this. Your commit messages suggest that you also fixed 1 or 2 bugs -apart from performance improvements.

I would like to be able to keep those bug fixes without the vendor-specific RAM.

Could you have the vendor-specific code under an ifdef macro statement?

a1exh commented 3 years ago

@jotego the pull request hasn't been approved yet. I'd wait.

The commit messages may not be referring to bugs in the original code.

Some changes, such as increasing the address bus width might not be in the spirit of fx68k? I don't think a 32-bit address capable variant of the 68000 was ever available.

ijor commented 3 years ago

Hi everybody,

@jotego @a1exh As far as I can see the bugs are not in the original code.

@fredrequin I need to review the changes before accepting the pull request. Please bear with me for the delay.

But I agree with @sorgelig that vendor specific code should be avoided. If you want, put all the vendor specific stuff in a separate file and ifdef the reference in the main code. The default must be to use generic code, though.

sorgelig commented 3 years ago

@fredrequin would be good to see the performance and size before and after to judge the difference

fredrequin commented 3 years ago

Hello all, yes the bugs were my fault while changing the ALU or doing some code clean-up. To me, extending the address bus does not impact the performance or size, if you do not connect the upper address bits, you are back to 16 MB addressing. BTW, there was a very rare 68012 chip with 2GB addressing (Ok, it was 68010 based, so slightly different from a 68000). Performance wise, I was not able to reach 100 MHz with the original core. Now, I can reach 130 MHz. For the size, I went from 5180 LEs to 3170 LEs. Most of the size gain comes from the register file in BRAM. The way the original core was written prevented the use of block RAM for the registers. I am able to read the registers on T3, so I can put a block RAM before the buses MUXes on T4. Registers writes are still done on T4. Given the limited pipelining of the 68000, I assume that there is no concurrent read / write ? If there is some, a bypass mechanism can be added in the register file. The speed gain comes mostly from the ALU : by changing the OP_xxx order so the cases get simplified and by using the same Adder than the J68 core. As many of you suggested, I can put some ifdef for vendor specific code.

fredrequin commented 3 years ago

I have pushed a new change with generic ROM/RAM and Altera ROM/RAM. The core is even smaller (I have put the microLatch and nanoLatch inside the M4Ks) : 2985 LEs ! This size reduction is a little bit suspicious, I was expecting ~100 LEs less and not 200 LEs.

ijor commented 3 years ago

@fredrequin On exactly which FPGA model and with which Quartus version you got those size and performance results?

fredrequin commented 3 years ago

I have used Quartus II 64-bit Full v9.1 and the target is a Stratix EP1S40F780C5 (Nios Dev. Board, Stratix Professional Edition).

jotego commented 3 years ago

What is the verdict about this? Could we rescue the Verilator fixes at least?

jotego commented 3 years ago

I have tried compiling with Fredrequin's softcore, and the games hang up. I haven't looked into why they hang up. I assume there is some bug in the modifications done.