rsd-devel / rsd

RSD: RISC-V Out-of-Order Superscalar Processor
Apache License 2.0

Add benchmark / resource usage information #9

Open mithro opened 4 years ago

mithro commented 4 years ago

Would you mind adding some benchmark / resource usage information for the core?

The VexRiscv README has the following:

VexRiscv smallest (RV32I, 0.52 DMIPS/Mhz, no datapath bypass, no interrupt) ->
    Artix 7     -> 233 Mhz 494 LUT 505 FF
    Cyclone V   -> 193 Mhz 347 ALMs
    Cyclone IV  -> 179 Mhz 730 LUT 494 FF
    iCE40       -> 92 Mhz 1130 LC

....

VexRiscv full max dmips/mhz -> (RV32IM, 1.44 DMIPS/Mhz 2.70 Coremark/Mhz, 16KB-I$,16KB-D$, single cycle barrel shifter, debug module, catch exceptions, dynamic branch prediction in the fetch stage, branch and shift operations done in the Execute stage) ->
    Artix 7     -> 140 Mhz 1767 LUT 1128 FF
    Cyclone V   -> 90 Mhz 1,089 ALMs
    Cyclone IV  -> 79 Mhz 2,336 LUT 1,048 FF

VexRiscv full with MMU (RV32IM, 1.24 DMIPS/Mhz 2.35 Coremark/Mhz, with cache trashing, 4KB-I$, 4KB-D$, single cycle barrel shifter, debug module, catch exceptions, dynamic branch, MMU) ->
    Artix 7     -> 161 Mhz 1985 LUT 1585 FF
    Cyclone V   -> 124 Mhz 1,319 ALMs
    Cyclone IV  -> 122 Mhz 2,710 LUT 1,501 FF

VexRiscv linux balanced (RV32IMA, 1.21 DMIPS/Mhz 2.27 Coremark/Mhz, with cache trashing, 4KB-I$, 4KB-D$, single cycle barrel shifter, catch exceptions, static branch, MMU, Supervisor, Compatible with mainstream linux) ->
    Artix 7     -> 170 Mhz 2530 LUT 2013 FF
    Cyclone V   -> 125 Mhz 1,618 ALMs
    Cyclone IV  -> 116 Mhz 3,314 LUT 2,016 FF

My general guidance around the performance of soft-cores is listed in the following table:

[Image: Screenshot from 2020-01-14 14-50-02]

shioyadan commented 4 years ago

In the following paper, presented at FPT last year, we report resource usage and Dhrystone scores for RSD and several other cores.

http://sv.rsg.ci.i.u-tokyo.ac.jp/pdfs/Mashimo-FPT'19.pdf

Also, we plan to integrate some benchmarks into our repository, and will add such information to the documentation as well.

mithro commented 4 years ago

It looks like you only compared your CPU to BOOM and OPA? Any reason you didn't compare to in-order cores too?

Looks like you get O(2.04 DMIPS/MHz) @ O(90 MHz) using O(15k LUTs) and O(8k FF)? I'm currently assuming you are using an Artix-7 board?

For comparison, VexRiscv linux balanced (an in-order core) gets O(1.2 DMIPS/MHz) @ O(170 MHz) using O(2.5k LUTs) and O(2k FF).
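To make that comparison concrete, here is a rough sketch that multiplies the per-MHz score by the clock to get absolute DMIPS, then normalizes by LUT count. The figures are the order-of-magnitude values quoted above, not fresh measurements, and the `CoreResult` class and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreResult:
    """Hypothetical container for one core's published figures."""
    name: str
    dmips_per_mhz: float
    fmax_mhz: float
    luts: int
    ffs: int

    @property
    def dmips(self) -> float:
        # Absolute Dhrystone throughput at the reported clock.
        return self.dmips_per_mhz * self.fmax_mhz

    @property
    def dmips_per_klut(self) -> float:
        # Throughput normalized by area, in DMIPS per 1000 LUTs.
        return self.dmips / (self.luts / 1000)


# Order-of-magnitude numbers taken from this thread (assumed, not re-measured).
rsd = CoreResult("RSD", 2.04, 90, 15_000, 8_000)
vex = CoreResult("VexRiscv linux balanced", 1.2, 170, 2_500, 2_000)

for core in (rsd, vex):
    print(f"{core.name}: {core.dmips:.1f} DMIPS, "
          f"{core.dmips_per_klut:.1f} DMIPS per 1k LUTs")
```

With these rounded inputs the two cores land at a similar absolute DMIPS figure, while the LUT-normalized figure differs by several times. That only reflects Dhrystone on these particular configurations, which is why a common platform such as LiteX would give a fairer picture.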

I think it would be really interesting to add RSD to LiteX which already supports VexRISCV (see #6) to give a more fair comparison in real world benchmarks.

shioyadan commented 4 years ago

> It looks like you only compared your CPU to BOOM and OPA? Any reason you didn't compare to in-order cores too?

The main reason is that this paper focuses on an efficient implementation of an OoO processor on FPGAs.

(To be honest, another reason was the page limit for the conference paper, and there wasn't enough time before the deadline to perform a thorough evaluation that included in-order (InO) cores.)

Anyway, I also would like to evaluate RSD compared with other cores using more complex and real world benchmarks.

> Looks like you get O(2.04 DMIPS/MHz) @ O(90 MHz) using O(15k LUTs) and O(8k FF)?

Yes.

> I'm currently assuming you are using an Artix-7 board?

We used a ZedBoard with an XC7Z020, whose FPGA fabric seems to correspond to an Artix-7 with 53K LUTs. I'm not an FPGA expert, but one of our team members is, and he may be able to provide additional information about the board.

msmssm commented 4 years ago

RSD currently supports only Zynq-based platforms because it depends on an ARM processor on Zynq to load a program binary for RSD into an external memory. I believe that both Zynq-7000 and Artix-7 are based on the Xilinx 7-series FPGA architecture and the resource comparison is fair even though one uses Zynq-7000 and the other uses Artix-7.

Dolu1990 commented 2 years ago

> I believe that both Zynq-7000 and Artix-7 are based on the Xilinx 7-series FPGA

@msmssm I was quite surprised, but it seems that Zynq-7000 can be considerably faster than Artix-7 at the same speed grade.

Looking at their datasheets:
https://www.xilinx.com/support/documentation/data_sheets/ds181_Artix_7_Data_Sheet.pdf
https://www.xilinx.com/support/documentation/data_sheets/ds191-XC7Z030-XC7Z045-data-sheet.pdf

For instance, take TSHCKO in the CLB Distributed RAM Switching Characteristics. For that timing, the slowest Zynq (speed grade -1) is faster than the fastest Artix-7 (speed grade -3). The same seems true for LUTs.

For BRAM the gap seems a bit smaller: Zynq -2 is roughly equal to Artix-7 -3.

Would need to test on a whole design to see what the average is ^^

Dolu1990 commented 2 years ago

I just found out that inside the same Zynq family, there are 2 classes of devices with radically different timings:

https://docs.xilinx.com/v/u/en-US/ds187-XC7Z010-XC7Z020-Data-Sheet vs https://docs.xilinx.com/v/u/en-US/ds191-XC7Z030-XC7Z045-data-sheet

And so the XC7Z020 devices used in the paper seem totally equivalent to the Artix-7 ones :) my bad!

shioyadan commented 2 years ago

msmssm is unable to respond to this issue for the reasons stated in the email. Thank you again for your information!

Dolu1990 commented 2 years ago

@shioyadan Thanks :)