Open bala122 opened 2 years ago
An update after looking at the disassemblies: I think perf_get_mcycle64() just increments after every instruction (if the pipelined VexRiscv core takes 1 cycle/instruction on average), including the cfu_op instruction, and hence prints a total cycle count that is also just the number of instructions between "long start" and "long end". I'm not sure why this is happening. Could you please check the validity of perf_get_mcycle64(), or whether the mcycle CSR is in fact incrementing at every clock edge?
Hi @bala122 , the mcycle calls should return the true cycle count regardless of any stalls in the processor or on the memory bus.
This is running on the actual board, correct? Renode simulation does not model everything with true cycle-accurate delays.
Can you paste the disassembly around the CFU op including the cycle count calls?
No, actually I'm trying to simulate on Renode first. Is there a way to get cycle-accurate simulation without a physical board? (Using Vivado, maybe?)
You can try running the full Verilog (Verilator) simulation:
make PLATFORM=sim load
It runs slowly since it is a Verilog simulation of both the CPU and the CFU.
I'm getting the following error:
%Error: /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_svhn/cfu.v:4: Cannot find include file: fp_mul.v
`include "fp_mul.v"
^~~~~~~~~~
%Error: /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_svhn/cfu.v:4: This may be because there's no search path specified with -I<dir>.
`include "fp_mul.v"
^~~~~~~~~~
... Looked in:
fp_mul.v
fp_mul.v.v
fp_mul.v.sv
obj_dir/fp_mul.v
obj_dir/fp_mul.v.v
obj_dir/fp_mul.v.sv
%Error: /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_svhn/cfu.v:5: Cannot find include file: fp_add.v
`include "fp_add.v"
^~~~~~~~~~
%Error: Cannot find file containing module: /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_svhn
%Error: Exiting due to 4 error(s)
make[2]: *** [/home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/third_party/python/litex/litex/build/sim/core/Makefile:40: sim] Error 1
make[2]: Leaving directory '/home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/soc/build/sim.proj_nn_svhn/gateware'
make[1]: *** [/home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/soc/sim.mk:56: run] Error 1
make[1]: Leaving directory '/home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/soc'
make: *** [../proj.mk:354: load] Error 2
I have a main cfu.v file with other module files included in it (one of them being fp_mul.v).
The above worked only after putting all modules into a single cfu.v file. Also, it would be helpful if you could post some details about the interface signals (rsp_valid, rsp_ready, cmd_valid, cmd_ready), because I ran into a lot of errors when doing the cycle-accurate simulation in Verilator. For example, cmd_payload_function_id wasn't stable throughout the execution time of the accelerator (it stays stable only at the start), so I had to run various tests to figure out how the interface signals behave before I got them working. Thanks.
Hi, any updates on the cycle-accurate info about the CFU bus interface? Any doc would be helpful. Thanks.
Hi @bala122 , thanks for the poke. Yes, some text and waveforms would be appropriate.
Hi, also just a query related to profiling: are memory accesses to DRAM or SRAM modeled with a multi-cycle delay in Verilator, or is every access to memory still single-cycle? I ask because lately I've noticed that regions of the code involving memory accesses are taking up many cycles. I'm not sure whether this is because of pipeline stalls (assuming memory access is still one cycle), because memory accesses themselves take multiple cycles, or because of cache misses. For reference, below is part of the disassembly, followed by the clock count at which each cfu command (which takes 2 cycles to finish) is valid:
lw t4,0(a7)
4001ec0c: 02fe8e8b cfu[1,0] t4, t4, a5
4001ec10: 0048ae83 lw t4,4(a7)
4001ec14: 00978f13 addi t5,a5,9
4001ec18: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec1c: 0088ae83 lw t4,8(a7)
4001ec20: 01278f13 addi t5,a5,18
4001ec24: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec28: 000e2e83 lw t4,0(t3)
4001ec2c: 00178f13 addi t5,a5,1
4001ec30: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec34: 004e2e83 lw t4,4(t3)
4001ec38: 00a78f13 addi t5,a5,10
4001ec3c: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec40: 008e2e83 lw t4,8(t3)
4001ec44: 01378f13 addi t5,a5,19
4001ec48: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec4c: 00032e83 lw t4,0(t1)
cmd is valid @ 3315742
cmd is valid @ 3315746
cmd is valid @ 3315770
cmd is valid @ 3315774
cmd is valid @ 3315778
cmd is valid @ 3315802
So, obviously there are dependencies between instructions, but I'm not sure whether multi-cycle DRAM/SRAM memory access is taken into account. Additionally, I encountered this odd case:
4001ebcc: 02e7860b cfu[1,0] a2, a5, a4
4001ebd0: 02e7860b cfu[1,0] a2, a5, a4
4001ebd4: 02e7860b cfu[1,0] a2, a5, a4
4001ebd8: 02e7860b cfu[1,0] a2, a5, a4
cmd is valid @ 3352998
cmd is valid @ 3352999
----------------------------------
cmd is valid @ 3353020
cmd is valid @ 3353021
cmd is valid @ 3353022
I didn't understand why there was a 21-cycle delay between consecutive CFU calls even though there was no dependency. Does this mean the host couldn't access the CFU for a while because it was servicing a memory request (probably a load, with a multi-cycle memory access delay) from an earlier instruction?
Hi, any update on the above? The only reasonable conclusion I could draw was that memory is modelled with a multi-cycle delay even in simulation, and that many cache misses (plus some pipeline stalls) cause this behaviour. Is this right, or is there some other reason? Thanks in advance.
Hi @tcal-x , any update on this? Thanks, Bala.
Hi @bala122 , I do see the issue with Renode (that a long-running CFU operation is not tallied correctly in $mcycle), and will open an issue.
With regard to memory latency -- this is a known limitation of both Renode and Verilator simulation, that external memory latencies are not modeled accurately.
Okay, so can we at least assume that a cache miss, or even a cache access, usually costs more than one clock cycle? I'm guessing this would be the case in reality, and since I don't have a board right now, I want to model this and argue that by reducing the number of memory accesses and promoting data reuse, we can potentially reduce the total time taken. The memory model may not be exactly cycle-accurate, but is it possible that adding a buffer to reuse data and reduce memory accesses would show improvements in both models (cycle-accurate and not exactly cycle-accurate)?
Hi @tcal-x , any update on this? It would be really helpful while debugging; right now it is hard to debug with so little information about the interface.
@PiotrZierhoffer - Any thoughts?
@bala122 You are right, I would also get a lot of use out of an enhanced Renode that includes I-cache and D-cache simulation, even if it ran significantly slower. And, I did add a "note" in this section: https://cfu-playground.readthedocs.io/en/latest/step-by-step.html#verilog-cfu-development (regarding the interface timing, that the request signals are only valid for that one cycle). I'm still working on waveforms for certain scenarios.
(thanks @mithro for bringing in Renode people)
Sure, @tcal-x . Thanks for the update.
@bala122 here's a sneak preview:
Hi, I just wanted to know whether the perf_get_mcycle or perf_get_mcycle64 function for profiling gives accurate cycle counts. I tried it with one of my custom hardware peripherals, which I know takes at least around 80-90 cycles (it includes an FP multiply and add unit, which obviously takes quite a bit of time). However, perf_get_mcycle reports around 7 cycles, which seems to be off by a lot.
Additionally, if there is a known issue regarding cycle accuracy in perf_get_mcycle64(), please let me know.