Open bala122 opened 2 years ago
An update after looking at the disassemblies: I think perf_get_mcycle64() just increments after every instruction (if the pipelined VexRiscv core takes 1 cycle/instruction on average), including the cfu_op instruction, and hence prints a total cycle count that is also just the number of instructions between "long start" and "long end". I'm not sure why this is happening. Could you please check the validity of perf_get_mcycle64(), or whether the mcycle CSR is in fact incrementing at every clock edge?
Hi @bala122 , the mcycle calls should return the true cycle count regardless of any stalls in the processor or on the memory bus.
This is running on the actual board, correct? Renode simulation does not model everything with true cycle-accurate delays.
Can you paste the disassembly around the CFU op including the cycle count calls?
No, actually I'm trying to simulate on Renode first. Is there a way to get cycle-accurate simulation without a physical board? (Using Vivado, maybe?)
You can try running the full Verilog (Verilator) simulation:
make PLATFORM=sim load
It runs slowly since it is a Verilog simulation of both the CPU and the CFU.
I'm getting the following error:
%Error: /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_svhn/cfu.v:4: Cannot find include file: fp_mul.v
`include "fp_mul.v"
^~~~~~~~~~
%Error: /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_svhn/cfu.v:4: This may be because there's no search path specified with -I<dir>.
`include "fp_mul.v"
^~~~~~~~~~
... Looked in:
fp_mul.v
fp_mul.v.v
fp_mul.v.sv
obj_dir/fp_mul.v
obj_dir/fp_mul.v.v
obj_dir/fp_mul.v.sv
%Error: /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_svhn/cfu.v:5: Cannot find include file: fp_add.v
`include "fp_add.v"
^~~~~~~~~~
%Error: Cannot find file containing module: /home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/proj/proj_nn_svhn
%Error: Exiting due to 4 error(s)
make[2]: *** [/home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/third_party/python/litex/litex/build/sim/core/Makefile:40: sim] Error 1
make[2]: Leaving directory '/home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/soc/build/sim.proj_nn_svhn/gateware'
make[1]: *** [/home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/soc/sim.mk:56: run] Error 1
make[1]: Leaving directory '/home/shivaubuntu/CFU-playground/ee18b155/CFU_Playground_Gitlab/soc'
make: *** [../proj.mk:354: load] Error 2
I have a main cfu.v file with other module files included in it (one of them being fp_mul.v).
The above worked only after putting all modules into a single cfu.v file. Also, it would be helpful if you could post some details about the interface signals (rsp_valid, rsp_ready, cmd_valid, cmd_ready), because I ran into a lot of errors when doing the cycle-accurate simulation in Verilator. For example, cmd_payload_function_id wasn't stable throughout the execution time of the accelerator (it stays stable only at the start), so I had to run various tests to figure out how the interface signals behave before I got them working. Thanks.
Hi, any updates on the cycle-accurate info about the CFU bus interface? Any doc would be helpful. Thanks.
Hi @bala122 , thanks for the poke. Yes, some text and waveforms would be appropriate.
Hi, also just a query related to profiling: are memory accesses to DRAM or SRAM modeled with a multi-cycle delay in Verilator, or is every access to memory still single-cycle? I ask because lately I've noticed that regions of the code involving memory accesses are taking up many cycles. I'm not sure whether this is because of pipeline stalls (assuming memory access is still one cycle), because memory accesses themselves take multiple cycles, or because of cache misses. For reference, below is part of the disassembly, followed by the clock count at which each cfu command (which takes 2 cycles to finish) is valid:
lw t4,0(a7)
4001ec0c: 02fe8e8b cfu[1,0] t4, t4, a5
4001ec10: 0048ae83 lw t4,4(a7)
4001ec14: 00978f13 addi t5,a5,9
4001ec18: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec1c: 0088ae83 lw t4,8(a7)
4001ec20: 01278f13 addi t5,a5,18
4001ec24: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec28: 000e2e83 lw t4,0(t3)
4001ec2c: 00178f13 addi t5,a5,1
4001ec30: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec34: 004e2e83 lw t4,4(t3)
4001ec38: 00a78f13 addi t5,a5,10
4001ec3c: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec40: 008e2e83 lw t4,8(t3)
4001ec44: 01378f13 addi t5,a5,19
4001ec48: 03ee8e8b cfu[1,0] t4, t4, t5
4001ec4c: 00032e83 lw t4,0(t1)
cmd is valid @ 3315742
cmd is valid @ 3315746
cmd is valid @ 3315770
cmd is valid @ 3315774
cmd is valid @ 3315778
cmd is valid @ 3315802
So, obviously there are dependencies between instructions, but I'm not sure whether multi-cycle DRAM/SRAM memory access is taken into account. Additionally, I encountered this odd case:
4001ebcc: 02e7860b cfu[1,0] a2, a5, a4
4001ebd0: 02e7860b cfu[1,0] a2, a5, a4
4001ebd4: 02e7860b cfu[1,0] a2, a5, a4
4001ebd8: 02e7860b cfu[1,0] a2, a5, a4
cmd is valid @ 3352998
cmd is valid @ 3352999
----------------------------------
cmd is valid @ 3353020
cmd is valid @ 3353021
cmd is valid @ 3353022
I didn't understand why there was a 21-cycle delay between consecutive CFU calls even though there was no dependency. Does this mean the host couldn't access the CFU for a while because it was servicing a memory request (probably a load, with a multi-cycle memory access delay) from an earlier instruction?
Hi, any update on the above? The only reasonable conclusion I could draw was that memory is modelled with a multi-cycle delay even in simulation, and that many cache misses (plus some pipeline stalls) cause this behaviour. Is this right, or is there some other reason? Thanks in advance.
Hi @tcal-x , any update on this? Thanks, Bala.
Hi @bala122 , I do see the issue with Renode (that a long-running CFU operation is not tallied correctly in $mcycle), and will open an issue.
With regard to memory latency -- this is a known limitation of both Renode and Verilator simulation, that external memory latencies are not modeled accurately.
Okay, so can we at least assume that a cache miss, or even a cache access, usually costs more than one clock cycle? I'm guessing this would be the case in reality, and since I don't have a board right now, I want to model this and argue that by reducing the number of memory accesses and promoting data reuse, we can potentially reduce the total time taken. The memory model may not be exactly cycle-accurate, but is it possible that adding a buffer to reuse data and reduce memory accesses would show improvements in both models (cycle-accurate and not exactly cycle-accurate)?
Hi @tcal-x , any update on this? It would be really helpful while debugging; right now it is hard to debug with so little information about the interface.
@PiotrZierhoffer - Any thoughts?
@bala122 You are right, I would also get a lot of use out of an enhanced Renode that includes I-cache and D-cache simulation, even if it ran significantly slower. And, I did add a "note" in this section: https://cfu-playground.readthedocs.io/en/latest/step-by-step.html#verilog-cfu-development (regarding the interface timing, that the request signals are only valid for that one cycle). I'm still working on waveforms for certain scenarios.
(thanks @mithro for bringing in Renode people)
Sure, @tcal-x . Thanks for the update.
@bala122 here's a sneak preview:
Hi, I just wanted to know whether the perf_get_mcycle or perf_get_mcycle64 function for profiling gives accurate cycle counts. I tried it with one of my custom hardware peripherals, which I know takes at least around 80-90 cycles (it includes an FP multiply and add unit, which obviously takes quite a bit of time). However, perf_get_mcycle reports around 7 cycles, which seems to be off by a lot.
Additionally, if there is a known issue regarding cycle accuracy in perf_get_mcycle64(), please let me know.