Closed mahdi259 closed 10 months ago
Hi @mahdi259 interesting problem!
Have you enabled the cache(s)? Please share your full generics settings of your neorv32 VHDL instance.
I learned the hard way that re shuffling the code order (or commenting something out) can have seemingly very strange impact on code execution when a cache is involved.
Otherwise I'd recommend you to compare the generated assembly from GCC that is output with make asm
for both versions.
I hope that helps,
Nik
Hi @NikLeberg,
I don't use caches in my design. The enabled memories are 20KB IMEM and 8KB DMEM.
The top module is attached to my reply. I will try make asm
. Thanks
topmodule.zip
I checked make asm
and there is no difference.
The benchmark with extra printf has an additional code block for it(before start_time()
).
There is one asm code that calls start_time.
With extra printf:
1f6:576000ef jal ra,76c <start_time>
Without extra printf:
204:574000ef jal ra,778 <start_time>
Hey @mahdi259
Which counters are you using to benchmark your code? The standard CPU counters cycle
and instret
, the hardware performance monitors (HPMs) or the machine timer (MTIME)? How do you setup these counters?
Could show the code for your start_time()
, stop_time()
and get_time()
functions?
Hi I use functions of core_portme.c from coremark directory. I think they use hardware performance monitor counters.
void start_time(void) {
GETMYTIME(&start_time_val);
neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0); // start all counters
}
void stop_time(void) {
neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1); // stop all counters
GETMYTIME(&stop_time_val);
}
CORE_TICKS get_time(void) {
CORE_TICKS elapsed
= (CORE_TICKS)(MYTIMEDIFF(stop_time_val, start_time_val));
return elapsed;
}
By the way, does get_time()
return milliseconds?
I use functions of core_portme.c from coremark directory.
Ah I see. These functions use the CPU-internal cycle counter cycle
.
Do you need the cycle counter for anything else than benchmarking? E.g. for some delays?
I would recommend to use a HPM here so there is no risk that this specific counter might be overridden by some other software part. This is what I would do:
// enable ALL counters (including HPMs)
neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0);
// configure HPM3 to count clock ticks
neorv32_cpu_csr_write(CSR_MHPMEVENT3, 1 << HPMCNT_EVENT_CY);
// clear HPM3 (low word only), could be put into a 'start_time' function
neorv32_cpu_csr_write(CSR_MHPMCOUNTER3, 0):
...your application code you want to profile...
// read current HPM3 value (low word only), could be put into a `get_time` function
uint32_t elapsed_cycles = neorv32_cpu_csr_read(CSR_MHPMCOUNTER3);
If your application codes runs for a "long" time you might also need to use the high-word of HPM3.
Unfortunately the problem persists. Just try the following code snippet by commenting/uncommenting the specified printf:
neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1);// Enable CPU counters
// neorv32_uart0_printf("----------------\n"); // **Here**
neorv32_uart0_printf("---- First method:\n");
// ------------------------
neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0);// Enable CPU counters
neorv32_cpu_csr_write(CSR_MHPMEVENT3, 1 << HPMCNT_EVENT_CY);
neorv32_cpu_csr_write(CSR_MHPMCOUNTER3, 0);
for (i=0; i<40; i++) {
asm("nop");
}
stop_time_val = neorv32_cpu_csr_read(CSR_MHPMCOUNTER3);
neorv32_uart0_printf("Iterations: %d, Cycles: %d\n", iterations, stop_time_val);
I've tested both version. The one without the extra printf:
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438
And the one with the extra printf:
----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438
In both cases the HPM reading is identical. I am using a simple rv32i_zicsr_zifencei
CPU + HPMs but without any caches. Are you sure that caches are disabled in your setup? Is the file your have attached above the actual top entity?
You could check the caches with this piece of code:
if (NEORV32_SYSINFO->SOC & (1 << SYSINFO_SOC_ICACHE)) { neorv32_uart0_printf("ICACHE enabled\n"); }
else { neorv32_uart0_printf("ICACHE disabled\n"); }
if (NEORV32_SYSINFO->SOC & (1 << SYSINFO_SOC_DCACHE)) { neorv32_uart0_printf("DCACHE enabled\n"); }
else { neorv32_uart0_printf("DCACHE disabled\n"); }
You are right. Seems that the variation occurs when compiling code with MARCH=rv32imc. I'm sure that caches are not implemented. The top module is exactly what I sent you.
MARCH=rv32imc:
ICACHE disabled
DCACHE disabled
----ULPPACK 1-bit:
Iterations: 0, Cycles: 517
CFU demo program completed.
ICACHE disabled
DCACHE disabled
----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 400
CFU demo program completed.
MARCH=rv32i:
ICACHE disabled
DCACHE disabled
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438
CFU demo program completed.
ICACHE disabled
DCACHE disabled
----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438
CFU demo program completed.
I have re-run my simulations also using a rv32i
and a rv32ic
compiler configuration. And I can reproduce your findings.
The generated assembly code is identical - at least the loop that is used for profiling. The general outcome of the code is also correct. So there is no "actual" bug in the core (phew 😅).
However, this is a really interesting scenario. I've looked through some waveforms and finally found the reason for this strange behavior:
The CPU's font-end keeps fetching new instructions independently of the actual instruction execution. As soon as a branch instruction is executed (like at the end of your loop) the front-end is reset to start fetching instruction from the branch destination.
The code version running for 517 the instruction fetch was quite fast and the entire instruction prefetch buffer is full. The front-ent is waiting do for some free space in that buffer before it can continue fetching. At this point the end-of-loop branch kicks in. Unfortunately, the front-end is not reset (yet) because it first waits for free space in the prefetch buffer.
Long story short, this is the problem:
In this state the front-end reset should also be evaluated to allow a faster restart. So it should be something like this:
when IF_REQUEST => -- request next 32-bit-aligned instruction word
-- ------------------------------------------------------------
if (ipb.free = "11") then -- wait for free IPB space
fetch_engine.state <= IF_PENDING;
elsif (fetch_engine.restart = '1') or (fetch_engine.reset = '1') then -- restart request due to branch
fetch_engine.state <= IF_RESTART;
elsif (execute_engine.state = SLEEP) then -- halt request (sleep)?
fetch_engine.state <= IF_PARKED;
end if;
I will do some more tests and if everything works fine I'll open a PR to update the front-end logic.
Thanks for finding this bug bottleneck! 😉
Thanks
The variations issue should be fixed now - at least when no caches are implemented :wink:
Yes. There is no variation in the discussed code now. Thanks for your effort. I also learned from your replies that the C extension potentially makes variations in the clock cycles so I disabled it to get more stable results. Thanks
Actually I got some further variations in my benchmark experimenting CFU custom instructions. Finally I removed C extension in compilation and got stable results. Thanks again for your attention. Best regards for you...
The C extension can cause minor variations as it also supports unaligned 32-bit instructions words (that are 16-bit aligned) requiring an additional bus access to fetch the "second half of the instruction".
In large programs this variation should not be noticeable. However, you could try to increase the size of the instruction prefetch buffer:
This might reduce this kind of variations. Furthermore, large blocks of linear code can benefit from this is terms of increased execution speed.
Thanks for your detailed description.
Describe the bug I use neorv32 to determine cycle counts that my custom instruction takes in a loop. My intention is to show the benefits that it apply to a certain application instead of using the conventional method.
The problem is that when I add some neorv32_uart0_printf before my experiment block, the total clock cycles changes. It looks storage because start_time() should work independently after printing.
Expected behavior The total cycle count of a piece of code be constant, regardless of other codes of the program.
Screenshots If applicable, add screenshots to help explain your problem.
Hardware: Extensions: M, C, zicntr, zihpm, zxcfu.
As pictures show, removing one printf increases measured clock cycles.
// neorv32_uart0_printf("----------------\n"); neorv32_uart0_printf("---- First method:\n"); // ------------------------ res = 0; iterations = 40; start_time(); for (i=0; i<iterations; i++) { res += neorv32_cfu_r3_instr(0b0000000, 0b000, rs1, rs2); } stop_time(); t0 = get_time(); neorv32_uart0_printf("Iterations: %d, Cycles: %d\n", iterations, t0); // ------------------------