Variation in counter values

mahdi259 commented 10 months ago

Describe the bug I use neorv32 to determine cycle counts that my custom instruction takes in a loop. My intention is to show the benefits that it apply to a certain application instead of using the conventional method.

The problem is that when I add some neorv32_uart0_printf before my experiment block, the total clock cycles changes. It looks storage because start_time() should work independently after printing.

Expected behavior The total cycle count of a piece of code be constant, regardless of other codes of the program.

Screenshots If applicable, add screenshots to help explain your problem.

Hardware: Extensions: M, C, zicntr, zihpm, zxcfu.

As pictures show, removing one printf increases measured clock cycles.

// neorv32_uart0_printf("----------------\n"); neorv32_uart0_printf("---- First method:\n"); // ------------------------ res = 0; iterations = 40; start_time(); for (i=0; i<iterations; i++) { res += neorv32_cfu_r3_instr(0b0000000, 0b000, rs1, rs2); } stop_time(); t0 = get_time(); neorv32_uart0_printf("Iterations: %d, Cycles: %d\n", iterations, t0); // ------------------------

NikLeberg commented 10 months ago

Hi @mahdi259 interesting problem! Have you enabled the cache(s)? Please share your full generics settings of your neorv32 VHDL instance. I learned the hard way that re shuffling the code order (or commenting something out) can have seemingly very strange impact on code execution when a cache is involved. Otherwise I'd recommend you to compare the generated assembly from GCC that is output with make asm for both versions. I hope that helps, Nik

mahdi259 commented 10 months ago

Hi @NikLeberg, I don't use caches in my design. The enabled memories are 20KB IMEM and 8KB DMEM. The top module is attached to my reply. I will try make asm. Thanks
topmodule.zip

mahdi259 commented 10 months ago

I checked make asm and there is no difference. The benchmark with extra printf has an additional code block for it(before start_time()).

There is one asm code that calls start_time. With extra printf: 1f6:576000ef jal ra,76c <start_time> Without extra printf: 204:574000ef jal ra,778 <start_time>

stnolting commented 10 months ago

Hey @mahdi259

Which counters are you using to benchmark your code? The standard CPU counters cycle and instret, the hardware performance monitors (HPMs) or the machine timer (MTIME)? How do you setup these counters?

Could show the code for your start_time(), stop_time() and get_time() functions?

mahdi259 commented 10 months ago

Hi I use functions of core_portme.c from coremark directory. I think they use hardware performance monitor counters.


void start_time(void) {
    GETMYTIME(&start_time_val);
    neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0); // start all counters
}

void stop_time(void) {
    neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1); // stop all counters
    GETMYTIME(&stop_time_val);
}

CORE_TICKS get_time(void) {
    CORE_TICKS elapsed
        = (CORE_TICKS)(MYTIMEDIFF(stop_time_val, start_time_val));
    return elapsed;
}

By the way, does get_time() return milliseconds?

stnolting commented 10 months ago

I use functions of core_portme.c from coremark directory.

Ah I see. These functions use the CPU-internal cycle counter cycle.

Do you need the cycle counter for anything else than benchmarking? E.g. for some delays?

I would recommend to use a HPM here so there is no risk that this specific counter might be overridden by some other software part. This is what I would do:

  // enable ALL counters (including HPMs)
  neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0);

  // configure HPM3 to count clock ticks
  neorv32_cpu_csr_write(CSR_MHPMEVENT3, 1 << HPMCNT_EVENT_CY);

  // clear HPM3 (low word only), could be put into a 'start_time' function
  neorv32_cpu_csr_write(CSR_MHPMCOUNTER3, 0):

  ...your application code you want to profile...

  // read current HPM3 value (low word only), could be put into a `get_time` function
  uint32_t elapsed_cycles = neorv32_cpu_csr_read(CSR_MHPMCOUNTER3);

If your application codes runs for a "long" time you might also need to use the high-word of HPM3.

mahdi259 commented 10 months ago

Unfortunately the problem persists. Just try the following code snippet by commenting/uncommenting the specified printf:

neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, -1);// Enable CPU counters

//  neorv32_uart0_printf("----------------\n"); // **Here**
  neorv32_uart0_printf("---- First method:\n");

  // ------------------------
  neorv32_cpu_csr_write(CSR_MCOUNTINHIBIT, 0);// Enable CPU counters
  neorv32_cpu_csr_write(CSR_MHPMEVENT3, 1 << HPMCNT_EVENT_CY); 
  neorv32_cpu_csr_write(CSR_MHPMCOUNTER3, 0);
  for (i=0; i<40; i++) {
  asm("nop");
  }
  stop_time_val = neorv32_cpu_csr_read(CSR_MHPMCOUNTER3);
  neorv32_uart0_printf("Iterations: %d, Cycles: %d\n", iterations, stop_time_val);

stnolting commented 10 months ago

I've tested both version. The one without the extra printf:

----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

And the one with the extra printf:

----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

In both cases the HPM reading is identical. I am using a simple rv32i_zicsr_zifencei CPU + HPMs but without any caches. Are you sure that caches are disabled in your setup? Is the file your have attached above the actual top entity?

You could check the caches with this piece of code:

  if (NEORV32_SYSINFO->SOC & (1 << SYSINFO_SOC_ICACHE)) { neorv32_uart0_printf("ICACHE enabled\n"); }
  else { neorv32_uart0_printf("ICACHE disabled\n"); }

  if (NEORV32_SYSINFO->SOC & (1 << SYSINFO_SOC_DCACHE)) { neorv32_uart0_printf("DCACHE enabled\n"); }
  else { neorv32_uart0_printf("DCACHE disabled\n"); }

mahdi259 commented 10 months ago

You are right. Seems that the variation occurs when compiling code with MARCH=rv32imc. I'm sure that caches are not implemented. The top module is exactly what I sent you.

MARCH=rv32imc:

ICACHE disabled
DCACHE disabled
----ULPPACK 1-bit:
Iterations: 0, Cycles: 517

CFU demo program completed.

ICACHE disabled
DCACHE disabled
----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 400

CFU demo program completed.

MARCH=rv32i:

ICACHE disabled
DCACHE disabled
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

CFU demo program completed.

ICACHE disabled
DCACHE disabled
----------------
----ULPPACK 1-bit:
Iterations: 0, Cycles: 438

CFU demo program completed.

stnolting commented 10 months ago

I have re-run my simulations also using a rv32i and a rv32ic compiler configuration. And I can reproduce your findings.

The generated assembly code is identical - at least the loop that is used for profiling. The general outcome of the code is also correct. So there is no "actual" bug in the core (phew 😅).

However, this is a really interesting scenario. I've looked through some waveforms and finally found the reason for this strange behavior:

The CPU's font-end keeps fetching new instructions independently of the actual instruction execution. As soon as a branch instruction is executed (like at the end of your loop) the front-end is reset to start fetching instruction from the branch destination.

The code version running for 517 the instruction fetch was quite fast and the entire instruction prefetch buffer is full. The front-ent is waiting do for some free space in that buffer before it can continue fetching. At this point the end-of-loop branch kicks in. Unfortunately, the front-end is not reset (yet) because it first waits for free space in the prefetch buffer.

Long story short, this is the problem:

https://github.com/stnolting/neorv32/blob/5281589fbf9329745b81534700ac8bede86a7185/rtl/core/neorv32_cpu_control.vhd#L379-L385

In this state the front-end reset should also be evaluated to allow a faster restart. So it should be something like this:

 when IF_REQUEST => -- request next 32-bit-aligned instruction word 
 -- ------------------------------------------------------------ 
   if (ipb.free = "11") then -- wait for free IPB space 
     fetch_engine.state <= IF_PENDING;
   elsif (fetch_engine.restart = '1') or (fetch_engine.reset = '1') then -- restart request due to branch
      fetch_engine.state <= IF_RESTART;
   elsif (execute_engine.state = SLEEP) then -- halt request (sleep)? 
     fetch_engine.state <= IF_PARKED; 
   end if;

I will do some more tests and if everything works fine I'll open a PR to update the front-end logic.

Thanks for finding this ~~bug~~ bottleneck! 😉

mahdi259 commented 10 months ago

Thanks

stnolting commented 10 months ago

The variations issue should be fixed now - at least when no caches are implemented :wink:

mahdi259 commented 10 months ago

Yes. There is no variation in the discussed code now. Thanks for your effort. I also learned from your replies that the C extension potentially makes variations in the clock cycles so I disabled it to get more stable results. Thanks

mahdi259 commented 10 months ago

Actually I got some further variations in my benchmark experimenting CFU custom instructions. Finally I removed C extension in compilation and got stable results. Thanks again for your attention. Best regards for you...

stnolting commented 10 months ago

The C extension can cause minor variations as it also supports unaligned 32-bit instructions words (that are 16-bit aligned) requiring an additional bus access to fetch the "second half of the instruction".

In large programs this variation should not be noticeable. However, you could try to increase the size of the instruction prefetch buffer:

https://github.com/stnolting/neorv32/blob/c95e03978cd2cf14f21d76f3ac405fcafce444d6/rtl/core/neorv32_package.vhd#L51-L52

This might reduce this kind of variations. Furthermore, large blocks of linear code can benefit from this is terms of increased execution speed.

mahdi259 commented 10 months ago

Thanks for your detailed description.

stnolting / neorv32

Variation in counter values #773