Latency measurement through `CSR(MCYCLE)` adds one extra cycle

Hi @stnolting !

Back here again 😅

I'm performing a CFU (Custom Function Unit) latency measurement using the CSR(MYCYCLE).

For this purpose I've added a multiplier with NEORV32 through this mode of connection.

In this context I've made the following code in C:

  //0000000000000001 x 0000000000000001
  static uint32_t fir = 0x00010001;
  //0000000000000010 x 0000000000000010
  static uint32_t sec = 0x00020002;
  //0000000000000100 x 0000000000000100
  static uint32_t thi = 0x00040004;
  //0000000000001000 x 0000000000001000
  static uint32_t fou = 0x00080008;

  // Intro
  neorv32_uart0_printf("\n CFU-mpw \n");

  // Perform 4 multiplication through custom instruction (funct3=001)
    neorv32_cpu_csr_write(CSR_MCYCLE, 0);
    neorv32_cfu_r3_instr(0b1111111, 0b001, fir, 0); 
    neorv32_cpu_csr_read(CSR_MCYCLE); 

    neorv32_cpu_csr_write(CSR_MCYCLE, 0);
    neorv32_cfu_r3_instr(0b1111111, 0b001, sec, 0); 
    neorv32_cpu_csr_read(CSR_MCYCLE); 

    neorv32_cpu_csr_write(CSR_MCYCLE, 0);
    neorv32_cfu_r3_instr(0b1111111, 0b001, thi, 0); 
    neorv32_cpu_csr_read(CSR_MCYCLE); 

    neorv32_cpu_csr_write(CSR_MCYCLE, 0);
    neorv32_cfu_r3_instr(0b1111111, 0b001, fou, 0); 
    neorv32_cpu_csr_read(CSR_MCYCLE); 

  // End
  neorv32_uart0_printf("\nEND\n");

Note: in the custom instruction the two 16-bit input operators are contained in rs1, rs2 don't care and the 32-bit result is stored in rd.

Obtaining the following assembly code:

  asm volatile ("csrw %[input_i], %[input_j]" :  : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
 214:   00000793            li  a5,0
 218:   b0079073            csrw    mcycle,a5
    neorv32_cfu_r3_instr(0b1111111, 0b001, fir, 0); 
 21c:   00010737            lui a4,0x10
 220:   00170713            add a4,a4,1 # 10001 <__neorv32_ram_size+0xe001>
 224:   fef7170b            .word   0xfef7170b
  asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
 228:   b0002773            csrr    a4,mcycle
  asm volatile ("csrw %[input_i], %[input_j]" :  : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
 22c:   b0079073            csrw    mcycle,a5
    neorv32_cfu_r3_instr(0b1111111, 0b001, sec, 0); 
 230:   00020737            lui a4,0x20
 234:   00270713            add a4,a4,2 # 20002 <__neorv32_ram_size+0x1e002>
 238:   fef7170b            .word   0xfef7170b
  asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
 23c:   b0002773            csrr    a4,mcycle
  asm volatile ("csrw %[input_i], %[input_j]" :  : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
 240:   b0079073            csrw    mcycle,a5
    neorv32_cfu_r3_instr(0b1111111, 0b001, thi, 0); 
 244:   00040737            lui a4,0x40
 248:   00470713            add a4,a4,4 # 40004 <__neorv32_ram_size+0x3e004>
 24c:   fef7170b            .word   0xfef7170b
  asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
 250:   b0002773            csrr    a4,mcycle
  asm volatile ("csrw %[input_i], %[input_j]" :  : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
 254:   b0079073            csrw    mcycle,a5
    neorv32_cfu_r3_instr(0b1111111, 0b001, fou, 0); 
 258:   00080737            lui a4,0x80
 25c:   00870713            add a4,a4,8 # 80008 <__neorv32_ram_size+0x7e008>
 260:   fef7170b            .word   0xfef7170b
  asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
 264:   b00027f3            csrr    a5,mcycle

The code executes four multiplications to perform the CFU latency measurement. The first measurement will be performed between the instruction pc: 218 where the CSR: MYCYCLE is set to 0 and the instruction pc: 228 where the first latency will be read from this register. In the same way, the second measurement is performed between pc: 22c and pc: 23c, the third between pc: 240 and pc: 250 and the fourth between pc: 254 and pc: 264.

The complex design (NEORV32 + multiplier) is simulated and the following results are obtained:

issue_CSR_1

Note: The PROGRAM COUNTER, INSTRUCTION, CSR(MYCYCLE) and CFU signals are from NEORV32 and the FIFO_IN, FIFO_OUT and FULL/EMPTY signals are from multiplier.

The question is that all the instructions to perform the CFU custom instruction lui, add and .work take 11 cycles to complete and the CSR(MYCYCLE) counts 12 cycles.

As we can see in the following image:

issue_CSR_2

Is it possible that the latency measurement through CSR(MYCYCLE) is adding one extra cycle? :thinking:

I look forward to your reply,

Cheers! 😃

/cc @umarcor

Isn't there something missing? The variables for the return data??

neorv32_cfu_r3_instr(0b1111111, 0b001, fir, 0); 
neorv32_cpu_csr_read(CSR_MCYCLE);

The cycle counter will increment in every clock cycle if the CPU is not in sleep mode. The CSR instructions to read/write any CSR are identical in terms of execution time, but they" commit in different stages. In this cas the CSR write is one cycle faster than the CSR read. That might be the reason for the +1 offset?!

Anyway, I suggest to also add the execute engine's state register to you waveform. Then you can see when an instruction has really completed execution - just looking at the program counter is not explicit enough for this.

Hi @stnolting !

Isn't there something missing? The variables for the return data??

The program is not designed to perform a implementation in FPGA. Therefore, it has been simplified to the maximum and the instructions/variables have been reduced to those necessary to excite the signals to be captured/measured.

The idea is to obtain directly in CI the result of the latency measurement for each modes/accelerators.

For this purpose the complex (NEORV32 + multiplier) is simulated through VUnit and the CSR(MYCYCLE) value is extracted using the following code.

  mycycle_capture: process
  begin
    done <= false;
    wait until start and rising_edge(clk);
    for x in 0 to test_items-1 loop
      wait until rising_edge(clk) and csr_we = '0' and csr_valid = '1' and csr_addr = x"B00" and csr_rdata_o /= x"00000000"; -- CSR MYCYCLE ADDR IS 0xB00
      info(logger, "Data " & to_string(x+1) & "/" & to_string(test_items) & " latency is " & to_string(to_integer(unsigned(csr_rdata_o))) & " cycles");
      wait until rising_edge(clk);
    end loop;

    wait until rising_edge(clk);
    done <= true;
    wait;
  end process;

To obtain the latency when the simulation is performed.

issue_CSR_3

Note: uart.tx: mpw means "multp_wfifos", it is the name of one of the accelerators under test.

For this reason I didn't use variables to return data.

It should be noted that I've tested in FPGA the correct operation of the design with a program with variables to return data to print these values via uart. 😁

Note: in this case I also measured 12 cycles of latency.

In this cas the CSR write is one cycle faster than the CSR read. That might be the reason for the +1 offset?!

In the first comment in the second image we can see that the csrw (CSR write) instruction pc: 218 takes three cycles to execute and the csrr (CSR read) instruction pc: 228 also takes three cycles to execute. I can't see why CSR write is one cycle faster. 😅 Please, could you re-explain this to me?

Anyway, I suggest to also add the execute engine's state register to you waveform. Then you can see when an instruction has really completed execution - just looking at the program counter is not explicit enough for this.

Okey, thanks for the hint! 😄

I will relaunch the simulation to see this information and will comment the results. 🚀

Cheers! 😃

Hi again!

I've added the execute engine's state register in the waveform.

Since the vcd format cannot display type elements I've made the following assignment:

with execute_engine.state select
     states <= "0001" when DISPATCH,
               "0010" when TRAP_ENTER,
               "0011" when TRAP_EXIT,
               "0100" when RESTART,
               "0101" when FENCE,
               "0110" when SLEEP,
               "0111" when EXECUTE,
               "1000" when ALU_WAIT,
               "1001" when BRANCH,
               "1010" when BRANCHED,
               "1011" when SYSTEM,
               "1100" when MEM_REQ,
               "1101" when MEM_WAIT,
               "0000" when others;

I know that the ghw format supports type elements but the simulation is so large that it gets stuck in this format. 😅

Anyway I've obtained the following results:

issue_CSR_5

For csrw instruction:

First cycle: EXECUTE
Second cycle: SYSTEM
Third cycle: DISPATCH

For csrr instruction:

First cycle: EXECUTE
Second cycle: SYSTEM
Third cycle: DISPATCH

At first look, it seems that both instructions take the same time to execute. In addition, it looks that the cycles are captured through csrr when the SYSTEM is consulted adding one extra cycle to the measurement, that is the EXECUTE cycle. :thinking:

What do you think?

Cheers! 😃

Since the vcd format cannot display type elements I've made the following assignment:

That's smart!

At first look, it seems that both instructions take the same time to execute. In addition, it looks that the cycles are captured through csrr when the SYSTEM is consulted adding one extra cycle to the measurement, that is the EXECUTE cycle.

That's right! All CSR access instruction always require the same amount of cycles. However, read and write access have have not the same latency. Write accesses happen in the 2nd cycle after the SYSTEM state. Read accesses (or actually the write-back to the register file) already happen in the cycle right after SYSTEM.

That's why the write access seems to be one cycle faster in the overall chronological order. However, I guess this is a very relative view 😅

Hi @stnolting !

That's right! All CSR access instruction always require the same amount of cycles. However, read and write access have have not the same latency. Write accesses happen in the 2nd cycle after the SYSTEM state. Read accesses (or actually the write-back to the register file) already happen in the cycle right after SYSTEM.

Okey!

Then, following your indication, I've understood this:

issue_CSR_6

If I'm right you must subtract one cycle to the cfu_r3_instr latency measurement and in general to any instruction/s that you are going to measure between csrw(mcycle = 0) and csrr.

Could you please confirm if I am correct in my assumption?

Cheers! 😃

Hi!

To reinforce my argument I've simulated the NEORV32 with this brief program:

  //Intro
  neorv32_uart0_printf("\n CSR \n\n");
  neorv32_cpu_csr_write(CSR_MCYCLE, 0);
  neorv32_cpu_csr_read(CSR_MCYCLE); 
  neorv32_uart0_printf("\nEND-CSR\n");
  //END

In assembler:

  asm volatile ("csrw %[input_i], %[input_j]" :  : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
 1f0:   00000793            li  a5,0
 1f4:   b0079073            csrw    mcycle,a5
  asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
 1f8:   b00027f3            csrr    a5,mcycle

To see if the consecutive execution of csrw(mcycle = 0) and csrr instructions counts in the read operation 1 or 0 cycles.

Giving two possible scenarios:

Read operation counts 0 cycles. The measurement through CSR(MCYCLE) doesn't add any extra cycle.
Read operation counts 1 cycle. The measurement through CSR(MCYCLE) adds one extra cycle.

In this context I've obtained the following waveform:

issue_CSR_8

And the following simulation result:

issue_CSR_7

In view of these results, we can confirm that the latency measurement through CSR(MCYCLE) adds one extra cycle to whatever we want to measure between csrw(mcycle = 0) and csrr. If this were not the case, the result of this test should have been 0 cycles.

To fix this issue I've subtracted one cycle to the csr_rdata_o in the process of extracting CSR(MCYCLE) information.

From:

info(logger, "Data " & to_string(x+1) & "/" & to_string(test_items) & " latency is " & to_string(to_integer(unsigned(csr_rdata_o))) & " cycles");

To:

info(logger, "Data " & to_string(x+1) & "/" & to_string(test_items) & " latency is " & to_string(to_integer(unsigned(csr_rdata_o)) - 1) & " cycles");

What do you think about this conclusion?

Cheers! 😃

In view of these results, we can confirm that the latency measurement through CSR(MCYCLE) adds one extra cycle to whatever we want to measure between csrw(mcycle = 0) and csrr.

Right. This is a side-effect of the actual hardware implementation.

This is what I would do: initialize mcycle only once at the beginning. Then, just read the "current time" from that CSR and compute the time deltas. Thus, you will always be using read accesses for mcycle resulting in identical latency:

// initialize mcycle (low-part only!)
neorv32_cpu_csr_write(CSR_MCYCLE, 0);

...

uint32_t time_start = neorv32_cpu_csr_read(CSR_MCYCLE); // start time
... // your code to be benchmarked
uint32_t time_end   = neorv32_cpu_csr_read(CSR_MCYCLE); // end time

uint32_t execution_cycles = time_end - time_start; // time delta

Hi @stnolting !

Right. This is a side-effect of the actual hardware implementation.

Okey! Issue solved! 🎉 🎊

This is what I would do: initialize mcycle only once at the beginning. Then, just read the "current time" from that CSR and compute the time deltas. Thus, you will always be using read accesses for mcycle resulting in identical latency:

I appreciated your tip!

Thank you so much for your quick responses! 😎

Cheers! 😃

stnolting / neorv32

Latency measurement through `CSR(MCYCLE)` adds one extra cycle #897