sjohann81 / hf-risc

HF-RISC SoC
GNU General Public License v2.0
29 stars 36 forks source link

Halting CPU during opcode fetch #2

Closed rdb9879 closed 2 years ago

rdb9879 commented 3 years ago

I am trying to use this core on hardware that has opcodes stored in serial EEPROM. I have a hardware translation module that appears like a parallel memory block attached to the address/data bus, but under the hood it's driving serial commands to access the EEPROM. Naturally, I need to stall the CPU while the module is busy sending serial commands. I'm confused on the exact timing for driving the STALL signal high/low and where the valid opcode data needs to appear. If my hardware module sees the correct (EEPROM) address on the bus, should it drive the STALL signal high instantaneously (the exact same clock cycle), drive it high the next clock cycle, etc? Similarly, when the opcode data is available, should my hardware module present the data onto the databus 1 clock cycle before lowering stall, the same clock cycle as lowering stall, 1 clock cycle later, etc? Thanks

sjohann81 commented 3 years ago

The 'stall' signal should be driven by combinatorial logic. When the CPU issues an address, your logic should detect if the memory subsystem can provide data to the CPU in 1 clock cycle. If this is the case, 'stall' should remain low. If not, your logic must drive 'stall' high before the end of the cycle, as the CPU will sample the data bus after one cycle. When 'stall' is high, the CPU pipeline is stalled for the number of cycles needed in order to synchronize the external memory system, thus 'stall' must be kept high for the multi-cycle memory access. When the memory operation finishes, data should be put in the data bus before driving 'stall' low (on a read operation). There should be at least 2 clock cycles before driving 'stall' high again, due to the pipeline synchronization. As it is, this mechanism will only work if there is a cache or at least an instruction (prefetch) queue between the external memory and the CPU.

rdb9879 commented 3 years ago

Thanks for the input, but I am still having trouble. I followed your instructions exactly (stall is risen by combinatorial logic, setup data 1 clock cycle before driving 'stall' low, waiting at least 2 clock cycles before driving 'stall' high again), and verified this in simulation. I wrote a very simple SW program for testing (just setting registers x1 through x9 the values 1 to 9). The very first instruction works properly: I see "abc0_0000" on the address bus and the stall goes high instantly. After driving serial commands, I then put the opcode result on data_in, and then 1 clock cycle later I lower 'stall'. 2 clock cycles later, the address moves to "abc0_0004" and I drive 'stall' high again, instantly. This time, however, after I drive 'stall' low, two clock cycles later "abc0_0004" appears again on the bus (I would expect "abc0_0008"), and this cycle repeats forever: It keeps requesting "abc0_0004" over and over. For what it's worth, it does complete the two instructions: I see "1" in register x1, and "2" in register x2. I verified that the value being placed on "data_in" matches the hex code from the disassembler. Thanks again for the help.

sjohann81 commented 3 years ago

Could you post the code being used for the test and your general setup (e.g. modules organization / memory map)? If you wait 3 or more clock cycles instead of 2, does it make a difference? Also, a few screenshots of your simulation would ease my attempt to replicate the issue...

rdb9879 commented 3 years ago

No problem. I've created a very minimized testbench that replicates the issue. Sorry about the formatting, the 'code' tags are just not working correctly for me.

` -- This testbench runs the following assembly program, with -- "stalls" between each op-code fetch

--===================================================== --===================================================== -- .text -- .align 2 -- .global _entry --_entry: -- li x1, 1 -- li x2, 2 -- li x3, 3 -- li x4, 4 -- li x5, 5 -- li x6, 6 -- li x7, 7 -- beq zero, zero, _entry

--
-- Disassembled, this gives the following op-codes: --===================================================== --===================================================== -- code.elf: file format elf32-littleriscv -- Disassembly of section .text: -- 00000000 <_entry>: -- 0: 00100093 li ra,1 -- 4: 00200113 li sp,2 -- 8: 00300193 li gp,3 -- c: 00400213 li tp,4 -- 10: 00500293 li t0,5 -- 14: 00600313 li t1,6 -- 18: 00700393 li t2,7 -- 1c: fe0002e3 beqz zero,0 <_entry>

LIBRARY ieee; use ieee.std_logic_1164.ALL; USE ieee.numeric_std.ALL;

ENTITY datapath_tb IS END datapath_tb;

architecture behavior of datapath_tb is

component datapath Port ( clock : in std_logic; reset : in std_logic; stall : in std_logic; irq_vector : in std_logic_vector ( 31 downto 0 ); irq : in std_logic; irq_ack : out std_logic; exception : out std_logic; address : out std_logic_vector ( 31 downto 0 ); data_in : in std_logic_vector ( 31 downto 0 ); data_out : out std_logic_vector ( 31 downto 0 ); data_w : out std_logic_vector ( 3 downto 0 ); data_b : out std_logic; data_h : out std_logic; data_access : out std_logic ); end component;

signal irq_ack : std_logic;
signal exception : std_logic;
signal address : std_logic_vector ( 31 downto 0 );
signal data_out : std_logic_vector ( 31 downto 0 );
signal data_w : std_logic_vector ( 3 downto 0 );
signal data_b : std_logic;
signal data_h : std_logic;
signal data_access : std_logic;
signal clock : std_logic;
signal reset : std_logic;
signal stall : std_logic;
signal irq_vector : std_logic_vector ( 31 downto 0 );
signal irq : std_logic;
signal data_in : std_logic_vector ( 31 downto 0 );

type opCodes_t is array(0 to 7) of std_logic_vector(31 downto 0);

constant opCodeArray : opCodes_t := (0 => X"00100093", 1 => X"00200113", 2 => X"00300193", 3 => X"00400213", 4 => X"00500293", 5 => X"00600313", 6 => X"00700393", 7 => X"fe0002e3");

signal wordAddr : std_logic_vector(31 downto 2); signal wordAddri : integer; signal wordAddriclamped : integer; signal addrTargeted : std_logic; signal selectedOpCodeInst : std_logic_vector(31 downto 0); signal stallStarting : std_logic := '0'; signal stallActive : std_logic := '0';
signal holdoffActive : std_logic := '0'; signal stallCountDown : unsigned(3 downto 0) := (others => '0');

type stallStateMachine_t is (waitingForStart, stallIsActive, holdoffIsActive); signal currState : stallStateMachine_t := waitingForStart; signal latchedDataIn : std_logic_vector(31 downto 0) := (others => '0');

BEGIN

--Component Instantiation uut : datapath Port Map ( clock => clock ,
reset => reset ,
stall => stall ,
irq_vector => irq_vector , irq => irq ,
irq_ack => irq_ack ,
exception => exception ,
address => address ,
data_in => data_in ,
data_out => data_out ,
data_w => data_w ,
data_b => data_b ,
data_h => data_h ,
data_access => data_access
);

    wordAddr           <= address(31 downto 2);
    wordAddri          <= to_integer(unsigned(wordAddr));        
    addrTargeted       <= '1' when (wordAddri >= 0 and wordAddri < 8) else '0';
    wordAddriclamped   <= wordAddri when (addrTargeted = '1') else 0;  --So simulation doesn't crash if we go outside 0 to 7
    selectedOpCodeInst <= opCodeArray(wordAddriclamped) when (addrTargeted='1') else X"0000_0000";
    stallStarting      <= addrTargeted and not(stallActive) and not(holdoffActive);
    stall              <= (stallStarting or stallActive) and not(reset);

    --Convert read data to little endian
    data_in <= latchedDataIn(7 downto 0) & latchedDataIn(15 downto 8) & latchedDataIn(23 downto 16) & latchedDataIn(31 downto 24);

    StallStateProc : process(clock)
    begin
      if(rising_edge(clock)) then
         stallCountDown <= stallCountDown - 1;
         holdoffActive  <= '0';

         ---------------------------------
         if(currState = waitingForStart) then
            stallActive    <= '0';               
            stallCountDown <= X"F";

            if(stallStarting = '1' and reset = '0') then
               latchedDataIn  <= selectedOpCodeInst;
               stallActive    <= '1';
               currState      <= stallIsActive;
            end if;
         ---------------------------------   
         elsif(currState = stallIsActive) then

            if(stallCountDown = 0) then
               stallActive    <= '0';
               holdoffActive  <= '1';
               stallCountDown <= to_unsigned(1, stallCountDown'length);   --value of 1 holds off on stall for 2 clock cycles
               --stallCountDown <= to_unsigned(2, stallCountDown'length);   --value of 2 holds off on stall for 3 clock cycles
               currState      <= holdoffIsActive;
            end if;

         ---------------------------------
         elsif(currState = holdoffIsActive) then
            holdoffActive  <= '1';

            if(stallCountDown = 0) then
               holdoffActive  <= '0';
               currState      <= waitingForStart;
            end if;

         ---------------------------------
         else
            currState <= waitingForStart;
         end if;
      end if;
    end process StallStateProc;

clkProc : process
begin
    clock <= '1';
    wait for 5 ns;
    clock <= '0';
    wait for 5 ns;
end process;

MainTestProcess : Process
Begin
    reset <= '1';
    wait until rising_edge(clock); wait for 1 ps;
    wait until rising_edge(clock); wait for 1 ps;
    reset <= '0';

        wait for 20 us; 
        assert false report "Simulation Complete" severity failure;
        wait; -- Will wait forever
    end process;

END; `

As for screenshots, here's the 2-clock-cycle hold-off version:

RiscvEndlessLoop_2Clocks

and here's the 3-clock-cycle hold-off version:

RiscvEndlessLoop_3Clocks

sjohann81 commented 3 years ago

Thanks for the detailed explanation, I will try to address the issue. However, as I stated earlier, in the current implementation the stall cannot be used on every opcode fetch, as there needs to be at least a couple of cycles where the pipeline is running free after a stall. This is not a limitation if there is a prefetch queue or cache memory in the system, where the memory subsystem fetches a block of instructions. About the address generated by the CPU, remember that this implementation uses a three stage pipeline, so the instruction being executed was fetched two cycles in the past. In the case of pipeline stalls, this requires that the same address needs to be generated when an instruction could not be commited to the register file or memory (such as memory operations and branches). I will take a look to verify if this is working as expected.

rdb9879 commented 3 years ago

Hmmm well that could be a problem then. I don't know if I can guarantee that two cache misses in a row could never happen. If the code is doing multiple jumps in a row, or accessing some random location in ROM as part of a look-up table.

sjohann81 commented 3 years ago

Yes you can. The pipeline has three stages, and instructions in the branch delay slot are never executed in the RISC-V architecture, so any 'wrong' instructions following the branch, although fetched, will be executed as NOPs. In the second case, memory is accessed as data, and this is a three cycle operation. Cache misses in a row can be avoided if you fetch more data from memory in the same burst.

rdb9879 commented 3 years ago

Okay I think I'm understanding...in general. So if my module sees an instruction that hasn't been cached, it should go ahead and read that word plus the next N words (1, 2 or 3?) and due to the way the pipelining works, it is guaranteed to request these addresses in the following clock cycles, even if its not going to execute them, which will now be cached. "data access is a three cycle operation" does that mean that I am good on stalling as much as needed on data accesses?

sjohann81 commented 2 years ago

The issue appears to have been solved. I have decoupled the memory multiplexing logic for instruction and data accesses and performed some improvements in the design. These improvements were first tested with the new RV32E core and imported into the RV32I design after testing.