Closed rdb9879 closed 2 years ago
The 'stall' signal should be driven by combinatorial logic. When the CPU issues an address, your logic should detect if the memory subsystem can provide data to the CPU in 1 clock cycle. If this is the case, 'stall' should remain low. If not, your logic must drive 'stall' high before the end of the cycle, as the CPU will sample the data bus after one cycle. When 'stall' is high, the CPU pipeline is stalled for the number of cycles needed in order to synchronize the external memory system, thus 'stall' must be kept high for the multi-cycle memory access. When the memory operation finishes, data should be put in the data bus before driving 'stall' low (on a read operation). There should be at least 2 clock cycles before driving 'stall' high again, due to the pipeline synchronization. As it is, this mechanism will only work if there is a cache or at least an instruction (prefetch) queue between the external memory and the CPU.
Thanks for the input, but I am still having trouble. I followed your instructions exactly (stall is risen by combinatorial logic, setup data 1 clock cycle before driving 'stall' low, waiting at least 2 clock cycles before driving 'stall' high again), and verified this in simulation. I wrote a very simple SW program for testing (just setting registers x1 through x9 the values 1 to 9). The very first instruction works properly: I see "abc0_0000" on the address bus and the stall goes high instantly. After driving serial commands, I then put the opcode result on data_in, and then 1 clock cycle later I lower 'stall'. 2 clock cycles later, the address moves to "abc0_0004" and I drive 'stall' high again, instantly. This time, however, after I drive 'stall' low, two clock cycles later "abc0_0004" appears again on the bus (I would expect "abc0_0008"), and this cycle repeats forever: It keeps requesting "abc0_0004" over and over. For what it's worth, it does complete the two instructions: I see "1" in register x1, and "2" in register x2. I verified that the value being placed on "data_in" matches the hex code from the disassembler. Thanks again for the help.
Could you post the code being used for the test and your general setup (e.g. modules organization / memory map)? If you wait 3 or more clock cycles instead of 2, does it make a difference? Also, a few screenshots of your simulation would ease my attempt to replicate the issue...
No problem. I've created a very minimized testbench that replicates the issue. Sorry about the formatting, the 'code' tags are just not working correctly for me.
--
-- Disassembled, this gives the following op-codes:
--=====================================================
--=====================================================
-- code.elf: file format elf32-littleriscv
-- Disassembly of section .text:
-- 00000000 <_entry>:
-- 0: 00100093 li ra,1
-- 4: 00200113 li sp,2
-- 8: 00300193 li gp,3
-- c: 00400213 li tp,4
-- 10: 00500293 li t0,5
-- 14: 00600313 li t1,6
-- 18: 00700393 li t2,7
-- 1c: fe0002e3 beqz zero,0 <_entry>
LIBRARY ieee; use ieee.std_logic_1164.ALL; USE ieee.numeric_std.ALL;
ENTITY datapath_tb IS END datapath_tb;
architecture behavior of datapath_tb is
component datapath Port ( clock : in std_logic; reset : in std_logic; stall : in std_logic; irq_vector : in std_logic_vector ( 31 downto 0 ); irq : in std_logic; irq_ack : out std_logic; exception : out std_logic; address : out std_logic_vector ( 31 downto 0 ); data_in : in std_logic_vector ( 31 downto 0 ); data_out : out std_logic_vector ( 31 downto 0 ); data_w : out std_logic_vector ( 3 downto 0 ); data_b : out std_logic; data_h : out std_logic; data_access : out std_logic ); end component;
signal irq_ack : std_logic;
signal exception : std_logic;
signal address : std_logic_vector ( 31 downto 0 );
signal data_out : std_logic_vector ( 31 downto 0 );
signal data_w : std_logic_vector ( 3 downto 0 );
signal data_b : std_logic;
signal data_h : std_logic;
signal data_access : std_logic;
signal clock : std_logic;
signal reset : std_logic;
signal stall : std_logic;
signal irq_vector : std_logic_vector ( 31 downto 0 );
signal irq : std_logic;
signal data_in : std_logic_vector ( 31 downto 0 );
type opCodes_t is array(0 to 7) of std_logic_vector(31 downto 0);
constant opCodeArray : opCodes_t := (0 => X"00100093", 1 => X"00200113", 2 => X"00300193", 3 => X"00400213", 4 => X"00500293", 5 => X"00600313", 6 => X"00700393", 7 => X"fe0002e3");
signal wordAddr : std_logic_vector(31 downto 2);
signal wordAddri : integer;
signal wordAddriclamped : integer;
signal addrTargeted : std_logic;
signal selectedOpCodeInst : std_logic_vector(31 downto 0);
signal stallStarting : std_logic := '0';
signal stallActive : std_logic := '0';
signal holdoffActive : std_logic := '0';
signal stallCountDown : unsigned(3 downto 0) := (others => '0');
type stallStateMachine_t is (waitingForStart, stallIsActive, holdoffIsActive); signal currState : stallStateMachine_t := waitingForStart; signal latchedDataIn : std_logic_vector(31 downto 0) := (others => '0');
BEGIN
--Component Instantiation
uut : datapath
Port Map (
clock => clock ,
reset => reset ,
stall => stall ,
irq_vector => irq_vector ,
irq => irq ,
irq_ack => irq_ack ,
exception => exception ,
address => address ,
data_in => data_in ,
data_out => data_out ,
data_w => data_w ,
data_b => data_b ,
data_h => data_h ,
data_access => data_access
);
wordAddr <= address(31 downto 2);
wordAddri <= to_integer(unsigned(wordAddr));
addrTargeted <= '1' when (wordAddri >= 0 and wordAddri < 8) else '0';
wordAddriclamped <= wordAddri when (addrTargeted = '1') else 0; --So simulation doesn't crash if we go outside 0 to 7
selectedOpCodeInst <= opCodeArray(wordAddriclamped) when (addrTargeted='1') else X"0000_0000";
stallStarting <= addrTargeted and not(stallActive) and not(holdoffActive);
stall <= (stallStarting or stallActive) and not(reset);
--Convert read data to little endian
data_in <= latchedDataIn(7 downto 0) & latchedDataIn(15 downto 8) & latchedDataIn(23 downto 16) & latchedDataIn(31 downto 24);
StallStateProc : process(clock)
begin
if(rising_edge(clock)) then
stallCountDown <= stallCountDown - 1;
holdoffActive <= '0';
---------------------------------
if(currState = waitingForStart) then
stallActive <= '0';
stallCountDown <= X"F";
if(stallStarting = '1' and reset = '0') then
latchedDataIn <= selectedOpCodeInst;
stallActive <= '1';
currState <= stallIsActive;
end if;
---------------------------------
elsif(currState = stallIsActive) then
if(stallCountDown = 0) then
stallActive <= '0';
holdoffActive <= '1';
stallCountDown <= to_unsigned(1, stallCountDown'length); --value of 1 holds off on stall for 2 clock cycles
--stallCountDown <= to_unsigned(2, stallCountDown'length); --value of 2 holds off on stall for 3 clock cycles
currState <= holdoffIsActive;
end if;
---------------------------------
elsif(currState = holdoffIsActive) then
holdoffActive <= '1';
if(stallCountDown = 0) then
holdoffActive <= '0';
currState <= waitingForStart;
end if;
---------------------------------
else
currState <= waitingForStart;
end if;
end if;
end process StallStateProc;
clkProc : process
begin
clock <= '1';
wait for 5 ns;
clock <= '0';
wait for 5 ns;
end process;
MainTestProcess : Process
Begin
reset <= '1';
wait until rising_edge(clock); wait for 1 ps;
wait until rising_edge(clock); wait for 1 ps;
reset <= '0';
wait for 20 us;
assert false report "Simulation Complete" severity failure;
wait; -- Will wait forever
end process;
END; `
As for screenshots, here's the 2-clock-cycle hold-off version:
and here's the 3-clock-cycle hold-off version:
Thanks for the detailed explanation, I will try to address the issue. However, as I stated earlier, in the current implementation the stall cannot be used on every opcode fetch, as there needs to be at least a couple of cycles where the pipeline is running free after a stall. This is not a limitation if there is a prefetch queue or cache memory in the system, where the memory subsystem fetches a block of instructions. About the address generated by the CPU, remember that this implementation uses a three stage pipeline, so the instruction being executed was fetched two cycles in the past. In the case of pipeline stalls, this requires that the same address needs to be generated when an instruction could not be commited to the register file or memory (such as memory operations and branches). I will take a look to verify if this is working as expected.
Hmmm well that could be a problem then. I don't know if I can guarantee that two cache misses in a row could never happen. If the code is doing multiple jumps in a row, or accessing some random location in ROM as part of a look-up table.
Yes you can. The pipeline has three stages, and instructions in the branch delay slot are never executed in the RISC-V architecture, so any 'wrong' instructions following the branch, although fetched, will be executed as NOPs. In the second case, memory is accessed as data, and this is a three cycle operation. Cache misses in a row can be avoided if you fetch more data from memory in the same burst.
Okay I think I'm understanding...in general. So if my module sees an instruction that hasn't been cached, it should go ahead and read that word plus the next N words (1, 2 or 3?) and due to the way the pipelining works, it is guaranteed to request these addresses in the following clock cycles, even if its not going to execute them, which will now be cached. "data access is a three cycle operation" does that mean that I am good on stalling as much as needed on data accesses?
The issue appears to have been solved. I have decoupled the memory multiplexing logic for instruction and data accesses and performed some improvements in the design. These improvements were first tested with the new RV32E core and imported into the RV32I design after testing.
I am trying to use this core on hardware that has opcodes stored in serial EEPROM. I have a hardware translation module that appears like a parallel memory block attached to the address/data bus, but under the hood it's driving serial commands to access the EEPROM. Naturally, I need to stall the CPU while the module is busy sending serial commands. I'm confused on the exact timing for driving the STALL signal high/low and where the valid opcode data needs to appear. If my hardware module sees the correct (EEPROM) address on the bus, should it drive the STALL signal high instantaneously (the exact same clock cycle), drive it high the next clock cycle, etc? Similarly, when the opcode data is available, should my hardware module present the data onto the databus 1 clock cycle before lowering stall, the same clock cycle as lowering stall, 1 clock cycle later, etc? Thanks