openrisc / mor1kx

mor1kx - an OpenRISC 1000 processor IP core
Other
493 stars 147 forks source link

Marocchino Tests Failing #74

Closed stffrdhrn closed 5 years ago

stffrdhrn commented 5 years ago

When running or1k-tests.

Running or1k-alignillegalinsn PASS Running or1k-backtoback_jmp PASS Running or1k-basic PASS Running or1k-cmov PASS Running or1k-cy FAIL Running or1k-dsx TIME OUT Running or1k-dsxinsn PASS Running or1k-ext PASS Running or1k-ffl1 PASS Running or1k-icache PASS Running or1k-illegalinsn PASS Running or1k-illegalinsndelayslot FAIL Running or1k-insnfetchalign PASS Running or1k-insnfetcherror TIME OUT Running or1k-intloop PASS Running or1k-intmulticycle PASS Running or1k-intsyscall PASS Running or1k-inttickloop PASS Running or1k-jmp PASS Running or1k-jr PASS Running or1k-lsu TIME OUT Running or1k-lsualign PASS Running or1k-lsualigndelayslot PASS Running or1k-lsuerror PASS Running or1k-lsuerrordelayslot PASS Running or1k-lwjr PASS Running or1k-msync TIME OUT Running or1k-mul-basic PASS Running or1k-ov FAIL Running or1k-regjmp PASS Running or1k-rfe TIME OUT Running or1k-sf PASS Running or1k-sfbf TIME OUT Running or1k-shiftopts PASS Running or1k-shortbranch PASS Running or1k-shortjump FAIL Running or1k-systemcall PASS Running or1k-tickloop PASS Running or1k-tickrfforward TIME OUT Running or1k-ticksyscall PASS Running or1k-timer PASS Running or1k-trap TIME OUT Running or1k-trapdelayslot TIME OUT

stffrdhrn commented 5 years ago

List of issues after investigations

bandvig commented 5 years ago

@stffrdhrn I’ve examined architecture manual again and found that l.trap implementation doesn’t meet requirements. In particular the immediate value isn’t used for SR masking. At the same time it looks like such functionality isn’t implemented in CAPPUCCINO too. Is SR masking required by or1k-tests?

stffrdhrn commented 5 years ago

The SR checking is implemented in the test but for mor1kx it looks like we have adjusted the tests to pass. Lets try to fix it.

The asm from the test is here, you can see the "TODO" and the "For now..." https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-trap.S#L117

    /* Trap that shouldn't work, SR[9], flag bit, will be low, so
       shouldn't trap
    TODO - look at whether we drop the checking of bit in SR */
    l.trap  9
    l.addi  r2,r2,1 /* For now, just add this to make test pass*/

    /* Check results OK */
    l.sfeq  r1,r2
    l.bf    test_ok

The l.trap immediate is not documented too clearly in the spec I think. For example, what l.trap 9 means is, trap if SR[9] (FLAG) is set. l.trap 15 will always trap because SR[15] is Fixed One, or the index 15 bit of SR is always 1 so l.trap 15 will always trap / handover to debug unit.

stffrdhrn commented 5 years ago

For the or1k-shortjump the trace asm (https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-shortjump.S#L270) is:

2:  
    l.sfne  r5,r9
    l.bf    test_fail
    l.j 3f
    l.addi  r1,r1,1    // <--- this one is missing in trace.
3:
    l.j 4f

The trace is:

S 00002f80: e4254800 l.sfne  r5,r9                                  flag: 0              
S 00002f84: 100000f9 l.bf    0x00000f9                              flag: 0              
S 00002f88: 00000002 l.j     0x0000002                              flag: 0              
S 00002f90: 00000003 l.j     0x0000003                              flag: 0              
S 00002f94: 9c21ffff l.addi  r1,r1,0xffff    r1         = 000000f8  flag: 0              
S 00002f9c: 00000004 l.j     0x0000004                              flag: 0

It seems the delay slot l.addi instruction is not being executed.

bandvig commented 5 years ago

The l.trap immediate is not documented too clearly in the spec I think. For example, what l.trap 9 means is, trap if SR[9] (FLAG) is set. l.trap 15 will always trap because SR[15] is Fixed One, or the index 15 bit of SR is always 1 so l.trap 15 will always trap / handover to debug unit.

I think, It would be better to change immediate behavior from direct addressing of SR’s bits to masking. I mean l.trap imm should run exception handler if (|(imm & SR)) rather than (SR[imm]). The SR[imm] access is more consuming in terms of gate/luts. Nevertheless, I’ll fix l.trap behavior soon but without invoking immediate.

bandvig commented 5 years ago

For the or1k-shortjump the trace asm (https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-shortjump.S#L270) is:

2:    
  l.sfne  r5,r9
  l.bf    test_fail
  l.j 3f
  l.addi  r1,r1,1    // <--- this one is missing in trace.
3:
  l.j 4f

Huh, I designed MAROCCHINO in assumption that jump/branch instruction must not occupy delay slot. I’m not sure if architectural manual is clear enough about that, but the test is definitely first case I’ve seen till now. GCC4/5 don’t generate such code. What about your GCC9 port?

bandvig commented 5 years ago

@stffrdhrn I've pushed fix for l.trap behavior. Now or1k-trap* should pass.

stffrdhrn commented 5 years ago

@stffrdhrn I've pushed fix for l.trap behavior. Now or1k-trap* should pass.

Excellent work, it passes

$ ./runtests.sh MAROCCHINO or1k-trap*
Running tests for pipeline: MAROCCHINO, with test filter: or1k-trap*

Running or1k-trap                                           PASS
Running or1k-trapdelayslot                                  PASS
Results                                                     Total:   2  FAIL:   0
stffrdhrn commented 5 years ago

Huh, I designed MAROCCHINO in assumption that jump/branch instruction must not occupy delay slot. I’m not sure if architectural manual is clear enough about that, but the test is definitely first case I’ve seen till now. GCC4/5 don’t generate such code. What about your GCC9 port?

I don't believe GCC9 will generate code like this either. We can leave this one for later.

Also, I agree on l.trap that using the immediate is not optimal, that is probably why it was not implemented on cappuccino. Lets leave it for now. When we want to change it we should write a spec proposal.

stffrdhrn commented 5 years ago

FYI, for the timer related issues I investigated the tick timer, but I don't see any issue there. I am seeing instructions are taking a very long time to retire. It takes >30 clock cycles to execute the 6 instructions in the exception handler. After the handler returns with l.rfe one instruction executes then it returns to the exception handler. I will investigate more later.

Those instructions are here:

(https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-sfbf.S#L37)

    /* Timer exception, clear interrupt bit, and return */
    .org 0x500
    l.mfspr r11,r0, OR1K_SPR_TICK_TTMR_ADDR         // 2 clks
    l.movhi r12, hi(OR1K_SPR_TICK_TTMR_IP_MASK)     // 9 clks
    l.xor   r11, r11, r12                           // 17clks
    l.mtspr r0, r11, OR1K_SPR_TICK_TTMR_ADDR        // 2 clks
    l.addi  r13,r13,1                               // 9 clks
    l.rfe                                           // 18clks

Note, this is while caching is turned off.

stffrdhrn commented 5 years ago

This patch partially fixes the issue, but it seems there are issues with getting timers during DSX.

--- a/rtl/verilog/mor1kx_ctrl_marocchino.v
+++ b/rtl/verilog/mor1kx_ctrl_marocchino.v
@@ -395,7 +395,7 @@ module mor1kx_ctrl_marocchino
       casez({(wrbk_except_itlb_miss_i | wrbk_except_ipagefault_i |
               wrbk_except_ibus_err_i  |
               wrbk_except_illegal_i   | wrbk_except_dbus_align_i | wrbk_except_ibus_align_i),
-             wrbk_except_syscall_i,
+             (wrbk_except_syscall_i | wrbk_tt_interrupt_i | wrbk_pic_interrupt_i ),
              (wrbk_except_dtlb_miss_i | wrbk_except_dpagefault_i),
              wrbk_except_trap_i,
              sbuf_err_i,
bandvig commented 5 years ago

@stffrdhrn

This patch partially fixes the issue, but it seems there are issues with getting timers during DSX.

--- a/rtl/verilog/mor1kx_ctrl_marocchino.v
+++ b/rtl/verilog/mor1kx_ctrl_marocchino.v
@@ -395,7 +395,7 @@ module mor1kx_ctrl_marocchino
       casez({(wrbk_except_itlb_miss_i | wrbk_except_ipagefault_i |
               wrbk_except_ibus_err_i  |
               wrbk_except_illegal_i   | wrbk_except_dbus_align_i | wrbk_except_ibus_align_i),
-             wrbk_except_syscall_i,
+             (wrbk_except_syscall_i | wrbk_tt_interrupt_i | wrbk_pic_interrupt_i ),
              (wrbk_except_dtlb_miss_i | wrbk_except_dpagefault_i),
              wrbk_except_trap_i,
              sbuf_err_i,

You changed priority of exceptions for EPCR spr_epcr only, while there are similar case for EEAR spr_eear and exception vector excepton_pc_addr. Exceptions order for all of them must be consistent. In particular TT and PIC interrupts have lowest priority and they are processed in spr_epcr and spr_eear by default case. You’ve mentioned that timer exception handler takes over 30 CPU clocks. It would be normal for MAROCCHINO by two reasons. First, MAROCCHINO executes l.mtspr/l.mfspr only after all previous instructions being retired and OCB has become empty. Second, keep in mind that TT and PVT logic is in Wishbone clock domain mostly. As a result, each time l.mtspr/l.mfspr accesses TT/PIC control registers, two clock domain crosses occur even if (cpu_clk === wb_clk). Therefore l.mtspr/l.mfspr transaction takes more CPU clocks than in CAPPUCCINO for (cpu_clk === wb_clk) case. I would to propose:

stffrdhrn commented 5 years ago

I am sure the patch is a hack, but I don't think I was only playing with priority of exception.

The change I was attempting was to fix the value of spr epcr during timer interrupts. It was getting set to the previous PC instead of next PC (as per spec). So in the case that only one instruction can be executed after the timer the system will never make progress.

I'll revert the patch and post a gtkwave image.

On Fri, Feb 8, 2019, 10:19 PM Andrey Bacherov <notifications@github.com wrote:

@stffrdhrn https://github.com/stffrdhrn

This patch partially fixes the issue, but it seems there are issues with getting timers during DSX.

--- a/rtl/verilog/mor1kx_ctrl_marocchino.v

+++ b/rtl/verilog/mor1kx_ctrl_marocchino.v

@@ -395,7 +395,7 @@ module mor1kx_ctrl_marocchino

   casez({(wrbk_except_itlb_miss_i | wrbk_except_ipagefault_i |

           wrbk_except_ibus_err_i  |

           wrbk_except_illegal_i   | wrbk_except_dbus_align_i | wrbk_except_ibus_align_i),
  • wrbk_except_syscall_i,

  • (wrbk_except_syscall_i | wrbk_tt_interrupt_i | wrbk_pic_interrupt_i ),

          (wrbk_except_dtlb_miss_i | wrbk_except_dpagefault_i),
    
          wrbk_except_trap_i,
    
          sbuf_err_i,

You changed priority of exceptions for EPCR spr_epcr only, while there are similar case for EEAR spr_eear and exception vector excepton_pc_addr. Exceptions order for all of them must be consistent. In particular TT and PIC interrupts have lowest priority and they are processed in spr_epcr and spr_eear by default case. You’ve mentioned that timer exception handler takes over 30 CPU clocks. It would be normal for MAROCCHINO by two reasons. First, MAROCCHINO executes l.mtspr/l.mfspr only after all previous instructions being retired and OCB has become empty. Second, keep in mind that TT and PVT logic is in Wishbone clock domain mostly. As a result, each time l.mtspr/l.mfspr accesses TT/PIC control registers, two clock domain crosses occur even if ( cpu_clk === wb_clk). Therefore l.mtspr/l.mfspr transaction takes more CPU clocks than in CAPPUCCINO for (cpu_clk === wb_clk) case. I would to propose:

-

revert your “hack” with priority of exceptions for EPCR

try to play with interrupt interval ( https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-sfbf.S#L596). For example try to increase it twice, for example. Alternatively, try to make cpu_clk two or even three times faster than wb_clk.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openrisc/mor1kx/issues/74#issuecomment-461800036, or mute the thread https://github.com/notifications/unsubscribe-auth/ACSvn-54sulCrSzW5Mrk7lEsXZVIdMXWks5vLXlUgaJpZM4alvQ7 .

bandvig commented 5 years ago

I am sure the patch is a hack, but I don't think I was only playing with priority of exception. The change I was attempting was to fix the value of spr epcr during timer interrupts. It was getting set to the previous PC instead of next PC (as per spec). So in the case that only one instruction can be executed after the timer the system will never make progress. I'll revert the patch and post a gtkwave image.

Please, don't care about wave diagrams. It looks like it my fault. I read arch. manual not attentively enough. The “hack” could not fix the issue by the fact that with current implementation of TT/PIC interrupt processing any instruction which overlays such interrupt will be discarded completely including its write-back request. Therefore exactly the discarded instruction must be restarted by l.rfe. The behavior doesn’t meet arch. manual. I think I’ll fix it in one or two days.

bandvig commented 5 years ago

@stffrdhrn

At the same time I observe that after caches initialization there are intervals where either ICACHE or DCACHE become off while they should be ON till end of test if I understand correctly. I’m going to investigate that.

I can change processing of TT/PIC interrupts to meet arch manual requirement. It wouldn’t trivial either. However any case the timer interval should be increased to guarantee that TT interrupt handler takes less CPU clocks than the interval. Otherwise such tests become too long.

stffrdhrn commented 5 years ago

@stffrdhrn

I’ve went through CAPPUCCINO sources once more and found that it’s behavior doesn't meet arch manual too ... And MAROCCHINO’s behavior follows to CAPPUCCINO’s one as MAROCCHINO is rewrite of CAPPUCCINO.

OH, thats interesting. The reason I would think this would be a problem is when the last instruction was something like l.addi r1, r1, 0x1 . It the writeback to r1 is committed and we get a timer exception, after return it looks like it would run l.addi r1, r1, 0x1 again causing incorrect behavior.

Perhaps, that is not happening. I just checked cappuccino and I agree its showing the same behavior.

I’ve increased timer interval from 30 to 300 clocks and or1k-sfbf had passed. Tests os1k-lsu and or1k-msync should be the same however I hadn’t tried them.

OK, I will run all tests with the increase and we can see how it goes.

At the same time I observe that after caches initialization there are intervals where either ICACHE or DCACHE become off while they should be ON till end of test if I understand correctly. I’m going to investigate that.

I believe the test runs:

  1. SFBF tests 1-16 cache:off timer:off 1x
  2. SFBF tests 1-16 cache:on timer:off 2x
  3. SFBF tests 1-16 cache:on timer:on 1x
  4. SFBF tests 1-16 cache:off timer:on 1x
  5. SFBF tests 1-16 cache:on timer:on 1x

The reason being , that the _cache_init function uses xor to "enable" the cache bit. I think its probably wrong, but at least it test all the cases we are looking for.

I can change processing of TT/PIC interrupts to meet arch manual requirement. It wouldn’t trivial either. However any case the timer interval should be increased to guarantee that TT interrupt handler takes less CPU clocks than the interval. Otherwise such tests become too long.

I agree, lets leave as is. Thanks for helping to investigate.

stffrdhrn commented 5 years ago

I increased the TT timer value. Most tests are passing. I'll investigate the failures and let you know.

stffrdhrn commented 5 years ago

Full test results now look like:

$ ./runtests.sh MAROCCHINO
Running tests for pipeline: MAROCCHINO, with test filter: or1k-*

Running or1k-alignillegalinsn                               PASS
Running or1k-backtoback_jmp                                 PASS
Running or1k-basic                                          PASS    
Running or1k-cmov                                           PASS
Running or1k-cy                                             FAIL*
Running or1k-dsx                                            PASS
Running or1k-dsxinsn                                        PASS
Running or1k-ext                                            PASS
Running or1k-ffl1                                           PASS
Running or1k-icache                                         PASS
Running or1k-illegalinsn                                    PASS
Running or1k-illegalinsndelayslot                           FAIL**
Running or1k-insnfetchalign                                 PASS
Running or1k-insnfetcherror                                 TIME OUT
Running or1k-intloop                                        PASS
Running or1k-intmulticycle                                  PASS
Running or1k-intsyscall                                     PASS
Running or1k-inttickloop                                    PASS
Running or1k-jmp                                            PASS
Running or1k-jr                                             PASS
Running or1k-lsu                                            PASS
Running or1k-lsualign                                       PASS
Running or1k-lsualigndelayslot                              PASS
Running or1k-lsuerror                                       PASS
Running or1k-lsuerrordelayslot                              PASS
Running or1k-lwjr                                           PASS
Running or1k-msync                                          PASS
Running or1k-mul-basic                                      PASS
Running or1k-ov                                             FAIL**
Running or1k-regjmp                                         PASS
Running or1k-rfe                                            PASS
Running or1k-sf                                             PASS
Running or1k-sfbf                                           PASS
Running or1k-shiftopts                                      PASS
Running or1k-shortbranch                                    PASS
Running or1k-shortjump                                      FAIL**
Running or1k-systemcall                                     PASS
Running or1k-tickloop                                       PASS
Running or1k-tickrfforward                                  TIME OUT
Running or1k-ticksyscall                                    PASS
Running or1k-timer                                          PASS
Running or1k-trap                                           PASS
Running or1k-trapdelayslot                                  PASS
Results                                                     Total:  43  FAIL:   6
bandvig commented 5 years ago

@stffrdhrn Be careful with replacement of l.add.d for or1k-illegalinsn. The l.cust8 serves internally as “empty” output of IFETCH, while other l.cust* could be used.

stffrdhrn commented 5 years ago

@bandvig

Be careful with replacement of l.add.d for or1k-illegalinsn. The l.cust8 serves internally as “empty” output of IFETCH, while other l.cust* could be used.

I was thinking to use lf.rem.s. It seems to work, and I dont see anyone ever wanting to implement that.

stffrdhrn commented 5 years ago

For the timeout issues

bandvig commented 5 years ago

@stffrdhrn

Be careful with replacement of l.add.d for or1k-illegalinsn. The l.cust8 serves internally as “empty” output of IFETCH, while other l.cust* could be used.

I was thinking to use lf.rem.s. It seems to work, and I dont see anyone ever wanting to implement that.

:) Actually, I'm thinking to implement lf.rem.s for CAPPUCCINO and MAROCCHINO and lf.rem.d for MAROCCHINO to complete FPU-functionality. Not sure how much time it will take. An l.cust* is really more safety.

bandvig commented 5 years ago

@stffrdhrn

  • or1k-insnfetcherror - the test case does a l.j 0xee000000, it seems marocchino sets epcr to 0xee00000 instead of the guilty instruction address.

Yes, it could be. Currently IFETCH haven't got any information if IBUS error caused by jump/branch instruction or by regular PC increment.

  • or1k-ov - marocchino doesn't support signed overflow

However, it should support. There is a field for investigation here.

bandvig commented 5 years ago

@stffrdhrn

Ah, perhaps you mean that or1k-ov fails because MAROCCHINO's multiplier doesn't support overflow.

stffrdhrn commented 5 years ago

Ah, perhaps you mean that or1k-ov fails because MAROCCHINO's multiplier doesn't support overflow.

Yes, that is what I meant. Sorry I should have mentioned multiplier. It seems easy to support as per cappuccino.

bandvig commented 5 years ago

@stffrdhrn I've pushed fix for or1k-insnfetcherror. As usual I just restored logic from CAPPUCCINO.

bandvig commented 5 years ago

Be careful with replacement of l.add.d for or1k-illegalinsn. The l.cust8 serves internally as “empty” output of IFETCH, while other l.cust* could be used.

I was thinking to use lf.rem.s. It seems to work, and I dont see anyone ever wanting to implement that.

:) Actually, I'm thinking to implement lf.rem.s for CAPPUCCINO and MAROCCHINO and lf.rem.d for MAROCCHINO to complete FPU-functionality. Not sure how much time it will take. An l.cust* is really more safety.

Oh, I forgot that l.cust* are not supported by binutils. Perhaps something like l.madd?

stffrdhrn commented 5 years ago

@bandvig

l.cust* does seem to be supported the other parts of the illegal instruction test uses them. I'll use them.

Btw, it seems we have discovered all of the reasons for test failures and illiminated timeouts. I'll work on automating running the tests on check-in.

stffrdhrn commented 5 years ago

I am closing this as these are all known failures. Also, Marocchino has its own repo now. Additional issues can be opened there.

https://github.com/openrisc/or1k_marocchino