Closed stffrdhrn closed 5 years ago
List of issues after investigations
or1k-trap*
- marocchino only handles trap via debug unit, and does not go to the exception handler. If debug unit handover is not configured we should still execute the exception handler. Currently if l.trap
instruction is encountered the cpu stalls.or1k-shortjump
- failing to execute instruction in delay slot after 2 branches in a row. See comment below. (https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-shortjump.S#L270)or1k-sfbf
- this one seems to be running, but the tick timer looks to be misbehaving. The test runs a few loops, during the 3rd loop it enables the tick timer. It looks like there is a bug here as after tick timer goes off the first time tt_rdy_o
is staying active almost always, starving instructions from running.or1k-msync
- seems a similar issue to or1k-sfbf
; after timer is enabled the cpu keeps executing but timer is getting fired too much.or1k-lsu
- similar to or1k-sfbf
and or1k-msync
@stffrdhrn I’ve examined architecture manual again and found that l.trap implementation doesn’t meet requirements. In particular the immediate value isn’t used for SR masking. At the same time it looks like such functionality isn’t implemented in CAPPUCCINO too. Is SR masking required by or1k-tests?
The SR checking is implemented in the test but for mor1kx it looks like we have adjusted the tests to pass. Lets try to fix it.
The asm from the test is here, you can see the "TODO" and the "For now..." https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-trap.S#L117
/* Trap that shouldn't work, SR[9], flag bit, will be low, so
shouldn't trap
TODO - look at whether we drop the checking of bit in SR */
l.trap 9
l.addi r2,r2,1 /* For now, just add this to make test pass*/
/* Check results OK */
l.sfeq r1,r2
l.bf test_ok
The l.trap
immediate is not documented too clearly in the spec I think. For example, what l.trap 9
means is, trap if SR[9]
(FLAG) is set. l.trap 15
will always trap because SR[15]
is Fixed One, or the index 15 bit of SR is always 1
so l.trap 15
will always trap / handover to debug unit.
For the or1k-shortjump
the trace asm (https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-shortjump.S#L270) is:
2:
l.sfne r5,r9
l.bf test_fail
l.j 3f
l.addi r1,r1,1 // <--- this one is missing in trace.
3:
l.j 4f
The trace is:
S 00002f80: e4254800 l.sfne r5,r9 flag: 0
S 00002f84: 100000f9 l.bf 0x00000f9 flag: 0
S 00002f88: 00000002 l.j 0x0000002 flag: 0
S 00002f90: 00000003 l.j 0x0000003 flag: 0
S 00002f94: 9c21ffff l.addi r1,r1,0xffff r1 = 000000f8 flag: 0
S 00002f9c: 00000004 l.j 0x0000004 flag: 0
It seems the delay slot l.addi
instruction is not being executed.
The
l.trap
immediate is not documented too clearly in the spec I think. For example, whatl.trap 9
means is, trap ifSR[9]
(FLAG) is set.l.trap 15
will always trap becauseSR[15]
is Fixed One, or the index 15 bit of SR is always1
sol.trap 15
will always trap / handover to debug unit.
I think, It would be better to change immediate behavior from direct addressing of SR’s bits to masking. I mean l.trap imm
should run exception handler if (|(imm & SR))
rather than (SR[imm])
. The SR[imm]
access is more consuming in terms of gate/luts.
Nevertheless, I’ll fix l.trap
behavior soon but without invoking immediate.
For the
or1k-shortjump
the trace asm (https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-shortjump.S#L270) is:2: l.sfne r5,r9 l.bf test_fail l.j 3f l.addi r1,r1,1 // <--- this one is missing in trace. 3: l.j 4f
Huh, I designed MAROCCHINO in assumption that jump/branch instruction must not occupy delay slot. I’m not sure if architectural manual is clear enough about that, but the test is definitely first case I’ve seen till now. GCC4/5 don’t generate such code. What about your GCC9 port?
@stffrdhrn
I've pushed fix for l.trap behavior. Now or1k-trap*
should pass.
@stffrdhrn I've pushed fix for l.trap behavior. Now
or1k-trap*
should pass.
Excellent work, it passes
$ ./runtests.sh MAROCCHINO or1k-trap*
Running tests for pipeline: MAROCCHINO, with test filter: or1k-trap*
Running or1k-trap PASS
Running or1k-trapdelayslot PASS
Results Total: 2 FAIL: 0
Huh, I designed MAROCCHINO in assumption that jump/branch instruction must not occupy delay slot. I’m not sure if architectural manual is clear enough about that, but the test is definitely first case I’ve seen till now. GCC4/5 don’t generate such code. What about your GCC9 port?
I don't believe GCC9 will generate code like this either. We can leave this one for later.
Also, I agree on l.trap
that using the immediate is not optimal, that is probably why it was not implemented on cappuccino. Lets leave it for now. When we want to change it we should write a spec proposal.
FYI, for the timer related issues I investigated the tick timer, but I don't see any issue there. I am seeing instructions are taking a very long time to retire. It takes >30 clock cycles to execute the 6 instructions in the exception handler. After the handler returns with l.rfe
one instruction executes then it returns to the exception handler. I will investigate more later.
Those instructions are here:
(https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-sfbf.S#L37)
/* Timer exception, clear interrupt bit, and return */
.org 0x500
l.mfspr r11,r0, OR1K_SPR_TICK_TTMR_ADDR // 2 clks
l.movhi r12, hi(OR1K_SPR_TICK_TTMR_IP_MASK) // 9 clks
l.xor r11, r11, r12 // 17clks
l.mtspr r0, r11, OR1K_SPR_TICK_TTMR_ADDR // 2 clks
l.addi r13,r13,1 // 9 clks
l.rfe // 18clks
Note, this is while caching is turned off.
This patch partially fixes the issue, but it seems there are issues with getting timers during DSX.
--- a/rtl/verilog/mor1kx_ctrl_marocchino.v
+++ b/rtl/verilog/mor1kx_ctrl_marocchino.v
@@ -395,7 +395,7 @@ module mor1kx_ctrl_marocchino
casez({(wrbk_except_itlb_miss_i | wrbk_except_ipagefault_i |
wrbk_except_ibus_err_i |
wrbk_except_illegal_i | wrbk_except_dbus_align_i | wrbk_except_ibus_align_i),
- wrbk_except_syscall_i,
+ (wrbk_except_syscall_i | wrbk_tt_interrupt_i | wrbk_pic_interrupt_i ),
(wrbk_except_dtlb_miss_i | wrbk_except_dpagefault_i),
wrbk_except_trap_i,
sbuf_err_i,
@stffrdhrn
This patch partially fixes the issue, but it seems there are issues with getting timers during DSX.
--- a/rtl/verilog/mor1kx_ctrl_marocchino.v +++ b/rtl/verilog/mor1kx_ctrl_marocchino.v @@ -395,7 +395,7 @@ module mor1kx_ctrl_marocchino casez({(wrbk_except_itlb_miss_i | wrbk_except_ipagefault_i | wrbk_except_ibus_err_i | wrbk_except_illegal_i | wrbk_except_dbus_align_i | wrbk_except_ibus_align_i), - wrbk_except_syscall_i, + (wrbk_except_syscall_i | wrbk_tt_interrupt_i | wrbk_pic_interrupt_i ), (wrbk_except_dtlb_miss_i | wrbk_except_dpagefault_i), wrbk_except_trap_i, sbuf_err_i,
You changed priority of exceptions for EPCR spr_epcr
only, while there are similar case
for EEAR spr_eear
and exception vector excepton_pc_addr
. Exceptions order for all of them must be consistent. In particular TT and PIC interrupts have lowest priority and they are processed in spr_epcr
and spr_eear
by default
case.
You’ve mentioned that timer exception handler takes over 30 CPU clocks. It would be normal for MAROCCHINO by two reasons. First, MAROCCHINO executes l.mtspr
/l.mfspr
only after all previous instructions being retired and OCB has become empty. Second, keep in mind that TT and PVT logic is in Wishbone clock domain mostly. As a result, each time l.mtspr
/l.mfspr
accesses TT/PIC control registers, two clock domain crosses occur even if (cpu_clk
=== wb_clk
). Therefore l.mtspr
/l.mfspr
transaction takes more CPU clocks than in CAPPUCCINO for (cpu_clk
=== wb_clk
) case.
I would to propose:
revert your “hack” with priority of exceptions for EPCR
try to play with interrupt interval (https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-sfbf.S#L596). For example try to increase it twice, for example. Alternatively, try to make cpu_clk
two or even three times faster than wb_clk
.
I am sure the patch is a hack, but I don't think I was only playing with priority of exception.
The change I was attempting was to fix the value of spr epcr during timer interrupts. It was getting set to the previous PC instead of next PC (as per spec). So in the case that only one instruction can be executed after the timer the system will never make progress.
I'll revert the patch and post a gtkwave image.
On Fri, Feb 8, 2019, 10:19 PM Andrey Bacherov <notifications@github.com wrote:
@stffrdhrn https://github.com/stffrdhrn
This patch partially fixes the issue, but it seems there are issues with getting timers during DSX.
--- a/rtl/verilog/mor1kx_ctrl_marocchino.v
+++ b/rtl/verilog/mor1kx_ctrl_marocchino.v
@@ -395,7 +395,7 @@ module mor1kx_ctrl_marocchino
casez({(wrbk_except_itlb_miss_i | wrbk_except_ipagefault_i | wrbk_except_ibus_err_i | wrbk_except_illegal_i | wrbk_except_dbus_align_i | wrbk_except_ibus_align_i),
wrbk_except_syscall_i,
(wrbk_except_syscall_i | wrbk_tt_interrupt_i | wrbk_pic_interrupt_i ),
(wrbk_except_dtlb_miss_i | wrbk_except_dpagefault_i), wrbk_except_trap_i, sbuf_err_i,
You changed priority of exceptions for EPCR spr_epcr only, while there are similar case for EEAR spr_eear and exception vector excepton_pc_addr. Exceptions order for all of them must be consistent. In particular TT and PIC interrupts have lowest priority and they are processed in spr_epcr and spr_eear by default case. You’ve mentioned that timer exception handler takes over 30 CPU clocks. It would be normal for MAROCCHINO by two reasons. First, MAROCCHINO executes l.mtspr/l.mfspr only after all previous instructions being retired and OCB has become empty. Second, keep in mind that TT and PVT logic is in Wishbone clock domain mostly. As a result, each time l.mtspr/l.mfspr accesses TT/PIC control registers, two clock domain crosses occur even if ( cpu_clk === wb_clk). Therefore l.mtspr/l.mfspr transaction takes more CPU clocks than in CAPPUCCINO for (cpu_clk === wb_clk) case. I would to propose:
-
revert your “hack” with priority of exceptions for EPCR
try to play with interrupt interval ( https://github.com/openrisc/or1k-tests/blob/master/native/or1k/or1k-sfbf.S#L596). For example try to increase it twice, for example. Alternatively, try to make cpu_clk two or even three times faster than wb_clk.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openrisc/mor1kx/issues/74#issuecomment-461800036, or mute the thread https://github.com/notifications/unsubscribe-auth/ACSvn-54sulCrSzW5Mrk7lEsXZVIdMXWks5vLXlUgaJpZM4alvQ7 .
I am sure the patch is a hack, but I don't think I was only playing with priority of exception. The change I was attempting was to fix the value of spr epcr during timer interrupts. It was getting set to the previous PC instead of next PC (as per spec). So in the case that only one instruction can be executed after the timer the system will never make progress. I'll revert the patch and post a gtkwave image.
Please, don't care about wave diagrams. It looks like it my fault. I read arch. manual not attentively enough. The “hack” could not fix the issue by the fact that with current implementation of TT/PIC interrupt processing any instruction which overlays such interrupt will be discarded completely including its write-back request. Therefore exactly the discarded instruction must be restarted by l.rfe. The behavior doesn’t meet arch. manual. I think I’ll fix it in one or two days.
@stffrdhrn
I’ve went through CAPPUCCINO sources once more and found that it’s behavior doesn't meet arch manual too (https://github.com/openrisc/mor1kx/blob/aa4ea5a4a2c557b9ec4492dff89ed39e25a31557/rtl/verilog/mor1kx_ctrl_cappuccino.v#L375 and https://github.com/openrisc/mor1kx/blob/aa4ea5a4a2c557b9ec4492dff89ed39e25a31557/rtl/verilog/mor1kx_ctrl_cappuccino.v#L785). And MAROCCHINO’s behavior follows to CAPPUCCINO’s one as MAROCCHINO is rewrite of CAPPUCCINO.
I’ve increased timer interval from 30 to 300 clocks and or1k-sfbf
had passed. Tests os1k-lsu
and or1k-msync
should be the same however I hadn’t tried them.
At the same time I observe that after caches initialization there are intervals where either ICACHE or DCACHE become off while they should be ON till end of test if I understand correctly. I’m going to investigate that.
I can change processing of TT/PIC interrupts to meet arch manual requirement. It wouldn’t trivial either. However any case the timer interval should be increased to guarantee that TT interrupt handler takes less CPU clocks than the interval. Otherwise such tests become too long.
@stffrdhrn
I’ve went through CAPPUCCINO sources once more and found that it’s behavior doesn't meet arch manual too ... And MAROCCHINO’s behavior follows to CAPPUCCINO’s one as MAROCCHINO is rewrite of CAPPUCCINO.
OH, thats interesting. The reason I would think this would be a problem is when the last instruction was something like l.addi r1, r1, 0x1
. It the writeback to r1
is committed and we get a timer exception, after return it looks like it would run l.addi r1, r1, 0x1
again causing incorrect behavior.
Perhaps, that is not happening. I just checked cappuccino and I agree its showing the same behavior.
I’ve increased timer interval from 30 to 300 clocks and
or1k-sfbf
had passed. Testsos1k-lsu
andor1k-msync
should be the same however I hadn’t tried them.
OK, I will run all tests with the increase and we can see how it goes.
At the same time I observe that after caches initialization there are intervals where either ICACHE or DCACHE become off while they should be ON till end of test if I understand correctly. I’m going to investigate that.
I believe the test runs:
cache:off
timer:off
1xcache:on
timer:off
2xcache:on
timer:on
1xcache:off
timer:on
1xcache:on
timer:on
1xThe reason being , that the _cache_init
function uses xor
to "enable" the cache bit. I think its probably wrong, but at least it test all the cases we are looking for.
I can change processing of TT/PIC interrupts to meet arch manual requirement. It wouldn’t trivial either. However any case the timer interval should be increased to guarantee that TT interrupt handler takes less CPU clocks than the interval. Otherwise such tests become too long.
I agree, lets leave as is. Thanks for helping to investigate.
I increased the TT timer value. Most tests are passing. I'll investigate the failures and let you know.
Full test results now look like:
$ ./runtests.sh MAROCCHINO
Running tests for pipeline: MAROCCHINO, with test filter: or1k-*
Running or1k-alignillegalinsn PASS
Running or1k-backtoback_jmp PASS
Running or1k-basic PASS
Running or1k-cmov PASS
Running or1k-cy FAIL*
Running or1k-dsx PASS
Running or1k-dsxinsn PASS
Running or1k-ext PASS
Running or1k-ffl1 PASS
Running or1k-icache PASS
Running or1k-illegalinsn PASS
Running or1k-illegalinsndelayslot FAIL**
Running or1k-insnfetchalign PASS
Running or1k-insnfetcherror TIME OUT
Running or1k-intloop PASS
Running or1k-intmulticycle PASS
Running or1k-intsyscall PASS
Running or1k-inttickloop PASS
Running or1k-jmp PASS
Running or1k-jr PASS
Running or1k-lsu PASS
Running or1k-lsualign PASS
Running or1k-lsualigndelayslot PASS
Running or1k-lsuerror PASS
Running or1k-lsuerrordelayslot PASS
Running or1k-lwjr PASS
Running or1k-msync PASS
Running or1k-mul-basic PASS
Running or1k-ov FAIL**
Running or1k-regjmp PASS
Running or1k-rfe PASS
Running or1k-sf PASS
Running or1k-sfbf PASS
Running or1k-shiftopts PASS
Running or1k-shortbranch PASS
Running or1k-shortjump FAIL**
Running or1k-systemcall PASS
Running or1k-tickloop PASS
Running or1k-tickrfforward TIME OUT
Running or1k-ticksyscall PASS
Running or1k-timer PASS
Running or1k-trap PASS
Running or1k-trapdelayslot PASS
Results Total: 43 FAIL: 6
or1k-cy
- is a known issue as mul doesnt support carryor1k-shortjump
- is the issue we saw before, not supported by marocchinoor1k-ov
- marocchino doesn't support signed overflowor1k-illegalinsn*
- the test uses lf.add.d as the illegal instruction, but it's not on marocchino. Need to fix test.or1k-*
- others are timeouts I need to look at@stffrdhrn
Be careful with replacement of l.add.d for or1k-illegalinsn
. The l.cust8
serves internally as “empty” output of IFETCH, while other l.cust*
could be used.
@bandvig
Be careful with replacement of l.add.d for
or1k-illegalinsn
. Thel.cust8
serves internally as “empty” output of IFETCH, while otherl.cust*
could be used.
I was thinking to use lf.rem.s
. It seems to work, and I dont see anyone ever wanting to implement that.
For the timeout issues
or1k-insnfetcherror
- the test case does a l.j 0xee000000
, it seems marocchino sets epcr to 0xee00000 instead of the guilty instruction address.or1k-tickrfforward
- this case is just due to too many timeouts adjusting the timer from 100 to 600 cycles allows the test to pass.@stffrdhrn
Be careful with replacement of l.add.d for
or1k-illegalinsn
. Thel.cust8
serves internally as “empty” output of IFETCH, while otherl.cust*
could be used.I was thinking to use
lf.rem.s
. It seems to work, and I dont see anyone ever wanting to implement that.
:) Actually, I'm thinking to implement lf.rem.s
for CAPPUCCINO and MAROCCHINO and lf.rem.d
for MAROCCHINO to complete FPU-functionality. Not sure how much time it will take. An l.cust*
is really more safety.
@stffrdhrn
or1k-insnfetcherror
- the test case does al.j 0xee000000
, it seems marocchino sets epcr to 0xee00000 instead of the guilty instruction address.
Yes, it could be. Currently IFETCH haven't got any information if IBUS error caused by jump/branch instruction or by regular PC increment.
or1k-ov
- marocchino doesn't support signed overflow
However, it should support. There is a field for investigation here.
@stffrdhrn
Ah, perhaps you mean that or1k-ov
fails because MAROCCHINO's multiplier doesn't support overflow.
Ah, perhaps you mean that or1k-ov fails because MAROCCHINO's multiplier doesn't support overflow.
Yes, that is what I meant. Sorry I should have mentioned multiplier. It seems easy to support as per cappuccino.
@stffrdhrn
I've pushed fix for or1k-insnfetcherror
. As usual I just restored logic from CAPPUCCINO.
Be careful with replacement of l.add.d for
or1k-illegalinsn
. Thel.cust8
serves internally as “empty” output of IFETCH, while otherl.cust*
could be used.I was thinking to use
lf.rem.s
. It seems to work, and I dont see anyone ever wanting to implement that.:) Actually, I'm thinking to implement
lf.rem.s
for CAPPUCCINO and MAROCCHINO andlf.rem.d
for MAROCCHINO to complete FPU-functionality. Not sure how much time it will take. Anl.cust*
is really more safety.
Oh, I forgot that l.cust*
are not supported by binutils. Perhaps something like l.madd
?
@bandvig
l.cust*
does seem to be supported the other parts of the illegal instruction test uses them. I'll use them.
Btw, it seems we have discovered all of the reasons for test failures and illiminated timeouts. I'll work on automating running the tests on check-in.
I am closing this as these are all known failures. Also, Marocchino has its own repo now. Additional issues can be opened there.
When running or1k-tests.
Running or1k-alignillegalinsn PASS Running or1k-backtoback_jmp PASS Running or1k-basic PASS Running or1k-cmov PASS Running or1k-cy FAIL Running or1k-dsx TIME OUT Running or1k-dsxinsn PASS Running or1k-ext PASS Running or1k-ffl1 PASS Running or1k-icache PASS Running or1k-illegalinsn PASS Running or1k-illegalinsndelayslot FAIL Running or1k-insnfetchalign PASS Running or1k-insnfetcherror TIME OUT Running or1k-intloop PASS Running or1k-intmulticycle PASS Running or1k-intsyscall PASS Running or1k-inttickloop PASS Running or1k-jmp PASS Running or1k-jr PASS Running or1k-lsu TIME OUT Running or1k-lsualign PASS Running or1k-lsualigndelayslot PASS Running or1k-lsuerror PASS Running or1k-lsuerrordelayslot PASS Running or1k-lwjr PASS Running or1k-msync TIME OUT Running or1k-mul-basic PASS Running or1k-ov FAIL Running or1k-regjmp PASS Running or1k-rfe TIME OUT Running or1k-sf PASS Running or1k-sfbf TIME OUT Running or1k-shiftopts PASS Running or1k-shortbranch PASS Running or1k-shortjump FAIL Running or1k-systemcall PASS Running or1k-tickloop PASS Running or1k-tickrfforward TIME OUT Running or1k-ticksyscall PASS Running or1k-timer PASS Running or1k-trap TIME OUT Running or1k-trapdelayslot TIME OUT