Linux crashes with latest master

stffrdhrn commented 2 years ago

It seems to work up to about 2021 Feb, but not after that.

The failure looks like the output below. It happens randomly, either during boot or after running user apps for a few minutes. The issue seems to be that a memory load puts 0x0 into a register, which is then dereferenced. The memory at the load address does not actually contain 0x0.

[   11.540000] Unable to handle kernel NULL pointer dereference
[   11.540000]  at virtual address 0x00000274
[   11.550000]
[   11.550000] Oops#: 0000
[   11.550000] CPU #: 0
[   11.550000]    PC: c00b5098    SR: 0000827f    SP: c0b89e60
[   11.560000] GPR00: 00000000 GPR01: c0b89e60 GPR02: c0b89eb4 GPR03: c0b9cfd0
[   11.570000] GPR04: 30642000 GPR05: c0ae8c84 GPR06: 00000001 GPR07: cfed220c
[   11.580000] GPR08: cfed2224 GPR09: c00b4c18 GPR10: c0b88000 GPR11: 00000000
[   11.580000] GPR12: 00000000 GPR13: 00000000 GPR14: 00000001 GPR15: 00000000
[   11.590000] GPR16: 00000000 GPR17: 00000000 GPR18: c0b9cfd0 GPR19: 00000000
[   11.600000] GPR20: 00000000 GPR21: 00000000 GPR22: 00000000 GPR23: 00000000
[   11.600000] GPR24: cffd7880 GPR25: 30642010 GPR26: c0ac4d3c GPR27: fffffff9
[   11.610000] GPR28: 00000255 GPR29: 00008079 GPR30: c0b9cfd0 GPR31: 00000000
[   11.620000]   RES: 00000000 oGPR11: ffffffff
[   11.620000] Process wc (pid: 84, stackpage=c0afe7e0)
[   11.630000]
[   11.630000] Stack:
[   11.630000] Call trace:
[   11.630000] [<(ptrval)>] handle_mm_fault+0x324/0x7b4
[   11.640000] [<(ptrval)>] do_page_fault+0x23c/0x47c
[   11.640000] [<(ptrval)>] ? _data_page_fault_handler+0x104/0x10c
[   11.650000]
[   11.650000] Code:
[   11.650000]  c00b5080:       d7e28fac
[   11.650000]  c00b5084:       bc0b0000
[   11.650000]  c00b5088:       8482ffbc
[   11.660000]  c00b508c:       1a000000
[   11.660000]  c00b5090:       a6310001
[   11.660000]  c00b5094:       07fd4e67
[   11.670000] (c00b5098:)      86330274
[   11.670000]  c00b509c:       9c21ffec
[   11.670000]  c00b50a0:       d4019004
[   11.680000]  c00b50a4:       e2031804
[   11.680000]  c00b50a8:       bc110000
[   11.680000]  c00b50ac:       a6311fff

Failing code:

c00b49b8:       10 00 00 0a     l.bf c00b49e0 <handle_mm_fault+0xec>
c00b49bc:       a6 30 00 04     l.andi r17,r16,0x4
c00b49c0:       a6 b4 00 20     l.andi r21,r20,0x20
c00b49c4:       e2 91 a8 04     l.or r20,r17,r21
c00b49c8:       bc 14 00 00     l.sfeqi r20,0
c00b49cc:       10 00 01 b3     l.bf c00b5098 <handle_mm_fault+0x7a4>  <--- go to 1:
c00b49d0:       86 6a 00 00      l.lwz r19,0(r10)                               <-- load data at r10 into r19 (mem inspection shows data is there)
c00b49d4:       86 33 02 78     l.lwz r17,632(r19)
...

c00b5084:       84 82 ff bc     l.lwz r4,-68(r2)
c00b5088:       07 fd 4e 67     l.jal c0008a24 <local_flush_tlb_page>
c00b508c:       84 62 ff b0     l.lwz r3,-80(r2)
c00b5090:       03 ff ff f5     l.j c00b5064 <handle_mm_fault+0x770>
c00b5094:       15 00 00 00     l.nop 0x0
1:
c00b5098:       86 33 02 74     l.lwz r17,628(r19)         <--- r19 is 0x0
c00b509c:       9e 31 00 01     l.addi r17,r17,1
c00b50a0:       03 ff fe 50     l.j c00b49e0 <handle_mm_fault+0xec>
c00b50a4:       d4 13 8a 74     l.sw 628(r19),r17
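
As a cross-check of the oops against the disassembly, the faulting instruction word `86330274` can be decoded by hand. The sketch below assumes the usual OR1K load encoding (6-bit opcode, 5-bit rD, 5-bit rA, 16-bit signed immediate); the struct and function names are made up for illustration. With r19 = 0 the effective address comes out as 0x274, which matches the reported virtual address.

```c
#include <stdint.h>

/* Fields of an OR1K load with a 16-bit immediate, such as l.lwz.
 * Field layout assumed: opcode[31:26] rD[25:21] rA[20:16] imm[15:0]. */
struct or1k_load {
    unsigned opcode; /* bits 31..26 */
    unsigned rd;     /* bits 25..21, destination register */
    unsigned ra;     /* bits 20..16, base register */
    int32_t  imm;    /* bits 15..0, sign-extended offset */
};

/* Split a raw instruction word into its fields. */
static struct or1k_load decode_load(uint32_t insn)
{
    struct or1k_load d;
    d.opcode = insn >> 26;
    d.rd     = (insn >> 21) & 0x1f;
    d.ra     = (insn >> 16) & 0x1f;
    d.imm    = (int16_t)(insn & 0xffff);
    return d;
}

/* Address the load touches for a given value of the base register. */
static uint32_t effective_address(struct or1k_load d, uint32_t base)
{
    return base + (uint32_t)d.imm;
}
```

Calling `decode_load(0x86330274)` gives rD = 17, rA = 19 and an offset of 628, and `effective_address(..., 0)` gives 0x274, so the NULL dereference is consistent with r19 holding 0 at c00b5098.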

stffrdhrn commented 2 years ago

Bisected to this commit: 040a89f

< shorne@antec ~/work/litex/pythondata-cpu-mor1kx/pythondata_cpu_mor1kx/verilog > git lo
2021-07-01 040a89f Stafford Horne   dcache: Allow writing during write_pending  (HEAD, refs/bisect/bad)
2021-07-01 37ce026 Stafford Horne   lsu: Simply the logic for lsu_ack  (refs/bisect/good-37ce0265681fc09cc90945525106f71a8ee9745b)
2021-07-01 8c4b595 Stafford Horne   lsu: Hookup dcache hit output to CPU
2021-07-01 41497c6 Stafford Horne   immu: Fix issue with multiple mtspr insns
2021-06-25 9e5bbc5 Stafford Horne   Cleanups: Also remove unused inputs in *espresso  (refs/bisect/good-9e5bbc5996c55a92222ceeefa78be4b9cf0fafd5)
2021-06-25 1f21eab Stafford Horne   Cleanups: Remove misleading unused inputs
2021-06-25 01e5ee1 Stafford Horne   Fix mulitple monitor block issue with iverilog
2021-06-20 e902782 Stafford Horne   Merge pull request #115 from openrisc/unf_fix
2021-06-19 ebab3c4 Stafford Horne   Merge pull request #121 from stffrdhrn/or1k-tests-update
2021-06-19 9260ff6 Stafford Horne   workflows: Update or1k-tests to 1.0.3  (shorne/or1k-tests-update, or1k-tests-update)

stffrdhrn commented 2 years ago

There are 3 tests that we need to get working now (this is similar to #122):

- Formal mor1kx_dcache: failing due to invalidate_ack being constantly high
- Test suite or1k-mmu: failing with back-to-back mtspr instructions
- Kernel stability: running Linux stably when running the glibc test suite - after FIXED

stffrdhrn commented 2 years ago

This patch is part of the reverted change. After applying it, or1k-mmu passes, but Linux becomes unstable.

diff --git a/rtl/verilog/mor1kx_dcache.v b/rtl/verilog/mor1kx_dcache.v
index eb30f27..c04a61d 100644
--- a/rtl/verilog/mor1kx_dcache.v
+++ b/rtl/verilog/mor1kx_dcache.v
@@ -451,6 +451,8 @@ module mor1kx_dcache
                 invalidate_adr <= spr_bus_dat_i[WAY_WIDTH-1:OPTION_DCACHE_BLOCK_WIDTH];

                 state <= INVALIDATE;
+             end else if (cpu_we_i | write_pending) begin
+                state <= WRITE;
              end else begin
                 state <= IDLE;
              end
@@ -534,7 +536,7 @@ module mor1kx_dcache

           WRITE: begin
              way_wr_dat = cpu_dat_i;
-             if (hit & cpu_req_i) begin
+             if (hit & (cpu_req_i | write_pending)) begin
                 /* Mux cache output with write data */
                 if (!cpu_bsel_i[3])
                   way_wr_dat[31:24] = cpu_dat_o[31:24];

stffrdhrn commented 2 years ago

In the PR I uploaded some new patches that update the toolchain so that the issue can be reproduced. This is similar to #122 after the revert for now, but we have to think of a different way to fix it.

The code from or1k-mmu that fails:

https://github.com/openrisc/or1k-tests/blob/665abea1996a9a304ef11b11ac346f6593bdd581/native/or1k/or1k-mmu.c#L1137-L1174

Assembled:

    2c38:       d7 f9 87 f8     l.sw -8(r25),r16
    2c3c:       1a 20 15 00     l.movhi r17,0x1500
    2c40:       d7 f9 8f fc     l.sw -4(r25),r17
    2c44:       d4 19 80 00     l.sw 0(r25),r16
    2c48:       d4 19 88 04     l.sw 4(r25),r17
    2c4c:       d4 79 87 f8     l.sw 8184(r25),r16
    2c50:       d4 79 8f fc     l.sw 8188(r25),r17
    2c54:       1a a0 00 00     l.movhi r21,0x0
    2c58:       9e b5 3a cc     l.addi r21,r21,15052
    2c5c:       d4 99 80 00     l.sw 8192(r25),r16
    2c60:       d4 02 a8 40     l.sw 64(r2),r21
    2c64:       d4 02 a8 24     l.sw 36(r2),r21
    2c68:       d4 99 88 04     l.sw 8196(r25),r17
    2c6c:       9f b9 ff f8     l.addi r29,r25,-8
    2c70:       9e f9 1f f8     l.addi r23,r25,8184
    2c74:       9e b9 20 00     l.addi r21,r25,8192
    2c78:       aa 20 18 02     l.ori r17,r0,0x1802
    2c7c:       c0 11 e8 00     l.mtspr r17,r29,0x0
    2c80:       c0 11 c8 00     l.mtspr r17,r25,0x0
    2c84:       d4 01 c8 20     l.sw 32(r1),r25
    2c88:       c0 11 b8 00     l.mtspr r17,r23,0x0
    2c8c:       c0 11 a8 00     l.mtspr r17,r21,0x0
immu_enable:
    2c90:       18 80 00 00     l.movhi r4,0x0
    2c94:       9c 84 3e 40     l.addi r4,r4,15936
    2c98:       a8 60 00 0a     l.ori r3,r0,0xa
    2c9c:       87 02 00 08     l.lwz r24,8(r2)
    2ca0:       04 00 37 a2     l.jal 10b28 <or1k_exception_handler_add>
    2ca4:       e3 18 98 08     l.sll r24,r24,r19
    2ca8:       18 80 00 00     l.movhi r4,0x0
    2cac:       9c 84 40 1c     l.addi r4,r4,16412
    2cb0:       04 00 37 9e     l.jal 10b28 <or1k_exception_handler_add>
    2cb4:       a8 60 00 04     l.ori r3,r0,0x4
    2cb8:       04 00 38 6e     l.jal 10e70 <or1k_immu_enable>  // }
    2cbc:       e2 d8 e0 00     l.add r22,r24,r28
    2cc0:       1a 20 00 00     l.movhi r17,0x0
    2cc4:       e4 38 88 00     l.sfne r24,r17
    2cc8:       0c 00 02 9b     l.bnf 3734 <main+0x1734>
    2ccc:       87 21 00 20     l.lwz r25,32(r1)
    2cd0:       9d f2 ff ff     l.addi r15,r18,-1
    2cd4:       9f 72 00 01     l.addi r27,r18,1
    2cd8:       e3 ee 90 00     l.add r31,r14,r18
    2cdc:       e2 74 90 00     l.add r19,r20,r18
    2ce0:       86 22 00 08     l.lwz r17,8(r2)
    2ce4:       9e 31 ff ff     l.addi r17,r17,-1
    2ce8:       e1 ef 88 03     l.and r15,r15,r17
    2cec:       ab ff 10 00     l.ori r31,r31,0x1000
    2cf0:       e2 3b 88 03     l.and r17,r27,r17
    2cf4:       ab b3 10 00     l.ori r29,r19,0x1000
    2cf8:       ab 39 00 c0     l.ori r25,r25,0xc0
    2cfc:       a9 00 00 02     l.ori r8,r0,0x2
    2d00:       e2 f1 78 02     l.sub r23,r17,r15
    2d04:       85 a2 00 14     l.lwz r13,20(r2)
    2d08:       86 21 00 2c     l.lwz r17,44(r1)
    2d0c:       85 82 00 18     l.lwz r12,24(r2)
    2d10:       84 f1 13 d8     l.lwz r7,5080(r17)
    2d14:       86 a2 00 34     l.lwz r21,52(r2)
    2d18:       aa 36 00 01     l.ori r17,r22,0x1
    2d1c:       c0 1f 88 00     l.mtspr r31,r17,0x0
    2d20:       c0 1d c8 00     l.mtspr r29,r25,0x0
    2d24:       d4 02 00 1c     l.sw 28(r2),r0
    2d28:       d4 02 00 20     l.sw 32(r2),r0
    2d2c:       d4 02 00 38     l.sw 56(r2),r0
    2d30:       d4 02 00 3c     l.sw 60(r2),r0
    2d34:       e4 6d b0 00     l.sfgeu r13,r22
    2d38:       10 00 00 20     l.bf 2db8 <main+0xdb8>
    2d3c:       e4 4c b0 00     l.sfgtu r12,r22
    2d40:       10 00 00 06     l.bf 2d58 <main+0xd58>
    2d44:       9e 36 ff f8     l.addi r17,r22,-8
    2d48:       e4 47 b0 00     l.sfgtu r7,r22
    2d4c:       10 00 00 1c     l.bf 2dbc <main+0xdbc>
    2d50:       1a 20 00 00     l.movhi r17,0x0
    2d54:       9e 36 ff f8     l.addi r17,r22,-8
    2d58:       48 00 88 00     l.jalr r17
    2d5c:       15 00 00 00      l.nop 0x0

In our trace we see:

- at 00002c84, store the value of ta, 0041c000, to the stack at 0x0020(r1) / [007fdfb4]
- at 00002ccc, load it back from the stack at 0x0020(r1) / 007fdfb4 into r25
- at 00002cf8, we see the value of r25 is 001122f3, completely different! It should be 0041c000
- at 00002d1c, set up ITLBW_MR with virtual address 0009c001
- at 00002d20, set up the ITLBW_TR mapping; here we use the bad physical address 001122f3
- at 00002d58, jump to 9bff8, a bad address, where the system hangs. We should be getting a bus error.

Essentially the store at PC 2c84 or load at 2ccc is corrupt.

In C code it would be:

    mtspr(OR1K_SPR_DCACHE_DCBFR_ADDR, ta - 8);
    mtspr(OR1K_SPR_DCACHE_DCBFR_ADDR, ta);
    // ta gets corrupted around here as temp store fails
    mtspr(OR1K_SPR_DCACHE_DCBFR_ADDR, ta + PAGE_SIZE - 8);
    mtspr(OR1K_SPR_DCACHE_DCBFR_ADDR, ta + PAGE_SIZE);

    mtspr (OR1K_SPR_IMMU_ITLBW_MR_ADDR(way, set), ea | OR1K_SPR_IMMU_ITLBW_MR_V_MASK);
    mtspr (OR1K_SPR_IMMU_ITLBW_TR_ADDR(way, set), ta | ITLB_PR_NOLIMIT);
    call (ea - 8);
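
To make the expected register values concrete, the two TLB writes can be worked through by hand. This is a sketch only: the mask values are inferred from the immediates in the disassembly (`l.ori r17,r22,0x1` at 2d18 and `l.ori r25,r25,0xc0` at 2cf8), not taken from the or1k-tests headers, and the macro and function names are placeholders.

```c
#include <stdint.h>

/* Values inferred from the disassembly, not from the test's headers. */
#define ITLBW_MR_V_MASK 0x1u  /* from l.ori r17,r22,0x1 at 2d18 */
#define ITLB_PR_NOLIMIT 0xc0u /* from l.ori r25,r25,0xc0 at 2cf8 */

/* Value written to ITLBW_MR: effective address with the valid bit set. */
static uint32_t itlbw_mr_value(uint32_t ea)
{
    return ea | ITLBW_MR_V_MASK;
}

/* Value written to ITLBW_TR: target address with the protection bits set. */
static uint32_t itlbw_tr_value(uint32_t ta)
{
    return ta | ITLB_PR_NOLIMIT;
}
```

With ea = 0x0009c000 the match register value is 0x0009c001, as seen in the trace. With ta = 0x0041c000 the translate register value would be 0x0041c0c0, whereas the trace shows 0x001122f3 being written, so r25 was already wrong before the `l.ori`.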

ta
S 00002c38: d7f987f8 l.sw    0xfff8(r25),r16 [0041bff8] = 44004800  flag: 0
S 00002c3c: 1a201500 l.movhi r17,0x1500      r17        = 15000000  flag: 0   
S 00002c40: d7f98ffc l.sw    0xfffc(r25),r17 [0041bffc] = 15000000  flag: 0   
S 00002c44: d4198000 l.sw    0x0000(r25),r16 [0041c000] = 44004800  flag: 0   
S 00002c48: d4198804 l.sw    0x0004(r25),r17 [0041c004] = 15000000  flag: 0   

ta + PAGE_SIZE                                                                
S 00002c4c: d47987f8 l.sw    0x1ff8(r25),r16 [0041dff8] = 44004800  flag: 0   
S 00002c50: d4798ffc l.sw    0x1ffc(r25),r17 [0041dffc] = 15000000  flag: 0   
S 00002c54: 1aa00000 l.movhi r21,0x0000      r21        = 00000000  flag: 0   
S 00002c58: 9eb53acc l.addi  r21,r21,0x3acc  r21        = 00003acc  flag: 0   
S 00002c5c: d4998000 l.sw    0x2000(r25),r16 [0041e000] = 44004800  flag: 0   

    part of reset_tlb_handler_config, strange but ok 3acc is tlb_default_set_translate
    S 00002c60: d402a840 l.sw    0x0040(r2),r21  [00018fb0] = 00003acc  flag: 0
    S 00002c64: d402a824 l.sw    0x0024(r2),r21  [00018f94] = 00003acc  flag: 0

S 00002c68: d4998804 l.sw    0x2004(r25),r17 [0041e004] = 15000000  flag: 0   

S 00002c6c: 9fb9fff8 l.addi  r29,r25,0xfff8  r29        = 0041bff8  flag: 0   
S 00002c70: 9ef91ff8 l.addi  r23,r25,0x1ff8  r23        = 0041dff8  flag: 0   
S 00002c74: 9eb92000 l.addi  r21,r25,0x2000  r21        = 0041e000  flag: 0   
S 00002c78: aa201802 l.ori   r17,r0,0x1802   r17        = 00001802  flag: 0   

Flush D-Cache                                                                 
S 00002c7c: c011e800 l.mtspr r17,r29,0x0000  SPR[1802]  = 0041bff8  flag: 0   
S 00002c80: c011c800 l.mtspr r17,r25,0x0000  SPR[1802]  = 0041c000  flag: 0   
S 00002c84: d401c820 l.sw    0x0020(r1),r25  [007fdfb4] = 0041c000  flag: 0     *
S 00002c88: c011b800 l.mtspr r17,r23,0x0000  SPR[1802]  = 0041dff8  flag: 0   
S 00002c8c: c011a800 l.mtspr r17,r21,0x0000  SPR[1802]  = 0041e000  flag: 0   

call to immu_enable ()

S 00002cc0: 1a200000 l.movhi r17,0x0000      r17        = 00000000  flag: 0
S 00002cc4: e4388800 l.sfne  r24,r17                                flag: 1
 S 00002cc8: 0c00029b l.bnf   0x000029b                             flag: 1

S 00002ccc: 87210020 l.lwz   r25,0x0020(r1)  r25        = 007fdfb4  flag: 1 **
S 00002cd0: 9df2ffff l.addi  r15,r18,0xffff  r15        = 0000000d  flag: 1
S 00002cd4: 9f720001 l.addi  r27,r18,0x0001  r27        = 0000000f  flag: 1
S 00002cd8: e3ee9000 l.add   r31,r14,r18     r31        = 0000020e  flag: 1
S 00002cdc: e2749000 l.add   r19,r20,r18     r19        = 0000028e  flag: 1
S 00002ce0: 86220008 l.lwz   r17,0x0008(r2)  r17        = 00018f78  flag: 1
S 00002ce4: 9e31ffff l.addi  r17,r17,0xffff  r17        = 00000040  flag: 1
S 00002ce4: 9e31ffff l.addi  r17,r17,0xffff  r17        = 0000003f  flag: 1
S 00002ce8: e1ef8803 l.and   r15,r15,r17     r15        = 0000000d  flag: 1
S 00002cec: abff1000 l.ori   r31,r31,0x1000  r31        = 0000120e  flag: 1
S 00002cf0: e23b8803 l.and   r17,r27,r17     r17        = 0000000f  flag: 1
S 00002cf4: abb31000 l.ori   r29,r19,0x1000  r29        = 0000128e  flag: 1
S 00002cf8: ab3900c0 l.ori   r25,r25,0x00c0  r25        = 001122f3  flag: 1  ***
S 00002cfc: a9000002 l.ori   r8,r0,0x0002    r8         = 00000002  flag: 1
S 00002d00: e2f17802 l.sub   r23,r17,r15     r23        = 00000002  flag: 1
S 00002d04: 85a20014 l.lwz   r13,0x0014(r2)  r13        = 00018f84  flag: 1
S 00002d08: 8621002c l.lwz   r17,0x002c(r1)  r17        = 007fdfc0  flag: 1
S 00002d0c: 85820018 l.lwz   r12,0x0018(r2)  r12        = 00018f88  flag: 1
S 00002d10: 84f113d8 l.lwz   r7,0x13d8(r17)  r7         = 000113d8  flag: 1
S 00002d14: 86a20034 l.lwz   r21,0x0034(r2)  r21        = 00018fa4  flag: 1
S 00002d18: aa360001 l.ori   r17,r22,0x0001  r17        = 0009c001  flag: 1
S 00002d1c: c01f8800 l.mtspr r31,r17,0x0000  SPR[120e]  = 0009c001  flag: 1
S 00002d20: c01dc800 l.mtspr r29,r25,0x0000  SPR[128e]  = 001122f3  flag: 1

reset_tlb_miss_counts:
    S 00002d24: d402001c l.sw    0x001c(r2),r0   [00018f8c] = 00000000  flag: 1
    S 00002d28: d4020020 l.sw    0x0020(r2),r0   [00018f90] = 00000000  flag: 1
    S 00002d2c: d4020038 l.sw    0x0038(r2),r0   [00018fa8] = 00000000  flag: 1
    S 00002d30: d402003c l.sw    0x003c(r2),r0   [00018fac] = 00000000  flag: 1

S 00002d34: e46db000 l.sfgeu r13,r22                                flag: 0
S 00002d38: 10000020 l.bf    0x0000020                              flag: 0
S 00002d3c: e44cb000 l.sfgtu r12,r22                                flag: 1
S 00002d40: 10000006 l.bf    0x0000006                              flag: 1

call (ea - 8);
S 00002d44: 9e36fff8 l.addi  r17,r22,0xfff8  r17        = 0009bff8  flag: 1
S 00002d58: 48008800 l.jalr  r17                                    flag: 1
S 00002d58: 48008800 l.jalr  r17                                    flag: 1
S 00002d5c: 15000000 l.nop   0x0000                                 flag: 1

BAD ADDRESS DEAD FROM HERE!!!!
S 0009bff8: 14000000 l.nop   0x0000                                 flag: 1

stffrdhrn commented 2 years ago

Uploaded VCD and GTKWave save files showing the test failure:

mor1kx-or1k-mmu-146.zip

In the below screenshot we can see:

- in dcache, cpu_we_i goes high to indicate an upcoming write
- cpu_req_i goes high to indicate the write request
- hit is high, indicating there is an entry that needs to be updated
- way.0 and way.1 we never goes high, indicating the write was LOST

Screenshot from 2022-03-07 14-49-37

stffrdhrn commented 2 years ago

With the latest commit:

Test suite or1k-mmu is now passing with back-to-back mtspr instructions.
Linux is also stable now: I am able to run the glibc test suite for several hours.
However, the formal checks are not all passing yet:
- mor1kx_dcache - passing
- mor1kx_lsu_cappuccino - failing
- mor1kx_cpu_cappuccino - failing
- mor1k - failing

It will take a bit more time to complete the formal verification.

openrisc / mor1kx

Linux crashes with latest master #146