luna88k: Data access fault (Write Violation)

tsutsui commented 2 years ago

Looks like it is triggered on login attempts, but it is not 100% reproducible:

Sun Aug  7 06:54:46 UTC 2022

NetBSD/luna88k (lunarian) (console)

login: root
Aug  7 07:00:52 lunarian login: ROOT LOGIN (root) ON console

Data access fault (Write Violation) v = 0x13b2d8, frame 0x6889b58
R00-05: 0x00000000  0x000bbdc0  0xffffd188  0x062caa00  0x000001f8  0x726f6f74
R06-11: 0x00000000  0x00000004  0x00000000  0xf9d37b88  0x00000000  0x00000998
R12-17: 0x060627d0  0x00000001  0x00000000  0x06889e98  0x00100000  0x0606c1a4
R18-23: 0x00000000  0x000b0000  0x060627d0  0x062caa00  0x00000200  0x06889f10
R24-29: 0x06889f30  0x00000200  0x00000000  0x00000000  0x00000000  0x00000000
R30-31: 0x06888058  0x06889c58
sxip 13b2da snip 13b2de sfip 13b2e2
dmt0 42bf dmd0 726f6f74 dma0 62caa00
dmt1 4be dmd1 40 dma1 cff94
dmt2 0 dmd2 ccccccc dma2 ccccccc
fault type 7
[DMT0=42bf: st.s 726f6f74 to 62caa00 as 15 not double not xmem]
fpsr 1081c8 fpcr 0 epsr 900003f0 ssbr 0
fpecr 0 fphs1 abd34 fpls1 0 fphs2 8a00 fpls2 6888058
fppt 7fb88 fprh 105658 fprl 606b4d0 fpit 0
vector 3 mask 0 mode 4 scratch1 173d74 cpu 0x1a3480
panic: Data Access Exception
Stopped in pid 275.1 (login) at netbsd:cpu_Debugger+0x4:                tb0     0
, r0, 0x84
db> bt
stack base = 0x6889a00
(0) netbsd:cpu_Debugger+0x4
(1) netbsd:panic+0x170(?, 0x17, 0, 0, 0, 0x668, 0, 37c)
(2) netbsd:panictrap+0x44
(3) netbsd:m88100_trap+0x324
(4) netbsd:disklabel_bsd_to_om+0xb10
db>

v = 0x13b2d8 in the kernel appears to be inside copyin_right_aligned_to_doubleword():

0013b2b0 <copyin_right_aligned_to_doubleword>:
  13b2b0:       f4 00 58 00     or          r0,r0,r0
  13b2b4:       f4 a2 15 00     ld.usr      r5,r2,r0
  13b2b8:       f4 00 58 00     or          r0,r0,r0
  13b2bc:       f4 00 58 00     or          r0,r0,r0
  13b2c0:       f4 00 58 00     or          r0,r0,r0
  13b2c4:       f4 c2 15 07     ld.usr      r6,r2,r7
  13b2c8:       f4 00 58 00     or          r0,r0,r0
  13b2cc:       f4 00 58 00     or          r0,r0,r0
  13b2d0:       f4 00 58 00     or          r0,r0,r0
  13b2d4:       64 84 00 08     subu        r4,r4,0x08
  13b2d8:       f4 a3 24 00     st          r5,r3,r0
  13b2dc:       60 42 00 08     addu        r2,r2,0x08
  13b2e0:       f4 c3 24 07     st          r6,r3,r7
  13b2e4:       ed a4 ff f3     bcnd.n      ne0,r4,13b2b0 <copyin_right_aligned_to_doubleword>
  13b2e8:       60 63 00 08     addu        r3,r3,0x08
  13b2ec:       c4 00 00 3b     br.n        13b3d8 <Lcidone>
  13b2f0:       f4 40 58 00     or          r2,r0,r0

(should be confirmed again)
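As a cross-check of the address match above, here is a minimal sketch in C that strips the status bits from the saved instruction pointers in the dump and computes the offset into the routine. It assumes the usual m88100 layout for SXIP/SNIP/SFIP (bit 1 = valid, bit 0 = exception), which is not stated in this thread; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed m88100 SXIP/SNIP/SFIP layout: bit 1 = valid, bit 0 = exception. */
#define XIP_V    0x00000002u
#define XIP_E    0x00000001u
#define XIP_ADDR 0xfffffffcu

/* Entry address of copyin_right_aligned_to_doubleword, from the disassembly. */
#define COPYIN_RATD_START 0x0013b2b0u

/* Instruction address held in a shadow instruction pointer. */
uint32_t xip_addr(uint32_t xip)
{
    return xip & XIP_ADDR;
}

/* Nonzero when the shadow pointer holds a valid instruction address. */
int xip_valid(uint32_t xip)
{
    return (xip & XIP_V) != 0;
}

/* Byte offset of addr from the start of the routine (hypothetical helper). */
uint32_t copyin_ratd_offset(uint32_t addr)
{
    assert(addr >= COPYIN_RATD_START);
    return addr - COPYIN_RATD_START;
}

/* Print the decoded pipeline state for one trap dump. */
void show_pipeline(uint32_t sxip, uint32_t snip, uint32_t sfip)
{
    printf("sxip 0x%x valid=%d offset=0x%x\n", (unsigned)xip_addr(sxip),
        xip_valid(sxip), (unsigned)copyin_ratd_offset(xip_addr(sxip)));
    printf("snip 0x%x valid=%d\n", (unsigned)xip_addr(snip), xip_valid(snip));
    printf("sfip 0x%x valid=%d\n", (unsigned)xip_addr(sfip), xip_valid(sfip));
}
```

Calling show_pipeline(0x13b2da, 0x13b2de, 0x13b2e2) with the values from the first dump gives 0x13b2d8 for sxip, i.e. offset 0x28 into the routine, which is the `st r5,r3,r0` line in the listing.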

tsutsui commented 2 years ago

It seems to be triggered by write(2) system calls against a file system?

syscall: 293
syscall: 116
syscall: 121
syscall: 4

Data access fault (Write Violation) v = 0x13ae48, frame 0x6888b48
R00-05: 0x00000000  0x000bbbb0  0xffffd188  0x062c2e00  0x000001f8  0x726f6f74
R06-11: 0x00000000  0x00000004  0x00000000  0xf9d3ff88  0x00000000  0x00000998
R12-17: 0x060627d0  0x00000001  0x00000000  0x06888e88  0x00100000  0x0606c1a4
R18-23: 0x00000000  0x000b0000  0x060627d0  0x062c2e00  0x00000200  0x06888f00
R24-29: 0x06888f20  0x00000200  0x00000000  0x00000000  0x00000000  0x00000000
R30-31: 0x06887058  0x06888c48
sxip 13ae4a snip 13ae4e sfip 13ae52
dmt0 42bf dmd0 726f6f74 dma0 62c2e00
dmt1 4be dmd1 40 dma1 cff94
dmt2 0 dmd2 ccccccc dma2 ccccccc
fault type 7
[DMT0=42bf: st.s 726f6f74 to 62c2e00 as 15 not double not xmem]
fpsr 100000 fpcr 0 epsr 900003f0 ssbr 0
fpecr 0 fphs1 abb84 fpls1 0 fphs2 ae00 fpls2 6887058
fppt 7fad8 fprh 105358 fprl 606b4d0 fpit 0
vector 3 mask 0 mode 4 scratch1 1738d4 cpu 0x1a3480
panic: Data Access Exception
Stopped in pid 274.1 (login) at netbsd:cpu_Debugger+0x4:                tb0     0
, r0, 0x84
db>

Maybe we should also check vfsops or buf pages etc.
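The DMT0 line in both dumps can be decoded by hand to see what the faulting store was doing. The sketch below assumes the MC88100 DMTx bit layout as commonly defined in m88k kernel headers (valid, write, byte enables, destination register, xmem, double, address space); those masks are an assumption and are not taken from this thread. It also renders the stored word as ASCII, most significant byte first.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed MC88100 DMTx bit layout (not taken from this thread). */
#define DMT_DAS        0x00004000u /* 1 = supervisor address space */
#define DMT_DOUB1      0x00002000u /* first half of a double-word access */
#define DMT_LOCKBAR    0x00001000u /* bus lock, i.e. xmem */
#define DMT_DREG       0x00000f80u /* register involved in the transaction */
#define DMT_EN         0x0000003cu /* byte enables */
#define DMT_WRITE      0x00000002u /* 1 = store */
#define DMT_VALID      0x00000001u /* transaction is pending */
#define DMT_DREG_SHIFT 7
#define DMT_EN_SHIFT   2

struct dmt_info {
    int valid, write, supervisor, dbl, xmem;
    unsigned dreg, en;
};

/* Split a DMT register value into its fields. */
struct dmt_info dmt_decode(uint32_t dmt)
{
    struct dmt_info d;

    d.valid = (dmt & DMT_VALID) != 0;
    d.write = (dmt & DMT_WRITE) != 0;
    d.supervisor = (dmt & DMT_DAS) != 0;
    d.dbl = (dmt & DMT_DOUB1) != 0;
    d.xmem = (dmt & DMT_LOCKBAR) != 0;
    d.dreg = (dmt & DMT_DREG) >> DMT_DREG_SHIFT;
    d.en = (dmt & DMT_EN) >> DMT_EN_SHIFT;
    return d;
}

/* Render a 32-bit word as four characters, most significant byte first. */
void word_to_ascii(uint32_t w, char out[5])
{
    int i;

    assert(out != NULL);
    for (i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)(w >> (24 - 8 * i));
        out[i] = isprint(c) ? (char)c : '.';
    }
    out[4] = '\0';
}
```

Under these assumptions, dmt0 = 0x42bf is a valid supervisor store of r5 with all four byte enables set, and dmd0 = 0x726f6f74 reads as "root", which would fit login writing the user name to a file.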

tsutsui commented 2 years ago

Judging from the stack addresses, the fault seems to be triggered via ffs_write() -> uiomove() -> copyin() -> copyin_right_aligned_to_doubleword().

It doesn't help (i.e. the access fault still occurs) to make copyin() always use copyin_byte_only().

The following test diff changes the fault address from the copyin() variants to memset(), so the buffer pages returned from ubc_alloc() are not writable or not properly mapped?

diff --git a/sys/ufs/ufs/ufs_readwrite.c b/sys/ufs/ufs/ufs_readwrite.c
index e862f1bf9b90..ac1c4b37e006 100644
--- a/sys/ufs/ufs/ufs_readwrite.c
+++ b/sys/ufs/ufs/ufs_readwrite.c
@@ -353,6 +353,7 @@ WRITE(void *v)

        win = ubc_alloc(&vp->v_uobj, uio->uio_offset, &bytelen,
            ubc_alloc_flags);
+memset(win, 0, bytelen);
        error = uiomove(win, bytelen, uio);
        if (error && extending) {
            /*

Smells like an MD pmap issue.

tsutsui commented 2 years ago

Now I can reproduce the panic at the same va:

Data access fault (Write Violation) v = 0x17c768, frame 0x688eb68
R00-05: 0x00000000  0x0007fb40  0x060d2000  0x00000000  0x00000c00  0x00000000
R06-11: 0x00000000  0x060d2000  0x000002ff  0x00000004  0x00000001  0x00000c00
R12-17: 0x00000000  0x00000000  0x00000000  0x0688ee68  0x0607f294  0x00000002
R18-23: 0x00000000  0x00000000  0x0607f290  0x00000c00  0x0688eee0  0x0607ae70
R24-29: 0x00000000  0x060d2000  0x00000000  0x00000000  0x00000000  0x00000000
R30-31: 0x0688d058  0x0688ec68
sxip 17c76a snip 17c76e sfip 17c772
dmt0 433f dmd0 0 dma0 60d2000
dmt1 4bc dmd1 ffffc150 dma1 de004
dmt2 0 dmd2 ccccccc dma2 ccccccc
fault type 7
[DMT0=433f: st.s 0 to 60d2000 as 15 not double not xmem]
fpsr 607ae70 fpcr 7f6e8 epsr 900003f0 ssbr 1053a8
fpecr 0 fphs1 d800 fpls1 0 fphs2 0 fpls2 7fb14
fppt 688ee68 fprh 607f294 fprl 2 fpit 0
vector 3 mask 0 mode 4 scratch1 172864 cpu 0x1a2480
panic: Data Access Exception
db>

In that case, va=0x60d2000 is mapped by the following call:

pmap_enter(0x1a2700, 60d2000, 3b11000, 3, 22)

i.e. pmap_enter(9) is called with prot=VM_PROT_READ | VM_PROT_WRITE and flags=PMAP_CANFAIL | VM_PROT_WRITE.

Inconsistent prot and flags seem wrong, but why would this cause a Write Violation!?
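To make the decoding of the last two arguments explicit, here is a small C sketch that renders them as symbolic names. It assumes the debug output prints them in hex and that the constants have the values used in NetBSD's uvm headers (VM_PROT_READ = 0x01, VM_PROT_WRITE = 0x02, VM_PROT_EXECUTE = 0x04, PMAP_WIRED = 0x10, PMAP_CANFAIL = 0x20); they are redefined locally here so the snippet builds outside the kernel. The describe_* helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Values assumed to mirror NetBSD's uvm headers; redefined locally. */
#define VM_PROT_READ    0x01u
#define VM_PROT_WRITE   0x02u
#define VM_PROT_EXECUTE 0x04u
#define PMAP_WIRED      0x10u
#define PMAP_CANFAIL    0x20u

struct bitname {
    unsigned bit;
    const char *name;
};

/* Append one name to buf, separated by " | " when buf is not empty. */
static void append_name(char *buf, size_t len, const char *name)
{
    size_t used = strlen(buf);

    if (used + 1 < len)
        snprintf(buf + used, len - used, "%s%s", used ? " | " : "", name);
}

/* Render every bit of v that appears in table t. */
static void render(unsigned v, const struct bitname *t, size_t n,
    char *buf, size_t len)
{
    size_t i;

    assert(len > 0);
    buf[0] = '\0';
    for (i = 0; i < n; i++)
        if (v & t[i].bit)
            append_name(buf, len, t[i].name);
    if (buf[0] == '\0')
        snprintf(buf, len, "0");
}

/* Symbolic form of the prot argument. */
void describe_prot(unsigned prot, char *buf, size_t len)
{
    static const struct bitname t[] = {
        { VM_PROT_READ, "VM_PROT_READ" },
        { VM_PROT_WRITE, "VM_PROT_WRITE" },
        { VM_PROT_EXECUTE, "VM_PROT_EXECUTE" },
    };

    render(prot, t, sizeof(t) / sizeof(t[0]), buf, len);
}

/* Symbolic form of the flags argument (PMAP_* bits plus access type). */
void describe_flags(unsigned flags, char *buf, size_t len)
{
    static const struct bitname t[] = {
        { PMAP_WIRED, "PMAP_WIRED" },
        { PMAP_CANFAIL, "PMAP_CANFAIL" },
        { VM_PROT_READ, "VM_PROT_READ" },
        { VM_PROT_WRITE, "VM_PROT_WRITE" },
        { VM_PROT_EXECUTE, "VM_PROT_EXECUTE" },
    };

    render(flags, t, sizeof(t) / sizeof(t[0]), buf, len);
}
```

With prot = 0x3 and flags = 0x22 this yields the same reading as above: read/write protection, and a write access with PMAP_CANFAIL set.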

tsutsui commented 2 years ago

The pmap_enter() is called from ubc_fault():

pmap_enter(0x1a2700, 60d2000, 3b10000, 3, 22)
panic: pmap_enter at 0x60d2000
Stopped in pid 14.1 (vi) at     netbsd:cpu_Debugger+0x4:                tb0     0
, r0, 0x84
db> bt
stack base = 0x688e850
(0) netbsd:cpu_Debugger+0x4(stackless)
(1) netbsd:panic+0x170(?, 0xd, 0, 0, 0x84000000, 0x65000000, 18e434, 18e444)
(2) netbsd:pmap_enter+0x280
(3) netbsd:ubc_fault+0x2b8
(4)?0x688ea1c
db>

tsutsui commented 2 years ago

pmap_enter(9) man page says:

           int pmap_enter(pmap_t pmap, vaddr_t va, paddr_t pa, vm_prot_t prot,
                   u_int flags)
                   Create a mapping in physical map pmap for the physical
                   address pa at the virtual address va with protection
                   specified by bits in prot:

                         VM_PROT_READ       The mapping must allow reading.

                         VM_PROT_WRITE      The mapping must allow writing.

                         VM_PROT_EXECUTE    The page mapped contains
                                            instructions that will be executed
                                            by the processor.

                   The flags argument contains protection bits (the same bits
                   as used in the prot argument) indicating the type of access
                   that caused the mapping to be created.  This information
                   may be used to seed modified/referenced information for the
                   page being mapped, possibly avoiding redundant faults on
                   platforms that track modified/referenced information in
                   software.

So passing prot = VM_PROT_READ | VM_PROT_WRITE and flags = PMAP_CANFAIL | VM_PROT_WRITE is a valid operation.

Anyway, the Data access fault (Write Violation) still occurs even if VM_PROT_READ is added to flags, so we should check whether the PTE for the va is (unintentionally) modified after the pmap_enter(9) call.
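The man page wording can be read as two rules: the access type in flags must be covered by prot, and it may be used to pre-set the referenced/modified bits. The following C sketch models that reading only; it is not the m88k pmap code. The PG_* names follow the ones used elsewhere in this thread, but their numeric values here are placeholders, and seed_pte() is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

/* Protection bits, assumed to match NetBSD's values. */
#define VM_PROT_READ    0x01u
#define VM_PROT_WRITE   0x02u
#define VM_PROT_EXECUTE 0x04u
#define VM_PROT_ALL     0x07u
#define PMAP_CANFAIL    0x20u

/* PTE bit names as used in this thread; the values are placeholders. */
#define PG_V  0x01u /* valid */
#define PG_RW 0x02u /* writable */
#define PG_U  0x04u /* used (referenced) */
#define PG_M  0x08u /* modified */

/* Access type that caused the mapping, i.e. the low bits of flags. */
unsigned access_type(unsigned flags)
{
    return flags & VM_PROT_ALL;
}

/* The access type must not ask for more than prot allows. */
int access_within_prot(unsigned prot, unsigned flags)
{
    return (access_type(flags) & ~prot) == 0;
}

/* Hypothetical model of seeding a PTE from prot and flags. */
unsigned seed_pte(unsigned prot, unsigned flags)
{
    unsigned pte = PG_V;

    assert(access_within_prot(prot, flags));
    if (prot & VM_PROT_WRITE)
        pte |= PG_RW;
    if (access_type(flags) != 0)
        pte |= PG_U;    /* any access marks the page referenced */
    if (flags & VM_PROT_WRITE)
        pte |= PG_M;    /* a write access marks it modified */
    return pte;
}
```

In this model, prot = 3 with flags = 0x22 seeds PG_M|PG_U|PG_RW|PG_V, and adding VM_PROT_READ to flags changes nothing, which is at least consistent with the fault not going away when that was tried.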

tsutsui commented 2 years ago

current status

the page that causes the Write Violation is allocated by ubc_fault() in sys/uvm/uvm_bio.c https://github.com/tsutsui/netbsd-src/blob/50aefa500ed0bb6425721f724205bd28dd95a3b7/sys/uvm/uvm_bio.c#L370-L371
the Write Violation is caused by reference via ubc_alloc() in sys/ufs/ufs/ufs_readwrite.c https://github.com/tsutsui/netbsd-src/blob/50aefa500ed0bb6425721f724205bd28dd95a3b7/sys/ufs/ufs/ufs_readwrite.c#L354-L355
the page is first allocated as VM_PROT_READ | VM_PROT_WRITE, so the PTE is set as PG_M|PG_U|PG_RW|PG_V, but PG_RW and PG_M are cleared later by pmap_changebit(), called from pmap_page_protect(), called from genfs_putpage() in sys/miscfs/genfs/genfs_vnops.c https://github.com/tsutsui/netbsd-src/blob/50aefa500ed0bb6425721f724205bd28dd95a3b7/sys/miscfs/genfs/genfs_vnops.c#L1297-L1299

Smells like some inconsistency between MI UBC and the MD m88k pmap, as in #5, but more investigation is needed into what UBC and genfs actually do.
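The sequence described in the status list can be written down as a tiny state model in C, to show why the later store ends up as a write violation. This is a sketch under stated assumptions: the PG_* values are placeholders, and model_changebit() is a hypothetical stand-in that applies a set/mask pair to a single PTE, not the real pmap_changebit().

```c
#include <assert.h>
#include <stdio.h>

/* PTE bit names as used in this thread; the values are placeholders. */
#define PG_V  0x01u /* valid */
#define PG_RW 0x02u /* writable */
#define PG_U  0x04u /* used (referenced) */
#define PG_M  0x08u /* modified */

/* Hypothetical stand-in: set some bits, then keep only those in mask. */
unsigned model_changebit(unsigned pte, unsigned set, unsigned mask)
{
    return (pte | set) & mask;
}

/* What the write-protect step is reported to do: clear PG_RW and PG_M. */
unsigned write_protect(unsigned pte)
{
    return model_changebit(pte, 0, ~(PG_RW | PG_M));
}

/* A store succeeds only on a valid, writable mapping. */
int store_allowed(unsigned pte)
{
    return (pte & PG_V) != 0 && (pte & PG_RW) != 0;
}

/* Walk through the reported sequence and print each state. */
void show_sequence(void)
{
    unsigned pte = PG_M | PG_U | PG_RW | PG_V; /* after pmap_enter() */

    printf("after enter:   pte=0x%x store_allowed=%d\n", pte,
        store_allowed(pte));
    pte = write_protect(pte);                  /* after page protect */
    printf("after protect: pte=0x%x store_allowed=%d\n", pte,
        store_allowed(pte));
}
```

After the protect step the model PTE is PG_U|PG_V, still valid but no longer writable, so a store through the window faults. Whether the trap path is expected to recover from that fault is the open question here.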

tsutsui / netbsd-src

luna88k: Data access fault (Write Violation) #12
