the-machine-hall / openbsd-sgi

This is an effort to keep the sgi port of OpenBSD alive.

GENERIC-IP28 broken on OpenBSD/sgi 7.3 #2

Open johnny-mnemonic opened 1 year ago

johnny-mnemonic commented 1 year ago

Something broke between OpenBSD/sgi 7.2 and 7.3 for IP22 (tested on Indy) and IP28 (R10000 Indigo²).

login: [...]
indy# 7za b -md=2m
panic: kernel diagnostic assertion "!ISSET(bp->b_flags, B_DMA)" failed: file "/
usr/src/sys/kern/vfs_bio.c", line 388
Stopped at      db_enter+0x4:   jr      ra
db_enter+0x8:    nop
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*521961  35108      0         0x3          0    0  7za
db_enter+0x4 (56daa729608a8de9,900000001fbd9880,900000001fbd9830,0)  ra 0xfffff
fff88972788 sp 0xffffffff88460cb0, sz 0
panic+0x178 (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,ffffffff88bae0b
0)  ra 0xffffffff88974578 sp 0xffffffff88460cb0, sz 112
__assert+0x38 (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,ffffffff88bae
0b0)  ra 0xffffffff88b319b4 sp 0xffffffff88460d20, sz 32
buf_flip_high+0x2d4 (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,fffffff
f88bae0b0)  ra 0xffffffff88b3274c sp 0xffffffff88460d40, sz 48
bufcache_recover_dmapages+0xec (56daa729608a8de9,ffffffff88b46380,ffffffff88bae
2a8,ffffffff88bae0b0)  ra 0xffffffff88b32be0 sp 0xffffffff88460d70, sz 96
bufadjust+0x168 (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,ffffffff88b
ae0b0)  ra 0xffffffff88b34c24 sp 0xffffffff88460dd0, sz 48
bufbackoff+0xcc (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,ffffffff88b
ae0b0)  ra 0xffffffff88a6dff8 sp 0xffffffff88460e00, sz 80
buf_realloc_pages+0x108 (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,fff
fffff88bae0b0)  ra 0xffffffff88b31778 sp 0xffffffff88460e50, sz 96
buf_flip_high+0x98 (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,ffffffff
88bae0b0)  ra 0xffffffff88b3274c sp 0xffffffff88460eb0, sz 48
bufcache_recover_dmapages+0xec (56daa729608a8de9,ffffffff88b46380,ffffffff88bae
2a8,ffffffff88bae0b0)  ra 0xffffffff88b32be0 sp 0xffffffff88460ee0, sz 96
bufadjust+0x168 (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,ffffffff88b
ae0b0)  ra 0xffffffff88b34c24 sp 0xffffffff88460f40, sz 48
bufbackoff+0xcc (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,ffffffff88b
ae0b0)  ra 0xffffffff88a6dff8 sp 0xffffffff88460f70, sz 80
buf_realloc_pages+0x108 (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,fff
fffff88bae0b0)  ra 0xffffffff88b31778 sp 0xffffffff88460fc0, sz 96
buf_flip_high+0x98 (56daa729608a8de9,ffffffff88b46380,ffffffff88bae2a8,ffffffff
88bae0b0)  ra 0xffffffff88b3274c sp 0xffffffff88461020, sz 48
User-level: pid 35108
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb>

  See [full log](https://dmesgd.nycbug.org/index.cgi?do=view&id=7004) for more details.

* [The IP28 kernel tested](https://github.com/the-machine-hall/openbsd-src/releases/download/openbsd.73.sgi/bsd.IP28.openbsd.73.sgi) has the workaround from #1 applied but breaks even earlier during kernel boot:

[...]
OpenBSD 7.3 (GENERIC-IP28) #0: Thu Mar 30 17:57:07 CEST 2023
    root@octane.machine-hall.org:/usr/src/sys/arch/sgi/compile/GENERIC-IP28
real mem = 268435456 (256MB)
rsvd mem = 1064960 (2MB)
avail mem = 259670016 (247MB)
warning: no entropy supplied by boot loader
random: boothowto does not indicate good seed
mainbus0 at root: POWER Indigo2 R10000
cpu0 at mainbus0: MIPS R10000 CPU rev 2.5 194 MHz, R10000 FPU rev 0.0
cpu0: cache L1-I 32KB D 32KB 2 way, L2 1024KB 2 way
[...]
dsclock0 at hpc0 offset 0x00060000
eisa0 at imc0 irq 27

Trap cause = 13 Frame 0x9800000020007e18
Trap PC 0xa8000000200991d8 RA 0xa80000002029b4fc fault 0xc000000000808648
0xa8000000200990e0 (a8000000204a85c0,28,0,0)  ra 0xa80000002029b4fc sp 0x9800000020007f70, sz 0
0xa80000002029ab28 (a8000000204a85c0,28,0,0)  ra 0x0 sp 0x9800000020007f70, sz 0
User-level: pid 0
stopped on non ddb fault
Stopped at 0xa8000000200991d8: teq v1,zero
ddb>


See [full log](https://dmesgd.nycbug.org/index.cgi?do=view&id=7005) for more details.

johnny-mnemonic commented 1 year ago

The problem for IP22 also happens with non-ports binaries:

indy# ls /bin
[          cp         domainname kill       mt         rm         sleep
cat        cpio       echo       ksh        mv         rmdir      stty
chgrp      csh        ed         ln         pax        sh         sync
chio       date       eject      ls         ps         sha1       tar
chmod      dd         expr       md5        pwd        sha256     test
cksum      df         hostname   mkdir      rksh       sha512
indy# ls
.Xdefaults .config    .cshrc     .cvsrc     .login     .profile   .ssh
indy# ls /
.cshrc     bin        dev        mnt        swap       usr
.profile   bsd        etc        root       sys        var
altroot    bsd.booted home       sbin       tmp
indy# time sha256 /bsd
panic: kernel diagnostic assertion "!ISSET(bp->b_flags, B_DMA)" failed: file "/
usr/src/sys/kern/vfs_bio.c", line 388
Stopped at      db_enter+0x4:   jr      ra
db_enter+0x8:    nop
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
*473577  31906      0    0x100003          0    0  sha256
db_enter+0x4 (56daa729608a8de9,900000001fbd9880,900000001fbd9830,0)  ra 0xfffff
fff889393d8 sp 0xffffffff8fc38ed0, sz 0
panic+0x178 (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,ffffffff88b488d
0)  ra 0xffffffff8893b1d8 sp 0xffffffff8fc38ed0, sz 112
__assert+0x38 (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,ffffffff88b48
8d0)  ra 0xffffffff88814404 sp 0xffffffff8fc38f40, sz 32
buf_flip_high+0x2d4 (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,fffffff
f88b488d0)  ra 0xffffffff8881519c sp 0xffffffff8fc38f60, sz 48
bufcache_recover_dmapages+0xec (56daa729608a8de9,ffffffff88b48408,ffffffff88b48
ae8,ffffffff88b488d0)  ra 0xffffffff88815630 sp 0xffffffff8fc38f90, sz 96
bufadjust+0x168 (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,ffffffff88b
488d0)  ra 0xffffffff88817674 sp 0xffffffff8fc38ff0, sz 48
bufbackoff+0xcc (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,ffffffff88b
488d0)  ra 0xffffffff88acde48 sp 0xffffffff8fc39020, sz 80
buf_realloc_pages+0x108 (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,fff
fffff88b488d0)  ra 0xffffffff888141c8 sp 0xffffffff8fc39070, sz 96
buf_flip_high+0x98 (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,ffffffff
88b488d0)  ra 0xffffffff8881519c sp 0xffffffff8fc390d0, sz 48
bufcache_recover_dmapages+0xec (56daa729608a8de9,ffffffff88b48408,ffffffff88b48
ae8,ffffffff88b488d0)  ra 0xffffffff88815630 sp 0xffffffff8fc39100, sz 96
bufadjust+0x168 (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,ffffffff88b
488d0)  ra 0xffffffff88817674 sp 0xffffffff8fc39160, sz 48
bufbackoff+0xcc (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,ffffffff88b
488d0)  ra 0xffffffff88acde48 sp 0xffffffff8fc39190, sz 80
buf_realloc_pages+0x108 (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,fff
fffff88b488d0)  ra 0xffffffff888141c8 sp 0xffffffff8fc391e0, sz 96
buf_flip_high+0x98 (56daa729608a8de9,ffffffff88b48408,ffffffff88b48ae8,ffffffff
88b488d0)  ra 0xffffffff8881519c sp 0xffffffff8fc39240, sz 48
User-level: pid 31906
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb>

johnny-mnemonic commented 1 year ago

After checking various configurations, it looks like the IP22 problem is not new: older kernels (7.2, 7.1 and 7.0 with matching octeon FSes, and 6.9 with a matching sgi FS) are also affected and show the same result. So maybe this is just an effect of using an NFS root FS, or a bug that has existed for a while.

UPDATE: Testing on an R4600 Indy showed that it is not affected, so this issue could be specific to the R4400 of the Indy I originally tested. I am unsure whether it is related to the errata mentioned in https://github.com/the-machine-hall/openbsd-src/commit/64ac1c5a7e13fbe4130b9b53f956c4ebff13c665 and https://github.com/the-machine-hall/openbsd-src/commit/833ab59f79f5195f7dcd0b5b888b8d2f3335eac5.
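For readers unfamiliar with the assertion text in the IP22 traces, the following is a minimal userland sketch of what the expression `!ISSET(bp->b_flags, B_DMA)` tests. It is not the code from `vfs_bio.c`: the `ISSET()` macro has the same shape as the helper in OpenBSD's `<sys/param.h>`, but the `B_DMA` value, `struct buf_sketch` and `dma_assert_holds()` are assumptions made up for illustration.

```c
#include <stdint.h>

/* Same shape as the ISSET() helper in OpenBSD's <sys/param.h>. */
#define ISSET(t, f)	((t) & (f))

/* Assumed value, for illustration only; sys/buf.h has the real one. */
#define B_DMA		0x04000000u

/* Hypothetical stand-in for struct buf; only the flags word matters here. */
struct buf_sketch {
	uint32_t b_flags;
};

/*
 * Returns 1 when the asserted condition holds (B_DMA is clear) and 0
 * when a kernel built with DIAGNOSTIC would panic on it (B_DMA is set).
 */
int
dma_assert_holds(const struct buf_sketch *bp)
{
	return !ISSET(bp->b_flags, B_DMA);
}
```

In other words, the panic reports that a buffer still carried the `B_DMA` flag at a point in `buf_flip_high()` where the kernel expected it to be cleared. Why that happens on the R4400 Indy is not explained by this sketch.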

johnny-mnemonic commented 1 year ago

An update for IP28:

Bisecting the problem for IP28 showed that it is actually two-fold and related to the introduction of the clockintr(9) subsystem, specifically to the following two commits:

While the first one can be worked around by not using clockintr for IP28:

diff --git a/sys/arch/mips64/include/_types.h b/sys/arch/mips64/include/_types.h
index 535abead1de..3b30986770c 100644
--- a/sys/arch/mips64/include/_types.h
+++ b/sys/arch/mips64/include/_types.h
@@ -35,7 +35,9 @@
 #ifndef _MIPS64__TYPES_H_
 #define _MIPS64__TYPES_H_

+#if !( defined(TGT_INDIGO2) && defined(CPU_R10000) )
 #define    __HAVE_CLOCKINTR
+#endif

 /*
  * _ALIGN(p) rounds p (pointer or byte index) up to a correctly-aligned
diff --git a/sys/arch/mips64/mips64/mips64_machdep.c b/sys/arch/mips64/mips64/mips64_machdep.c
index be07540f045..f4412f73c9c 100644
--- a/sys/arch/mips64/mips64/mips64_machdep.c
+++ b/sys/arch/mips64/mips64/mips64_machdep.c
@@ -349,12 +349,14 @@ cpu_initclocks(void)
    (*md_startclock)(ci);
 }

 void
 setstatclockrate(int newhz)
 {
+#ifdef __HAVE_CLOCKINTR
    clockintr_setstatclockrate(newhz);
+#endif
 }

 /*
  * Decode instruction and figure out type.
  */

...for the second one neither a workaround nor a solution is available yet.
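To make the effect of the first workaround easier to follow, here is a small standalone sketch of how the preprocessor guard behaves when both `TGT_INDIGO2` and `CPU_R10000` are defined, which is assumed to be the case for a GENERIC-IP28 build. The two `#define` lines at the top, `clockintr_setstatclockrate_sketch()`, `last_stathz` and `clockintr_enabled()` are hypothetical helpers that only exist so the example compiles outside the kernel.

```c
/*
 * Pretend this is a GENERIC-IP28 build. In a real kernel these are
 * assumed to come from the kernel configuration, not the source file.
 */
#define TGT_INDIGO2
#define CPU_R10000

/* Same guard as in the workaround for _types.h. */
#if !(defined(TGT_INDIGO2) && defined(CPU_R10000))
#define __HAVE_CLOCKINTR
#endif

/* Hypothetical: records the last rate handed to the clockintr stand-in. */
int last_stathz = 0;

/* Hypothetical stand-in for clockintr_setstatclockrate(). */
void
clockintr_setstatclockrate_sketch(int newhz)
{
	last_stathz = newhz;
}

void
setstatclockrate(int newhz)
{
#ifdef __HAVE_CLOCKINTR
	clockintr_setstatclockrate_sketch(newhz);
#else
	(void)newhz;	/* clockintr compiled out: nothing to do */
#endif
}

/* Reports at run time whether the guard left clockintr enabled. */
int
clockintr_enabled(void)
{
#ifdef __HAVE_CLOCKINTR
	return 1;
#else
	return 0;
#endif
}
```

With both options defined, `clockintr_enabled()` returns 0 and `setstatclockrate()` becomes a no-op. Removing either `#define` at the top turns clockintr back on, which is why the workaround should leave the other mips64 targets untouched.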

It is also unclear why the other machines I can test (Indy (IP22), Origin200 (IP27), Octane/Octane2 (IP30), O2 (IP32)) are unaffected by the two commits mentioned above.

johnny-mnemonic commented 1 year ago

Comparing the kernel configuration files for IP22 and IP28 uncovered a small difference between the two, namely the existence of a clock0 "device" on IP22 and its absence on IP28. In fact, all of the other SGI systems that I have available for testing and that are supported by OpenBSD/sgi use a clock0 device. Checking the history of those files I came across:

commit 64ac1c5a7e13fbe4130b9b53f956c4ebff13c665
Author: miod <miod@openbsd.org>
Date:   Sat Jul 14 19:53:27 2012 +0000

    A known errata of R4000 and R4400 processors, is that reading the internal
    counter register close to a trigger of the counter interrupt, may cause the
    interrupt not to be generated. This makes it a bad idea to use the internal
    counter both for the scheduling clock and for delay().

    Therefore, on IP22 systems (and IP28 because it makes my life easier), use
    one of the two 8254 timers connected to the onboard interrupt controller as
    the scheduling clock source.

    Adapted from NetBSD.

...which switched both IP22 and IP28 from using clock0 to the timers connected to int0. This was reverted for the Indy soon after with:

commit 833ab59f79f5195f7dcd0b5b888b8d2f3335eac5
Author: miod <miod@openbsd.org>
Date:   Wed Jul 18 19:56:02 2012 +0000

    According to Linux, and just verified the hard way, the 8254 timer does not
    interrupt on Indy; do not use it on such systems. Then, bring back a clock0 at
    mainbus attachment to IP22 kernels, and attach it late in the autoconf process
    if no other device has claimed the clock yet.

    This means R4000 and R4400 based Indy may experience the lost clock interrupt
    processor errata again, until a better way to skirt it is found.

And "bring[ing] back a clock0" to IP28 with:

diff --git a/sys/arch/sgi/conf/GENERIC-IP28 b/sys/arch/sgi/conf/GENERIC-IP28
index 9918a08414c..afcae927626 100644
--- a/sys/arch/sgi/conf/GENERIC-IP28
+++ b/sys/arch/sgi/conf/GENERIC-IP28
@@ -37,6 +37,7 @@ config        bsd swap generic
 #
 mainbus0   at root
 cpu*       at mainbus0
+clock0     at mainbus0

 int0       at mainbus0 # Interrupt Controller and scheduling clock
 imc0       at mainbus0 # Memory Controller
diff --git a/sys/arch/sgi/conf/RAMDISK-IP28 b/sys/arch/sgi/conf/RAMDISK-IP28
index e07ea14fbe7..389b0d3655d 100644
--- a/sys/arch/sgi/conf/RAMDISK-IP28
+++ b/sys/arch/sgi/conf/RAMDISK-IP28
@@ -31,6 +31,7 @@ config        bsd root on rd0a swap on rd0b

 mainbus0   at root
 cpu*       at mainbus0
+clock0     at mainbus0

 int0       at mainbus0     # Interrupt Controller and scheduling clock
 imc0       at mainbus0     # Memory Controller
diff --git a/sys/arch/sgi/localbus/int.c b/sys/arch/sgi/localbus/int.c
index c76df00762d..09c06291ce5 100644
--- a/sys/arch/sgi/localbus/int.c
+++ b/sys/arch/sgi/localbus/int.c
@@ -375,8 +375,7 @@ int2_attach(struct device *parent, struct device *self, void *aux)
    /*
     * The 8254 timer does not interrupt on (some?) IP24 systems.
     */
-   if (sys_config.system_type == SGI_IP20 ||
-       sys_config.system_subtype == IP22_INDIGO2)
+   if (sys_config.system_type == SGI_IP20)
        int_8254_cal();
 }

...fixes/works around the breakage caused by the two commits mentioned in https://github.com/the-machine-hall/openbsd-src/issues/2#issuecomment-1510429607. Together with the fix/workaround from #1 the IP28 kernel boots fine again, see https://dmesgd.nycbug.org/index.cgi?do=view&id=7100 for details.

IP22_INDIGO2 actually includes all Indigo²s (i.e. IP22, IP26 and IP28), so removing that from the clause might be a little too much, but as I only rebuilt the kernel for IP28 with this patch applied, it makes no difference for IP22 and IP26. I might later enable the timers again for IP22 and IP26, but I expect them to break similarly to IP28 without the patch. So in the end the better solution might be to use a clock0 on IP20, IP22 and IP26, too, or to fix the 8254-related code with regard to the commits mentioned in https://github.com/the-machine-hall/openbsd-src/issues/2#issuecomment-1510429607.
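The scope of the `int.c` change can be sketched as two predicates, one for the condition before the patch and one for the condition after it. This is only an illustration: the enum values and `struct sys_config_sketch` are hypothetical stand-ins for the kernel's `sys_config`, and `uses_8254_before()` / `uses_8254_after()` are made-up names.

```c
/* Hypothetical numeric values; only the names mirror the kernel's. */
enum system_type_sketch { SGI_IP20 = 20, SGI_IP22 = 22, SGI_IP26 = 26, SGI_IP28 = 28 };
enum system_subtype_sketch { IP22_INDY, IP22_CHALLS, IP22_INDIGO2 };

/* Hypothetical stand-in for the two sys_config fields the check reads. */
struct sys_config_sketch {
	int system_type;
	int system_subtype;
};

/* Condition before the patch: IP20 and every Indigo2 calibrate the 8254. */
int
uses_8254_before(const struct sys_config_sketch *c)
{
	return c->system_type == SGI_IP20 ||
	    c->system_subtype == IP22_INDIGO2;
}

/* Condition after the patch: only IP20 calibrates the 8254. */
int
uses_8254_after(const struct sys_config_sketch *c)
{
	return c->system_type == SGI_IP20;
}
```

For an IP28 (`{ SGI_IP28, IP22_INDIGO2 }`) the first predicate is true and the second is false, which is the intended change. The same flip happens for an IP22 or IP26 Indigo² if a kernel for those machines is rebuilt with the patch, which is the side effect described above.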