starfive-tech / linux

Other
152 stars 112 forks source link

workqueue lockup in rcu_sched #18

Closed oaken-source closed 3 years ago

oaken-source commented 3 years ago

Using linux-5.12 on the 7100, after a couple hours or days of running, I get messages like the following on the serial console:

[271072.624733] BUG: workqueue lockup - pool cpus=0-1 flags=0x4 nice=0 stuck for 2530s!
[271072.632592] Showing busy workqueues and worker pools:
[271072.637810] workqueue events_unbound: flags=0x2
[271072.642464]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=3/512 refcnt=5
[271072.648930]     pending: flush_to_ldisc, flush_to_ldisc, flush_to_ldisc
[271072.655750] workqueue ext4-rsv-conversion: flags=0x2000a
[271072.661169]   pwq 4: cpus=0-1 flags=0x4 nice=0 active=1/1 refcnt=3
[271072.667470]     pending: ext4_end_io_rsv_work
[271072.671983] pool 1: cpus=0 node=0 flags=0x1 nice=-20 hung=31s workers=2 manager: 816094 idle: 823322
[271081.464351] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[271081.470431]         (detected by 0, t=254302 jiffies, g=8370033, q=8218)
[271081.476627] rcu: All QSes seen, last rcu_sched kthread activity 254300 (4322046580-4321792280), jiffies_till_next_fqs=1, root ->qsmask 0x0
[271081.489167] rcu: rcu_sched kthread timer wakeup didn't happen for 254299 jiffies! g8370033 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
[271081.500810] rcu:    Possible timer handling issue on cpu=1 timer-softirq=2492648
[271081.508132] rcu: rcu_sched kthread starved for 254300 jiffies! g8370033 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=1
[271081.518846] rcu:    Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[271081.527904] rcu: RCU grace-period kthread stack dump:
[271081.533050] task:rcu_sched       state:R stack:    0 pid:   11 ppid:     2 flags:0x00000000
[271081.541522] Call Trace:
[271081.544064] [<ffffffe000616bdc>] __schedule+0x1be/0x4a6
[271081.549410] [<ffffffe000616f1a>] schedule+0x56/0xca
[271081.554390] [<ffffffe000619b2a>] schedule_timeout+0x68/0xca
[271081.560071] [<ffffffe00005cddc>] rcu_gp_kthread+0x510/0x93e
[271081.565756] [<ffffffe000023bfa>] kthread+0xfe/0x10c
[271081.570739] [<ffffffe0000032b6>] ret_from_exception+0x0/0xc
[271081.576414] rcu: Stack dump where RCU GP kthread last ran:
[271081.581990] Task dump for CPU 1:
[271081.585308] task:sh              state:R  running task     stack:    0 pid:823462 ppid:822764 flags:0x00000000
[271081.595429] Call Trace:
[271081.597967] [<ffffffe000616bdc>] __schedule+0x1be/0x4a6

This repeats ad nauseum, while everything else seems to be locked up. The serial console is otherwise unresponsive and network access through ping and ssh simply times out.

sarnold commented 3 years ago

I can name that tune in 2 bars: broadcom OOPS => workqueue BUG

However, I've really only seen these on startup (ie, when wifi gets modprobed).

Starting kernel ...

[    0.000000] Linux version 5.12.2-riscv64-x1 (sarnold@ptolemy) (riscv64-unknown-linux-gnu-gcc (Gentoo Hardened 10.2.0-r5 p6) 10.2.0, GNU ld (Gentoo 2.35.2 p1) 2.35.2) #1 SMP Sun May 9 17:32:18 PDT 2021
[    0.000000] OF: fdt: Ignoring memory range 0x80000000 - 0x80200000
[    0.000000] Machine model: BeagleV Starlight Beta
[    0.000000] efi: UEFI not found.
[    0.000000] Reserved memory: created CMA memory pool at 0x00000000a0000000, size 640 MiB
[    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
[    0.000000] Reserved memory: created DMA memory pool at 0x00000000f9000000, size 16 MiB
[    0.000000] OF: reserved mem: initialized node framebuffer@f9000000, compatible id shared-dma-pool
[    0.000000] Reserved memory: created DMA memory pool at 0x00000000fb000000, size 32 MiB
[    0.000000] OF: reserved mem: initialized node framebuffer@fb000000, compatible id shared-dma-pool
[    0.000000] NUMA: No NUMA configuration found
[    0.000000] NUMA: Faking a node at [mem 0x0000000080200000-0x000000027fffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x27ffca980-0x27ffd1fff]
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000080200000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x000000027fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080200000-0x00000000f8ffffff]
[    0.000000]   node   0: [mem 0x00000000f9000000-0x00000000f9ffffff]
[    0.000000]   node   0: [mem 0x00000000fa000000-0x00000000faffffff]
[    0.000000]   node   0: [mem 0x00000000fb000000-0x00000000fcffffff]
[    0.000000]   node   0: [mem 0x00000000fd000000-0x000000027fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080200000-0x000000027fffffff]
[    0.000000]   DMA32 zone: 512 pages in unavailable ranges
[    0.000000] SBI specification v0.2 detected
[    0.000000] SBI implementation ID=0x1 Version=0x9
[    0.000000] SBI v0.2 TIME extension detected
[    0.000000] SBI v0.2 IPI extension detected
[    0.000000] SBI v0.2 RFENCE extension detected
[    0.000000] software IO TLB: mapped [mem 0x000000009c000000-0x00000000a0000000] (64MB)
[    0.000000] SBI v0.2 HSM extension detected
[    0.000000] riscv: ISA extensions acdfim
[    0.000000] riscv: ELF capabilities acdfim
[    0.000000] percpu: Embedded 17 pages/cpu s32552 r8192 d28888 u69632
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 2063880
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: earlyprintk console=ttyS0,115200 console=tty1 root=PARTUUID=2c4185b7-8b17-4531-adde-10e22373a1d5 rw rootfstype=ext4 rootwait net.ifnames=0 stmmaceth=chain_mode:1
[    0.000000] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes, linear)
[    0.000000] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[    0.000000] Sorting __ex_table...
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 6716904K/8386560K available (6446K kernel code, 4871K rwdata, 2048K rodata, 2163K init, 405K bss, 1014296K reserved, 655360K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu:     RCU restricting CPUs from NR_CPUS=8 to nr_cpu_ids=2.
[    0.000000] rcu:     RCU debug extended QS entry/exit.
[    0.000000]  Tracing variant of Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 64 local interrupts mapped
[    0.000000] plic: disable L2 cache irq 128 in plic
[    0.000000] plic: disable L2 cache irq 131 in plic
[    0.000000] plic: disable L2 cache irq 129 in plic
[    0.000000] plic: disable L2 cache irq 130 in plic
[    0.000000] plic: disable L2 cache irq 128 in plic
[    0.000000] plic: disable L2 cache irq 131 in plic
[    0.000000] plic: disable L2 cache irq 129 in plic
[    0.000000] plic: disable L2 cache irq 130 in plic
[    0.000000] plic: interrupt-controller@c000000: mapped 127 interrupts with 2 handlers for 4 contexts.
[    0.000000] random: get_random_bytes called from start_kernel+0x508/0x77c with crng_init=0
[    0.000000] riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [1]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0x171024e6b, max_idle_ns: 440795202301 ns
[    0.000002] sched_clock: 64 bits at 6MHz, resolution 160ns, wraps every 4398046511040ns
[    0.000288] Console: colour dummy device 80x25
[    0.001945] printk: console [tty1] enabled
[    0.002198] Calibrating delay loop (skipped), value calculated using timer frequency.. 12.50 BogoMIPS (lpj=62500)
[    0.002295] pid_max: default: 32768 minimum: 301
[    0.002467] LSM: Security Framework initializing
[    0.003162] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[    0.003823] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes, linear)
[    0.009065] ASID allocator disabled
[    0.009587] rcu: Hierarchical SRCU implementation.
[    0.010698] EFI services will not be available.
[    0.011889] smp: Bringing up secondary CPUs ...
[    0.014133] smp: Brought up 1 node, 2 CPUs
[    0.019164] devtmpfs: initialized
[    0.026042] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.026181] futex hash table entries: 512 (order: 3, 32768 bytes, linear)
[    0.120456] NET: Registered protocol family 16
[    0.121936] thermal_sys: Registered thermal governor 'step_wise'
[    0.145885] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.145992] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    0.160724] starfive_gpio 11910000.gpio: StarFive GPIO chip registered 0 GPIOs
[    0.162516] SCSI subsystem initialized
[    0.162949] usbcore: registered new interface driver usbfs
[    0.163163] usbcore: registered new interface driver hub
[    0.163299] usbcore: registered new device driver usb
[    0.164022] i2c_designware 118b0000.i2c: running with gpio recovery mode! scl,sda
[    0.165562] i2c_designware 118c0000.i2c: running with gpio recovery mode! scl,sda
[    0.166498] i2c_designware 12450000.i2c: running with gpio recovery mode! scl,sda
[    0.168883] clocksource: Switched to clocksource riscv_clocksource
[    0.193078] NET: Registered protocol family 2
[    0.194869] tcp_listen_portaddr_hash hash table entries: 4096 (order: 5, 163840 bytes, linear)
[    0.195813] TCP established hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.198811] TCP bind hash table entries: 65536 (order: 9, 2097152 bytes, linear)
[    0.209167] TCP: Hash tables configured (established 65536 bind 65536)
[    0.210151] UDP hash table entries: 4096 (order: 6, 393216 bytes, linear)
[    0.212137] UDP-Lite hash table entries: 4096 (order: 6, 393216 bytes, linear)
[    0.214490] NET: Registered protocol family 1
[    0.216331] RPC: Registered named UNIX socket transport module.
[    0.216423] RPC: Registered udp transport module.
[    0.216454] RPC: Registered tcp transport module.
[    0.216482] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    0.219211] Initialise system trusted keyrings
[    0.219871] workingset: timestamp_bits=38 max_order=21 bucket_order=0
[    0.231443] zbud: loaded
[    0.237337] NFS: Registering the id_resolver key type
[    0.237457] Key type id_resolver registered
[    0.237500] Key type id_legacy registered
[    0.237768] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
[    0.237842] nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
[    0.237951] ntfs: driver 2.1.32 [Flags: R/W].
[    0.239046] fuse: init (API version 7.33)
[    0.280550] Key type asymmetric registered
[    0.280652] Asymmetric key parser 'x509' registered
[    0.280802] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 251)
[    0.280944] io scheduler mq-deadline registered
[    0.280992] io scheduler kyber registered
[    0.281487] io scheduler bfq registered
[    0.286536] starfive,vpp-lcdc 12000000.sfivefb: starfive_fb_probe
[    0.286732] starfive,vpp-lcdc 12000000.sfivefb: no dev name matched
[    0.286848] starfive,vpp-lcdc: probe of 12000000.sfivefb failed with error -22
[    0.288481] dw_axi_dmac_platform 100b0000.dma-controller: DesignWare AXI DMA Controller, 4 channels
[    0.290926] dw_axi_dmac_platform 10500000.dma-controller: DesignWare AXI DMA Controller, 16 channels
[    0.291684] L2CACHE: No. of Banks in the cache: 2
[    0.291750] L2CACHE: No. of ways per bank: 16
[    0.291780] L2CACHE: Sets per bank: 1024
[    0.291807] L2CACHE: Bytes per cache block: 64
[    0.291838] L2CACHE: Index of the largest way enabled: 0
[    0.444292] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.447569] dw-apb-uart 11870000.serial: detected caps 00000000 should be 00000100
[    0.447670] 11870000.serial: ttyS1 at MMIO 0x11870000 (irq = 6, base_baud = 4640625) is a 16550
[    0.450204] 12440000.serial: ttyS0 at MMIO 0x12440000 (irq = 7, base_baud = 6250000) is a 16550A
[    1.273540] printk: console [ttyS0] enabled
[    1.281244] vic-rng 118d0000.trng: Initialized
[    1.312944] loop: module loaded
[    1.317198] tps65086 0-005e: Device: TPS650861, OTP: A, Rev: 0
[    1.324996] dw_spi_mmio 12410000.spi: DMA init failed
[    1.333183] libphy: Fixed MDIO Bus: probed
[    1.339073] stmmaceth 10020000.ethernet: IRQ eth_lpi not found
[    1.345122] stmmaceth 10020000.ethernet: force_sf_dma_mode is ignored if force_thresh_dma_mode is set.
[    1.354621] stmmaceth 10020000.ethernet: PTP uses main clock
[    1.361045] stmmaceth 10020000.ethernet: User ID: 0x59, Synopsys ID: 0x37
[    1.367889] stmmaceth 10020000.ethernet:     DWMAC1000
[    1.372868] stmmaceth 10020000.ethernet: DMA HW capability register supported
[    1.380075] stmmaceth 10020000.ethernet: RX Checksum Offload Engine supported
[    1.387222] stmmaceth 10020000.ethernet: COE Type 2
[    1.392141] stmmaceth 10020000.ethernet: Wake-Up On Lan supported
[    1.398243] stmmaceth 10020000.ethernet: Enhanced/Alternate descriptors
[    1.404887] stmmaceth 10020000.ethernet: Enabled extended descriptors
[    1.411381] stmmaceth 10020000.ethernet: Chain mode enabled
[    1.416972] stmmaceth 10020000.ethernet: Enable RX Mitigation via HW Watchdog Timer
[    1.424685] stmmaceth 10020000.ethernet: device MAC address 2c:f7:f1:1b:e3:b1
[    1.444018] libphy: stmmac: probed
[    1.447501] Micrel KSZ9031 Gigabit PHY stmmac-0:00: attached PHY driver (mii_bus:phy_addr=stmmac-0:00, irq=POLL)
[    1.457830] Micrel KSZ9031 Gigabit PHY stmmac-0:07: attached PHY driver (mii_bus:phy_addr=stmmac-0:07, irq=POLL)
[    1.472153] xhci-hcd xhci-hcd.3.auto: xHCI Host Controller
[    1.477750] xhci-hcd xhci-hcd.3.auto: new USB bus registered, assigned bus number 1
[    1.486748] xhci-hcd xhci-hcd.3.auto: hcc params 0x200073c8 hci version 0x100 quirks 0x0000002000018010
[    1.496380] xhci-hcd xhci-hcd.3.auto: irq 12, io mem 0x104d0000
[    1.504398] hub 1-0:1.0: USB hub found
[    1.508318] hub 1-0:1.0: 1 port detected
[    1.513126] xhci-hcd xhci-hcd.3.auto: xHCI Host Controller
[    1.518690] xhci-hcd xhci-hcd.3.auto: new USB bus registered, assigned bus number 2
[    1.526522] xhci-hcd xhci-hcd.3.auto: Host supports USB 3.0 SuperSpeed
[    1.533406] usb usb2: We don't know the algorithms for LPM for this host, disabling LPM.
[    1.543049] hub 2-0:1.0: USB hub found
[    1.546966] hub 2-0:1.0: 1 port detected
[    1.552468] usbcore: registered new interface driver uas
[    1.558033] usbcore: registered new interface driver usb-storage
[    1.564805] sfc_temp_sensor 124a0000.tmpsensor: probe
[    1.580195] temp(c): 32.54
[    1.583659] sdhci: Secure Digital Host Controller Interface driver
[    1.589947] sdhci: Copyright(c) Pierre Ossman
[    1.594478] Synopsys Designware Multimedia Card Interface Driver
[    1.600854] sdhci-pltfm: SDHCI platform and OF driver helper
[    1.601603] dw_mmc 10000000.mmc: IDMAC supports 32-bit address mode.
[    1.607109] usbcore: registered new interface driver usbhid
[    1.613483] dw_mmc 10010000.mmc: IDMAC supports 32-bit address mode.
[    1.618612] usbhid: USB HID core driver
[    1.625066] dw_mmc 10000000.mmc: Using internal DMA controller.
[    1.634929] dw_mmc 10000000.mmc: Version ID is 290a
[    1.638075] NET: Registered protocol family 17
[    1.639969] dw_mmc 10010000.mmc: Using internal DMA controller.
[    1.644952] 9pnet: Installing 9P2000 support
[    1.650444] dw_mmc 10010000.mmc: Version ID is 290a
[    1.654906] Key type dns_resolver registered
[    1.659794] dw_mmc 10000000.mmc: DW MMC controller at irq 29,32 bit host data width,32 deep fifo
[    1.664750] registered taskstats version 1
[    1.673021] dw_mmc 10010000.mmc: DW MMC controller at irq 30,32 bit host data width,32 deep fifo
[    1.677144] Loading compiled-in X.509 certificates
[    1.691092] mmc_host mmc0: card is polling.
[    1.696077] mmc_host mmc1: card is non-removable.
[    1.704952] zswap: loaded using pool zstd/zbud
[    1.888938] usb 1-1: new high-speed USB device number 2 using xhci-hcd
[    1.908969] mmc_host mmc0: Bus speed (slot 0) = 100000000Hz (slot req 400000Hz, actual 400000HZ div = 125)
[    1.918943] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 400000Hz, actual 400000HZ div = 125)
[    2.082472] hub 1-1:1.0: USB hub found
[    2.086938] hub 1-1:1.0: 4 ports detected
[    2.139385] Waiting for root device PARTUUID=2c4185b7-8b17-4531-adde-10e22373a1d5...
[    2.161116] mmc1: queuing unknown CIS tuple 0x80 (2 bytes)
[    2.168025] mmc1: queuing unknown CIS tuple 0x80 (3 bytes)
[    2.175012] mmc1: queuing unknown CIS tuple 0x80 (3 bytes)
[    2.183035] mmc1: queuing unknown CIS tuple 0x80 (7 bytes)
[    2.191541] mmc1: queuing unknown CIS tuple 0x81 (9 bytes)
[    2.202490] random: fast init done
[    2.206037] mmc_host mmc0: Bus speed (slot 0) = 100000000Hz (slot req 10000000Hz, actual 10000000HZ div = 5)
[    2.216135] mmc0: new high speed SDHC card at address aaaa
[    2.219421] usb 2-1: new SuperSpeed Gen 1 USB device number 2 using xhci-hcd
[    2.225366] mmcblk0: mmc0:aaaa SS16G 14.8 GiB
[    2.257091] hub 2-1:1.0: USB hub found
[    2.261941] hub 2-1:1.0: 4 ports detected
[    2.268312]  mmcblk0: p1 p2 p3
[    2.317088] mmc_host mmc1: Bus speed (slot 0) = 100000000Hz (slot req 10000000Hz, actual 10000000HZ div = 5)
[    2.335234] mmc1: new high speed SDIO card at address 0001
[    2.336741] EXT4-fs (mmcblk0p3): mounted filesystem with ordered data mode. Opts: (null). Quota mode: disabled.
[    2.351205] VFS: Mounted root (ext4 filesystem) on device 179:3.
[    2.393264] devtmpfs: mounted
[    2.405931] Freeing unused kernel memory: 2160K
[    2.410681] Run /sbin/init as init process
[   16.734425] Oops [#1]
[   16.736746] Modules linked in: cfg80211 rfkill brcmutil tps65086_regulator
[   16.743692] CPU: 1 PID: 78 Comm: kworker/1:2 Not tainted 5.12.2-riscv64-x1 #1
[   16.750876] Hardware name: BeagleV Starlight Beta (DT)
[   16.756058] Workqueue: events request_firmware_work_func
[   16.761448] epc : regdb_fw_cb+0x0/0x108 [cfg80211]
[   16.768028]  ra : request_firmware_work_func+0x3e/0x82
[   16.773234] epc : ffffffdf8117ca2a ra : ffffffe00041c454 sp : ffffffe0812d7da0
[   16.780497]  gp : ffffffe0010de670 tp : ffffffe080050000 t0 : ffffffe08028d598
[   16.787797]  t1 : 0000000000000000 t2 : ef02e302c302ac01 s0 : ffffffe0812d7dd0
[   16.795060]  s1 : ffffffe07ff68840 a0 : ffffffe07ff5f940 a1 : ffffffe07ff65458
[   16.802323]  a2 : 0000000000000000 a3 : b835966eb1ae4a00 a4 : b835966eb1ae4a00
[   16.809599]  a5 : ffffffdf8117ca2a a6 : ffffffe07ffa5000 a7 : 0000000000000002
[   16.816863]  s2 : ffffffe0010df0a8 s3 : 0000000000000000 s4 : ffffffe1fecda640
[   16.824128]  s5 : ffffffe1fecde400 s6 : 0000000000000000 s7 : ffffffe07ff68848
[   16.831393]  s8 : ffffffe0000212be s9 : ffffffe0010df2e0 s10: ffffffe000028510
[   16.838654]  s11: 0000000000000000 t3 : ffffffe000a1a950 t4 : ffffffe0011122f8
[   16.845921]  t5 : 000105030000bf02 t6 : ffffffd00400df64
[   16.851271] status: 0000000200000120 badaddr: 0000005f8117ca2a cause: 000000000000000c
[   16.859240] Call Trace:
[   16.861724] [<ffffffdf8117ca2a>] regdb_fw_cb+0x0/0x108 [cfg80211]
[   16.869590] [<ffffffe0000210c8>] process_one_work+0x13c/0x27c
[   16.875374] [<ffffffe000021284>] worker_thread+0x7c/0x320
[   16.880807] [<ffffffe00002734a>] kthread+0x104/0x116
[   16.885821] [<ffffffe0000032e6>] ret_from_exception+0x0/0xc
[   16.891557] ---[ end trace e05b15e7236c613c ]---
[   20.724592] urandom_read: 3 callbacks suppressed
[   20.724626] random: ln: uninitialized urandom read (6 bytes read)
[   22.210161] random: mktemp: uninitialized urandom read (10 bytes read)
[   25.501540] zram: Added device: zram0
[   25.506621] zram: Added device: zram1
[   25.511837] zram: Added device: zram2
[   25.694313] zram0: detected capacity change from 0 to 1048576
[   26.838945] Adding 524284k swap on /dev/zram0.  Priority:16383 extents:1 across:524284k SSDscFS
[   27.019323] zram1: detected capacity change from 0 to 4194304
[   29.398124] EXT4-fs (zram1): mounted filesystem without journal. Opts: discard. Quota mode: disabled.
[   29.585589] zram2: detected capacity change from 0 to 4194304
[   31.763205] EXT4-fs (zram2): mounting ext2 file system using the ext4 subsystem
[   31.773983] EXT4-fs (zram2): mounted filesystem without journal. Opts: (null). Quota mode: disabled.
[   32.524011] random: dd: uninitialized urandom read (512 bytes read)
[   61.788896] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 44s!
oaken-source commented 3 years ago

I have the wifi module blacklisted:

$> cat /etc/modprobe.d/blacklist.conf 
blacklist bcmdhd
blacklist brcmfmac
blacklist brcmutil
blacklist cfg80211

I haven't checked whether it is built-in in the kernel config I use though.

pdp7 commented 3 years ago

@esmil @sarnold @oaken-source have you ever observed this rcu_sched error in the 5.10 kernel? @tekkamanninja and I are trying to narrow down what might be the cause

oaken-source commented 3 years ago

Yes, 5.10 showed the same behaviour.

oaken-source commented 3 years ago

I think the cause is probably I/O related, it seems to only happen when the system writes a lot of data to the SD card, but I need to verify that still

oaken-source commented 3 years ago

I've been compiling gcc for the last 48 hours, but as soon as that's finished, I have a 5.12.4 kernel built based on https://github.com/esmil/linux/commits/starlight-5.12.y that I will try

pdp7 commented 3 years ago

@oaken-source any update on the testing? thanks!

oaken-source commented 3 years ago

@pdp7 unfortunately, I haven't managed to reliably trigger the board to lock up. It seems to only happen when I'm not looking. But it's still locking up on 5.12.6. I've posted a few more details on the issue you linked above

oaken-source commented 3 years ago

it's been pointed out to me that I'm using the u-boot device trees instead of the ones built with the linux kernel. I'll test if using the correct dtb's fixes the issue and report back.

oaken-source commented 3 years ago

with the updated dtb's, the system still locks up, but now it does so quietly, without any obvious related kernel output.

oaken-source commented 3 years ago

I experienced two more lockups after 27 and 22 hours of runtime. the first one was silent, the second one produced

[84359.606936] dw_mmc 10000000.mmc: Unexpected interrupt latency

shortly before locking up.

oaken-source commented 3 years ago

here is another full output: https://pastebin.com/6b2BgmFd

pdp7 commented 3 years ago

This stability is likely related to the SD card issue: https://github.com/starfive-tech/linux/issues/15

pdp7 commented 3 years ago

@oaken-source did you get a new SD card yet?

We thinks this might be resolved with this new fw_paylouad.bin.out https://forum.beagleboard.org/t/new-firmware-boot-hang-is-fixed/29986/3

oaken-source commented 3 years ago

I've been using that new firmware/bootloader for a couple days now. No issues so far, uptime is at 8 days.

So I think it looks like the issue is indeed resolved with this new image.

pdp7 commented 3 years ago

great to hear