Closed shenki closed 8 years ago
14:08 < NormJ> that issue only occurs on a EVT barreleye. doesn't happen ever
on DVT (newer) systems
14:08 < miltonm> NormJ one EVT or all ?
14:09 < NormJ> all the ones doug tried, but not all. i have a an EVT barreleye
that doesn't have issue.
Unable to handle kernel paging request at virtual address 01b50325
pgd = ce4a4000
[01b50325] *pgd=00000000
Internal error: Oops: 1 [#1] ARM
Modules linked in:
CPU: 0 PID: 1103 Comm: umount Not tainted 4.3.5-openbmc-20160212-1 #1
Hardware name: ASpeed SoC
task: cfa3a180 ti: cfb3e000 task.ti: cfb3e000
PC is at kfree+0x3c/0xa8
LR is at squashfs_cache_delete+0x60/0x9c
pc : [<c008815c>] lr : [<c01236e4>] psr: 00000093
sp : cfb3ff18 ip : 00000010 fp : beeb0ebc
r10: 00000000 r9 : cfb3e000 r8 : 00000038
r7 : 00000004 r6 : a0000013 r5 : cf8db000 r4 : cfadad40
r3 : 01b50309 r2 : ffffffff r1 : cfdf9000 r0 : cf8db000
Flags: nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
Control: 0005317f Table: 4e4a4000 DAC: 00000051
Process umount (pid: 1103, stack limit = 0xcfb3e190)
Stack: (0xcfb3ff18 to 0xcfb40000)
ff00: cdcafa00 cfadad40
ff20: 00000000 00000000 00000004 c01236e4 cdcaf900 cdcf4800 c052e62c 00000034
ff40: c000a424 c0125dc8 c0125d9c cdcf4800 c037d180 c008e5dc 00000000 cf403b80
ff60: 00000081 c008e810 cdcf4800 c050c548 c052e62c c008e910 ffff0001 cdcb0c20
ff80: cfae387c c00a42ac cfa3a180 c002a97c c000a424 cfb3e000 cfb3ffb0 c000cba8
ffa0: 000bc1a0 000bc290 000bc1c0 c000a2d0 00000000 00000000 00000000 000bc1a0
ffc0: 000bc1a0 000bc290 000bc1c0 00000034 00000000 000bc2c0 00000000 beeb0ebc
ffe0: 000aa3a8 beeb0cc4 0004ebc4 4f77407c 60000010 000bc1c0 cfbe4a40 00000000
[<c008815c>] (kfree) from [<c01236e4>] (squashfs_cache_delete+0x60/0x9c)
[<c01236e4>] (squashfs_cache_delete) from [<c0125dc8>] (squashfs_put_super+0x2c/0x70)
[<c0125dc8>] (squashfs_put_super) from [<c008e5dc>] (generic_shutdown_super+0x6c/0xc0)
[<c008e5dc>] (generic_shutdown_super) from [<c008e810>] (kill_block_super+0x18/0x68)
[<c008e810>] (kill_block_super) from [<c008e910>] (deactivate_locked_super+0x44/0x74)
[<c008e910>] (deactivate_locked_super) from [<c00a42ac>] (cleanup_mnt+0x4c/0x70)
[<c00a42ac>] (cleanup_mnt) from [<c002a97c>] (task_work_run+0x6c/0x80)
[<c002a97c>] (task_work_run) from [<c000cba8>] (do_work_pending+0xa4/0xc0)
[<c000cba8>] (do_work_pending) from [<c000a2d0>] (slow_work_pending+0xc/0x20)
Code: e0813282 e7912282 e3120902 1593301c (e593701c)
---[ end trace 293ad3070fe69835 ]---
This happens on occasionally on my EVT system with latest kernel including jffs2. Looks similar to fail #4 above:
root@barreleye:~# reboot
[ OK ] Stopped target System Time Synchronized.
[ OK ] Stopped Dispatch Password Requests to Console Directory Watch.
[ OK ] Stopped target Swap.
Stopping SSH Per-Connection Server (9.3.62.94:40890)...
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Stopped target Network.
[ OK ] Stopped Apply Kernel Variables.
Stopping Load/Save Random Seed...
Stopping Network Time Synchronization...
Stopping SSH Per-Connection Server (9.123.229.28:49692)...
[ OK ] Stopped target Local File Systems.
Unmounting /var/volatile...
BUG: Bad page map in process systemd-timesyn pte:4da32100 pmd:4fb2b831
addr:b6e00000 vm_flags:00000075 anon_vma: (null) mapping:cf48b874 index:0
file:libpthread-2.22.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:squashfs_readpage
CPU: 0 PID: 850 Comm: systemd-timesyn Not tainted 4.3.6-openbmc-20160222-1 #1
Hardware name: ASpeed SoC
[<c000f28c>] (unwind_backtrace) from [<c000cf0c>] (show_stack+0x10/0x14)
[<c000cf0c>] (show_stack) from [<c00787c8>] (print_bad_pte+0x158/0x190)
[<c00787c8>] (print_bad_pte) from [<c0079cec>] (unmap_single_vma+0x4ec/0x500)
[<c0079cec>] (unmap_single_vma) from [<c007a438>] (unmap_vmas+0x44/0x54)
[<c007a438>] (unmap_vmas) from [<c007e97c>] (exit_mmap+0xc8/0x1dc)
[<c007e97c>] (exit_mmap) from [<c0014770>] (mmput+0x38/0xb8)
[<c0014770>] (mmput) from [<c0017380>] (do_exit+0x2f0/0x7b0)
[<c0017380>] (do_exit) from [<c0018654>] (do_group_exit+0x4c/0xb8)
[<c0018654>] (do_group_exit) from [<c00186d0>] (__wake_up_parent+0x0/0x18)
Disabling lock debugging due to kernel taint
Unmounting /run/initramfs/rw...
Stopping SSH Per-Connection Server (9.109.165.198:54558)...
[ OK ] Removed slice system-getty.slice.
[ OK ] Stopped target Remote File Systems.
[ OK ] Stopped target Multi-User System.
Stopping Phosphor OpenBMC BT to DBUS...
Stopping D-Bus System Message Bus...
Stopping Temp placeholder for skeleton function...
Stopping DBUS introspecting REST server....
Stopping Phosphor OpenBMC event management daemon...
Stopping System Logging Service...
[ OK ] Stopped target Login Prompts.
Stopping Serial Getty on ttyS4...
Stopping Phosphor OpenBMC user management daemon...
Stopping Login Service...
Stopping Phosphor OpenBMC DBus REST daemon...
[ OK ] Stopped Forward Password Requests to Wall Directory Watch.
Stopping SSH Per-Connection Server (9.109.165.198:54737)...
Unmounting /run/initramfs/ro...
Stopping Kernel Logging Service...
[ OK ] Stopped Login Service.
[ OK ] Stopped DBUS introspecting REST server..
[ OK ] Stopped Phosphor OpenBMC BT to DBUS.
[ OK ] Stopped Phosphor OpenBMC event management daemon.
[ OK ] Stopped Kernel Logging Service.
BUG: Bad rss-counter state mm:ce491640 idx:0 val:1
BUG: Bad rss-counter state mm:ce491640 idx:2 val:-1
[ OK ] Stopped D-Bus System Message Bus.
[ OK ] Stopped System Logging Service.
[ OK ] Stopped Network Time Synchronization.
[ OK ] Stopped Phosphor OpenBMC user management daemon.
[ OK ] Stopped Serial Getty on ttyS4.
[ OK ] Stopped SSH Per-Connection Server (9.3.62.94:40890).
[ OK ] Stopped SSH Per-Connection Server (9.123.229.28:49692).
[ OK ] Stopped Temp placeholder for skeleton function.
[ OK ] Stopped SSH Per-Connection Server (9.109.165.198:54558).
[ OK ] Stopped SSH Per-Connection Server (9.109.165.198:54737).
[ OK ] Stopped Load/Save Random Seed.
[ OK ] Unmounted /var/volatile.
[ OK ] Unmounted /run/initramfs/rw.
[ OK ] Unmounted /run/initramfs/ro.
[ OK ] Stopped Phosphor OpenBMC DBus REST daemon.
[ OK ] Stopped Phosphor OpenBMC DBus service management daemon.
[ OK ] Reached target Unmount All Filesystems.
[ OK ] Removed slice system-dropbear.slice.
[ OK ] Closed dropbear.socket.
[ OK ] Removed slice system-serial\x2dgetty.slice.
[ OK ] Closed Syslog Socket.
[ OK ] Removed slice User and Session Slice.
[ OK ] Reached target Shutdown.
systemd-shutdown[1]: Sending SIGTERM to remaining processes...
systemd-journald[520]: Received SIGTERM from PID 1 (systemd-shutdow).
systemd-shutdown[1]: Sending SIGKILL to remaining processes...
systemd-shutdown[1]: Sending SIGKILL to PID 1321 (sh).
systemd-shutdown[1]: Hardware watchdog 'aspeed_wdt', version 0
systemd-shutdown[1]: Unmounting file systems.
systemd-shutdown[1]: Unmounting /tmp.
systemd-shutdown[1]: All filesystems unmounted.
systemd-shutdown[1]: Deactivating swaps.
systemd-shutdown[1]: All swaps deactivated.
systemd-shutdown[1]: Detaching loop devices.
systemd-shutdown[1]: All loop devices detached.
systemd-shutdown[1]: Detaching DM devices.
systemd-shutdown[1]: All DM devices detached.
systemd-shutdown[1]: Successfully changed into root pivot.
systemd-shutdown[1]: Returning to initrd...
shutdown: reboot --log-level 6 --log-target kmsg --log-color
+ awk /oldroot|mnt/ { print $2 }
+ sort -r
+ umount /oldroot/sys/kernel/debug
+ umount /oldroot/sys/kernel/config
+ umount /oldroot/sys/fs/cgroup/systemd
+ umount /oldroot/sys/fs/cgroup
+ umount /oldroot/sys
+ umount /oldroot/proc
+ umount /oldroot/dev/shm
+ umount /oldroot/dev/pts
+ umount /oldroot/dev
+ umount /oldroot
------------[ cut here ]------------
Kernel BUG at c005ec1c [verbose debug info unavailable]
Internal error: Oops - BUG: 0 [#1] ARM
Modules linked in:
CPU: 0 PID: 2097 Comm: umount Tainted: G B 4.3.6-openbmc-20160222-1 #1
Hardware name: ASpeed SoC
task: ce88a3c0 ti: cde50000 task.ti: cde50000
PC is at __delete_from_page_cache+0x204/0x2a8
LR is at __delete_from_page_cache+0x11c/0x2a8
pc : [<c005ec1c>] lr : [<c005eb34>] psr: 60000093
sp : cde51df0 ip : c0527400 fp : 00000000
r10: 00000000 r9 : cf48b878 r8 : 00000003
r7 : 00000000 r6 : 00000000 r5 : cf48b874 r4 : cffad640
r3 : 00000000 r2 : 000002bc r1 : c053c334 r0 : 0000592f
Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment none
Control: 0005317f Table: 4de9c000 DAC: 00000051
Process umount (pid: 2097, stack limit = 0xcde50190)
Stack: (0xcde51df0 to 0xcde52000)
1de0: 00000000 00000000 cf48a9b0 cf48a9c8
1e00: cde51ea8 cffad640 00000000 00000013 cf48b874 00000000 00000000 00000000
1e20: cffad640 c005ecfc cffad640 cf48b874 00000000 c0069430 00001000 00000000
1e40: 00000000 00000fff 00000000 00000000 ffffffff 00000000 cf48b874 c00695d4
1e60: cde51e68 00000000 00000000 00000001 00000002 00000003 00000004 00000005
1e80: 00000006 00000007 00000008 00000009 0000000a 0000000b 0000000c 0000000d
1ea0: 0000000e 00000000 cffad640 cffad6a0 cffad6c0 cffad6e0 cffad700 cffad720
1ec0: cffad740 cffad760 cffad780 cffad7a0 cffad7c0 cffad7e0 cffad800 cffad820
1ee0: c000a424 ffffffff ffffffff cde50000 cdcb8800 c000a424 cde50000 00000000
1f00: bef19ebc c0069998 ffffffff ffffffff 00000000 cf48b7b8 c037d180 c00a1060
1f20: cde50000 cde51f38 cec760b8 c00a1128 cdcb89fc c00a1d64 cf48b6dc cec7613c
1f40: 00000000 cdcb8800 c037d180 c053062c 00000034 c008e5d0 00000000 cf4039e0
1f60: 00000081 c008e830 cdcb8800 c050e548 c053062c c008e930 ffff0001 cdc8e9e0
1f80: cdc8463c c00a42d4 ce88a3c0 c002a97c c000a424 cde50000 cde51fb0 c000cba8
1fa0: 000bc1a0 000bc290 000bc1c0 c000a2d0 00000000 00000000 00000000 000bc1a0
1fc0: 000bc1a0 000bc290 000bc1c0 00000034 00000000 000bc2c0 00000000 bef19ebc
1fe0: 000aa3a8 bef19cc4 0004ebc4 4e0b407c 60000010 000bc1c0 00000000 00000000
[<c005ec1c>] (__delete_from_page_cache) from [<c005ecfc>] (delete_from_page_cache+0x3c/0x5c)
[<c005ecfc>] (delete_from_page_cache) from [<c0069430>] (truncate_inode_page+0x98/0xa4)
[<c0069430>] (truncate_inode_page) from [<c00695d4>] (truncate_inode_pages_range+0x168/0x518)
[<c00695d4>] (truncate_inode_pages_range) from [<c0069998>] (truncate_inode_pages+0x14/0x1c)
[<c0069998>] (truncate_inode_pages) from [<c00a1060>] (evict+0x90/0x128)
[<c00a1060>] (evict) from [<c00a1128>] (dispose_list+0x30/0x3c)
[<c00a1128>] (dispose_list) from [<c00a1d64>] (evict_inodes+0xb4/0xbc)
[<c00a1d64>] (evict_inodes) from [<c008e5d0>] (generic_shutdown_super+0x40/0xc0)
[<c008e5d0>] (generic_shutdown_super) from [<c008e830>] (kill_block_super+0x18/0x68)
[<c008e830>] (kill_block_super) from [<c008e930>] (deactivate_locked_super+0x44/0x74)
[<c008e930>] (deactivate_locked_super) from [<c00a42d4>] (cleanup_mnt+0x4c/0x70)
[<c00a42d4>] (cleanup_mnt) from [<c002a97c>] (task_work_run+0x6c/0x80)
[<c002a97c>] (task_work_run) from [<c000cba8>] (do_work_pending+0xa4/0xc0)
[<c000cba8>] (do_work_pending) from [<c000a2d0>] (slow_work_pending+0xc/0x20)
Code: e121f002 e594300c e3530000 ba000000 (e7f001f2)
---[ end trace c053299951a20b73 ]---
Segmentation fault
+ umount /mnt
+ set +x
update: reboot --log-level 6 --log-target kmsg --log-color
Updating bmc...
Erasing block: 8192/8192 (100%)
Writing kb: 32768/32768 (100%)
Verifying kb: 32768/32768 (100%)
Remaining mounts:
tmpfs / tmpfs rw,nosuid,nodev,mode=755 0 0
dev /dev devtmpfs rw,relatime,size=126380k,nr_inodes=31595,mode=755 0 0
proc /proc proc rw,relatime 0 0
sys /sys sysfs rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
I think we found some important observations:
It's mos likely caused by race condition between uboot and kernel like below:
The serial port can't be accessed this moment. I'll figure out a command to stop the activity on MAC before running "bootcmd" command from uboot side after serial port becomes available again.
I added some code to dump the corrupted slab cache and got below dump for "kernfs_node_cache". Obviously, the data here is exactly a ARP request whose source IP address is 9.3.40.155
Corrupted CPU list detected on [kernfs_node_cache]
ff ff ff ff ff ff 5c f3 fc 5f 3f 60 08 06 00 01
08 00 06 04 00 01 5c f3 fc 5f 3f 60 09 03 28 9b
00 00 00 00 00 00 09 03 28 96 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 c1 98 e5 ab
c8 05 44 c0 24 71 80 cf 84 7f 80 cf 01 00 00 00
50 00 00 00 08 00 00 00 52 12 00 00 b9 12 00 00
52 12 00 00 7c 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ea 12 00 00 78 01 00 00 12 02 00 00 06 00 00 00
08 00 00 00 40 1e 80 cf
gwshan@gwshan:~$ ping 9.3.40.155
PING 9.3.40.155 (9.3.40.155) 56(84) bytes of data.
64 bytes from 9.3.40.155: icmp_seq=1 ttl=50 time=192 ms
64 bytes from 9.3.40.155: icmp_seq=2 ttl=50 time=193 ms
Writing 0x0 to MACCR (offset: 0x50) should stop the MAC from receiving any frames. It's just a workaround. The final solution is to change uboot to do that before loading and booting kernel image. I'm not sure who is responsible for that? If nobody is in charge of that, I can do that by providing more info:
(A) The source code of uboot (B) Command to flush uboot image
(A) We have our own fork of uboot: https://github.com/openbmc/u-boot/tree/v2013.07-aspeed-openbmc
So can we use the "workaround" until the final fix? If so, can you do a pull request for that?
These have been committed to our branch at https://github.com/openbmc/u-boot/tree/v2013.07-aspeed-openbmc as of https://github.com/openbmc/u-boot/commit/fecb84ab7f3bddc4039b5de31f57440e1cfff650
Reported by Doug: