openbmc / linux

OpenBMC Linux kernel source tree
Other
49 stars 132 forks source link

Kernel OOPS at boot and shutdown #53

Closed shenki closed 8 years ago

shenki commented 8 years ago

Reported by Doug:

There are some problems getting to boot successfully, and I copied some console data to the attached file. 2 out of 6 boot attempts made it to the login prompt, and power was cycled between each attempt.

[  OK  ] Started Create Volatile Files and Directories.
[  OK  ] Started udev Kernel Device Manager.
         Starting Network Time Synchronization...
         Starting Update UTMP about System Boot/Shutdown...
Unable to handle kernel paging request at virtual address ffffffff
pgd = cfb20000
[ffffffff] *pgd=4fffd871, *pte=00000000, *ppte=00000000
Internal error: Oops: 1 [#1] ARM
Modules linked in:
CPU: 0 PID: 512 Comm: kworker/0:2 Not tainted 4.3.5-openbmc-20160212-1 #1
Hardware name: ASpeed SoC
Workqueue: events cache_reap
task: cfa3ac00 ti: ce444000 task.ti: ce444000
PC is at drain_array+0x18/0xec
LR is at cache_reap+0x54/0x114
pc : [<c0087d8c>]    lr : [<c0088060>]    psr: a0000013
sp : ce445ec8  ip : 00000000  fp : 00000008
r10: c05084d8  r9 : c050b7a0  r8 : c0509f00
r7 : ce445ed0  r6 : c050b51c  r5 : cf9c9400  r4 : ffffffff
r3 : 00000000  r2 : ffffffff  r1 : cf9c9400  r0 : cf856b20
[  OK  ] Started Create Static Device Nodes in /dev.
systemd-journald[522]: Received request to flush runtime journal from PID 1
[  OK  ] Reached target Local File Systems (Pre).
         Mounting /var/volatile...
         Starting udev Kernel Device Manager...
[  OK  ] Started Flush Journal to Persistent Storage.
[  OK  ] Mounted /var/volatile.
[  OK  ] Started udev Coldplug all Devices.
[  OK  ] Reached target Local File Systems.
Unable to handle kernel paging request at virtual address ffffffff
pgd = cdcd0000
[ffffffff] *pgd=4fffd871, *pte=00000000, *ppte=00000000
Internal error: Oops: 1 [#1] ARM
Modules linked in:
CPU: 0 PID: 1 Comm: systemd Not tainted 4.3.5-openbmc-20160212-1 #1
Hardware name: ASpeed SoC
task: cf810aa0 ti: cf826000 task.ti: cf826000
PC is at kfree+0x44/0xa8
LR is at remove_proc_subtree+0xc8/0xd4
pc : [<c0088164>]    lr : [<c00d4900>]    psr: 40000093
sp : cf827e90  ip : cf827e98  fp : be969770
r10: cdca12c8  r9 : cdc8e630  r8 : 00000000
r7 : cf800100  r6 : a0000013  r5 : ce45f580  r4 : ffffffff
r3 : cffc1be0  r2 : 00000080  r1 : cfdf9000  r0 : ce45f580
Flags: nZcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0005317f  Table: 4dcd0000  DAC: 00000051
Process systemd (pid: 1, stack limit = 0xcf826190)
From Shutdown:

+ umount /oldroot/dev/pts
+ umount /oldroot/dev
+ umount /oldroot
------------[ cut here ]------------
Kernel BUG at c005ec10 [verbose debug info unavailable]
Internal error: Oops - BUG: 0 [#1] ARM
Modules linked in:
CPU: 0 PID: 1095 Comm: umount Tainted: G    B           4.3.5-openbmc-20160212-1 #1
Hardware name: ASpeed SoC
task: cf994860 ti: cde32000 task.ti: cde32000
PC is at __delete_from_page_cache+0x204/0x2a8
LR is at __delete_from_page_cache+0x11c/0x2a8
pc : [<c005ec10>]    lr : [<c005eb28>]    psr: 60000093
sp : cde33df0  ip : c0525400  fp : 00000000
r10: 00000001  r9 : cf4852d8  r8 : 00000003
r7 : 00000000  r6 : 00000000  r5 : cf4852d4  r4 : cffadb20
r3 : 00000000  r2 : 000002bc  r1 : c053a334  r0 : 00003602
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0005317f  Table: 4de14000  DAC: 00000051
Process umount (pid: 1095, stack limit = 0xcde32190)
Stack: (0xcde33df0 to 0xcde34000)
3de0:                                     00000000 0000000a cf48cd40 cf48cd5c
3e00: cde33ea8 cffadb20 00000000 00000013 cf4852d4 00000001 00000000 00000001
3e20: cffadb20 c005ecf0 cffadb20 cf4852d4 00000001 c0069428 00001000 00000000
3e40: 00000000 cffadba0 00000000 00000000 ffffffff 00000000 cf4852d4 c00695cc
3e60: cde33e68 00000000 00000000 00000001 00000002 00000003 00000004 00000005
3e80: 00000006 00000007 00000008 00000009 0000000a 0000000b 0000000c 0000000d
3ea0: 0000000a 00000000 cffadba0 cffadb20 cffadbe0 cffadc00 cffadc20 cffadc40
3ec0: cffadc60 cffadc80 cffadca0 cffadcc0 cffad8c0 cffad8a0 cffad880 cffad860
3ee0: c000a424 ffffffff ffffffff cde32000 cdcf6800 c000a424 cde32000 00000000
3f00: bedaaecc c0069990 ffffffff ffffffff 00000000 cf485218 c037d180 c00a1038
3f20: cde32000 cde33f38 cf7b5a38 c00a1100 cdcf69fc c00a1d3c cf48513c cf7b5abc
3f40: 00000000 cdcf6800 c037d180 c052e62c 00000034 c008e5b0 00000000 cf403b80
3f60: 00000081 c008e810 cdcf6800 c050c548 c052e62c c008e910 ffff0001 cdcb0c20
3f80: cfae387c c00a42ac cf994860 c002a97c c000a424 cde32000 cde33fb0 c000cba8
3fa0: 000bc1a0 000bc290 000bc1c0 c000a2d0 00000000 00000000 00000000 000bc1a0
3fc0: 000bc1a0 000bc290 000bc1c0 00000034 00000000 000bc2c0 00000000 bedaaecc
3fe0: 000aa3a8 bedaacd4 0004ebc4 4f77407c 60000010 000bc1c0 00000000 00000000
[<c005ec10>] (__delete_from_page_cache) from [<c005ecf0>] (delete_from_page_cache+0x3c/0x5c)
[<c005ecf0>] (delete_from_page_cache) from [<c0069428>] (truncate_inode_page+0x98/0xa4)
[<c0069428>] (truncate_inode_page) from [<c00695cc>] (truncate_inode_pages_range+0x168/0x518)
[<c00695cc>] (truncate_inode_pages_range) from [<c0069990>] (truncate_inode_pages+0x14/0x1c)
[<c0069990>] (truncate_inode_pages) from [<c00a1038>] (evict+0x90/0x128)
[<c00a1038>] (evict) from [<c00a1100>] (dispose_list+0x30/0x3c)
[<c00a1100>] (dispose_list) from [<c00a1d3c>] (evict_inodes+0xb4/0xbc)
[<c00a1d3c>] (evict_inodes) from [<c008e5b0>] (generic_shutdown_super+0x40/0xc0)
[<c008e5b0>] (generic_shutdown_super) from [<c008e810>] (kill_block_super+0x18/0x68)
[<c008e810>] (kill_block_super) from [<c008e910>] (deactivate_locked_super+0x44/0x74)
[<c008e910>] (deactivate_locked_super) from [<c00a42ac>] (cleanup_mnt+0x4c/0x70)
[<c00a42ac>] (cleanup_mnt) from [<c002a97c>] (task_work_run+0x6c/0x80)
[<c002a97c>] (task_work_run) from [<c000cba8>] (do_work_pending+0xa4/0xc0)
[<c000cba8>] (do_work_pending) from [<c000a2d0>] (slow_work_pending+0xc/0x20)
Code: e121f002 e594300c e3530000 ba000000 (e7f001f2) 
---[ end trace 85608810d6a88310 ]---
Segmentation fault
[  OK  ] Reached target Local File Systems.
         Starting Create Volatile Files and Directories...
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started udev Kernel Device Manager.
[  OK  ] Started Create Volatile Files and Directories.
         Starting Update UTMP about System Boot/Shutdown...
         Starting Network Time Synchronization...
Unable to handle kernel paging request at virtual address ffffffff
pgd = ce4b0000
[ffffffff] *pgd=4fffd871, *pte=00000000, *ppte=00000000
Internal error: Oops: 1 [#1] ARM
Modules linked in:
CPU: 0 PID: 515 Comm: kworker/0:2 Not tainted 4.3.5-openbmc-20160212-1 #1
Hardware name: ASpeed SoC
Workqueue: events cache_reap
task: cf99f1e0 ti: ce462000 task.ti: ce462000
PC is at drain_array+0x18/0xec
LR is at cache_reap+0x54/0x114
pc : [<c0087d8c>]    lr : [<c0088060>]    psr: a0000013
sp : ce463ec8  ip : cffb2934  fp : 00000008
r10: c05084d8  r9 : c050b7a0  r8 : c0509f00
r7 : ce463ed0  r6 : c050b51c  r5 : cf801280  r4 : ffffffff
r3 : 00000000  r2 : ffffffff  r1 : cf801280  r0 : cf800340
Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0005317f  Table: 4e4b0000  DAC: 00000053
Process kworker/0:2 (pid: 515, stack limit = 0xce462190)
Stack: (0xce463ec8 to 0xce464000)
3ec0:                   cf800340 cf801280 ce463ed0 ce463ed0 cf800340 cf801280
3ee0: c050b51c 00000000 c0509f00 c0088060 00000000 00000000 c05084a4 cfa036a0
3f00: c050b7a0 c0508494 00000000 cfdebb00 00000000 c0027be8 cfa036a0 c050b7a0
3f20: cf9f1500 cfa036a0 c0508494 cfa036b8 ce462000 c05084a4 c0508494 c05084d8
3f40: 00000008 c0028280 00000000 ce43dfe0 00000000 cfa036a0 c0027fd0 00000000
3f60: 00000000 00000000 00000000 c002bf74 00000000 00000000 000001f6 cfa036a0
3f80: 00000000 ce463f84 ce463f84 00000000 ce463f90 ce463f90 ce463fac ce43dfe0
3fa0: c002beb0 00000000 00000000 c000a330 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000 6af75684 6868d53b
[<c0087d8c>] (drain_array) from [<c0088060>] (cache_reap+0x54/0x114)
[<c0088060>] (cache_reap) from [<c0027be8>] (process_one_work+0x1bc/0x2f8)
[<c0027be8>] (process_one_work) from [<c0028280>] (worker_thread+0x2b0/0x3ec)
[<c0028280>] (worker_thread) from [<c002bf74>] (kthread+0xc4/0xd8)
[<c002bf74>] (kthread) from [<c000a330>] (ret_from_fork+0x14/0x24)
Code: e28d7008 e58d7008 e58d700c 0a000032 (e5942000) 
---[ end trace 0a57a7db8754ec3b ]---
Unable to handle kernel paging request at virtual address fffffff0
pgd = c0004000
[fffffff0] *pgd=4fffd871, *pte=00000000, *ppte=00000000
Internal error: Oops: 37 [#2] ARM
Modules linked in:
CPU: 0 PID: 515 Comm: kworker/0:2 Tainted: G      D         4.3.5-openbmc-20160212-1 #1
Hardware name: ASpeed SoC
task: cf99f1e0 ti: ce462000 task.ti: ce462000
PC is at kthread_data+0x4/0xc
LR is at wq_worker_sleeping+0xc/0xb8
pc : [<c002c2f0>]    lr : [<c002854c>]    psr: 20000093
sp : ce463c48  ip : c0374b14  fp : ce463c74
r10: 00000001  r9 : cf99f380  r8 : cf99f404
r7 : 00000000  r6 : c0508da0  r5 : ce463adc  r4 : 00000000
r3 : 00000000  r2 : ffffffff  r1 : 00000000  r0 : cf99f1e0
Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0005317f  Table: 4e4b0000  DAC: 00000051
Process kworker/0:2 (pid: 515, stack limit = 0xce462190)
Freeing unused kernel memory: 176K (c04d6000 - c0502000)
rofs = mtd5 squashfs rwfs = mtd6 ext4
/dev/mtdblock6 was not cleanly unmounted, check forced.
/dev/mtdblock6: 124/1024 files (0.0% non-contiguous), 215/1024 blocks
EXT4-fs (mtdblock6): mounted filesystem without journal. Opts: (null)
systemd[1]: Failed to insert module 'ipv6': Function not implemented
systemd[1]: Failed to insert module 'kdbus': Function not implemented
random: systemd urandom read with 20 bits of entropy available
systemd[1]: systemd 225 running in system mode. (-PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP -GCRYPT -GNUTLS +ACL -XZ -LZ4 -SECCOMP +BLKID -ELFUTILS +KMOD -IDN)
systemd[1]: Detected architecture arm.

Welcome to Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) 0.1.0 (master)!

...............

[  OK  ] Reached target Local File Systems.
         Starting Create Volatile Files and Directories...
[  OK  ] Started Load/Save Random Seed.
Unable to handle kernel paging request at virtual address ffffffff
pgd = cdcd4000
[ffffffff] *pgd=4fffd871, *pte=00000000, *ppte=00000000
Internal error: Oops: 1 [#1] ARM
Modules linked in:
CPU: 0 PID: 1 Comm: systemd Not tainted 4.3.5-openbmc-20160212-1 #1
Hardware name: ASpeed SoC
task: cf810aa0 ti: cf826000 task.ti: cf826000
PC is at kmem_cache_free+0x54/0xb8
LR is at kernfs_put+0xf4/0x188
pc : [<c0087c5c>]    lr : [<c00dacc4>]    psr: a0000093
sp : cf827e90  ip : 00000000  fp : cf827f64
r10: 7f754f00  r9 : c0525b60  r8 : c0525b60
r7 : a0000013  r6 : ce47b088  r5 : ffffffff  r4 : cf800a00
r3 : a0000093  r2 : 00000080  r1 : cfdf9000  r0 : cf800a00
Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0005317f  Table: 4dcd4000  DAC: 00000051
Process systemd (pid: 1, stack limit = 0xcf826190)
Stack: (0xcf827e90 to 0xcf828000)
7e80:                                     c053c9cc ce47b088 ce47b178 cdcb7ec8
7ea0: cdcb7ec0 c00dacc4 ce47b088 ce47b1a4 a0000013 ce47b088 ce47b178 00000000
7ec0: c0525b60 cf827ee4 ffffff9c 7f754f00 cf827f64 c00db478 c009e33c cf827f2c
7ee0: 0000004c 00000000 00000000 c009da50 c050bed8 ce47b178 ce4f4524 00000000
7f00: cf827f70 c00dbd30 00000000 ce4f44a0 00000000 c005a0dc c050ae4c ce47b178
7f20: cf4e07f0 c005a114 c005a0fc c050ae4c ce47b178 c00db844 c00db7f8 cf4e56e8
7f40: cf4e56e8 c009843c 00000000 cf4e56e8 cfb0b000 c0098d2c cf827f70 cf827f64
7f60: 40000010 00000000 cfaf26f0 cf4e3b28 b8d446ca 0000001e cfb0b034 ce456000
7f80: c000a424 00000000 00000000 b6f47f10 00000028 c000a424 cf826000 00000000
7fa0: 7f6bacbc c000a280 00000000 00000000 7f754f00 00000000 b6e13ba0 00000000
7fc0: 00000000 00000000 b6f47f10 00000028 b6f4053c 7f750f18 7f722ee0 7f6bacbc
7fe0: 7f700d4c bed08734 7f6181d0 b6d9999c 20000010 7f754f00 f4ad8c1b 8b9d0ac4
[<c0087c5c>] (kmem_cache_free) from [<c00dacc4>] (kernfs_put+0xf4/0x188)
[<c00dacc4>] (kernfs_put) from [<c00db478>] (__kernfs_remove+0x200/0x22c)
[<c00db478>] (__kernfs_remove) from [<c00dbd30>] (kernfs_remove+0x1c/0x2c)
[<c00dbd30>] (kernfs_remove) from [<c005a0dc>] (cgroup_destroy_locked+0x54/0x74)
[<c005a0dc>] (cgroup_destroy_locked) from [<c005a114>] (cgroup_rmdir+0x18/0x34)
[<c005a114>] (cgroup_rmdir) from [<c00db844>] (kernfs_iop_rmdir+0x4c/0x70)
[<c00db844>] (kernfs_iop_rmdir) from [<c009843c>] (vfs_rmdir+0x70/0xfc)
[<c009843c>] (vfs_rmdir) from [<c0098d2c>] (do_rmdir+0xd4/0x124)
[<c0098d2c>] (do_rmdir) from [<c000a280>] (ret_fast_syscall+0x0/0x38)
Code: e10f7000 e3873080 e121f003 e5945000 (e895000c) 
---[ end trace 9e1fd1527d8f4e91 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
shenki commented 8 years ago
14:08 < NormJ> that issue only occurs on a EVT barreleye.  doesn't happen ever 
               on DVT (newer) systems
14:08 < miltonm> NormJ one EVT or all ?
14:09 < NormJ> all the ones doug tried, but not all.  i have a an EVT barreleye 
               that doesn't have issue.
shenki commented 8 years ago
Unable to handle kernel paging request at virtual address 01b50325
pgd = ce4a4000
[01b50325] *pgd=00000000
Internal error: Oops: 1 [#1] ARM
Modules linked in:
CPU: 0 PID: 1103 Comm: umount Not tainted 4.3.5-openbmc-20160212-1 #1
Hardware name: ASpeed SoC
task: cfa3a180 ti: cfb3e000 task.ti: cfb3e000
PC is at kfree+0x3c/0xa8
LR is at squashfs_cache_delete+0x60/0x9c
pc : [<c008815c>]    lr : [<c01236e4>]    psr: 00000093
sp : cfb3ff18  ip : 00000010  fp : beeb0ebc
r10: 00000000  r9 : cfb3e000  r8 : 00000038
r7 : 00000004  r6 : a0000013  r5 : cf8db000  r4 : cfadad40
r3 : 01b50309  r2 : ffffffff  r1 : cfdf9000  r0 : cf8db000
Flags: nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0005317f  Table: 4e4a4000  DAC: 00000051
Process umount (pid: 1103, stack limit = 0xcfb3e190)
Stack: (0xcfb3ff18 to 0xcfb40000)
ff00:                                                       cdcafa00 cfadad40
ff20: 00000000 00000000 00000004 c01236e4 cdcaf900 cdcf4800 c052e62c 00000034
ff40: c000a424 c0125dc8 c0125d9c cdcf4800 c037d180 c008e5dc 00000000 cf403b80
ff60: 00000081 c008e810 cdcf4800 c050c548 c052e62c c008e910 ffff0001 cdcb0c20
ff80: cfae387c c00a42ac cfa3a180 c002a97c c000a424 cfb3e000 cfb3ffb0 c000cba8
ffa0: 000bc1a0 000bc290 000bc1c0 c000a2d0 00000000 00000000 00000000 000bc1a0
ffc0: 000bc1a0 000bc290 000bc1c0 00000034 00000000 000bc2c0 00000000 beeb0ebc
ffe0: 000aa3a8 beeb0cc4 0004ebc4 4f77407c 60000010 000bc1c0 cfbe4a40 00000000
[<c008815c>] (kfree) from [<c01236e4>] (squashfs_cache_delete+0x60/0x9c)
[<c01236e4>] (squashfs_cache_delete) from [<c0125dc8>] (squashfs_put_super+0x2c/0x70)
[<c0125dc8>] (squashfs_put_super) from [<c008e5dc>] (generic_shutdown_super+0x6c/0xc0)
[<c008e5dc>] (generic_shutdown_super) from [<c008e810>] (kill_block_super+0x18/0x68)
[<c008e810>] (kill_block_super) from [<c008e910>] (deactivate_locked_super+0x44/0x74)
[<c008e910>] (deactivate_locked_super) from [<c00a42ac>] (cleanup_mnt+0x4c/0x70)
[<c00a42ac>] (cleanup_mnt) from [<c002a97c>] (task_work_run+0x6c/0x80)
[<c002a97c>] (task_work_run) from [<c000cba8>] (do_work_pending+0xa4/0xc0)
[<c000cba8>] (do_work_pending) from [<c000a2d0>] (slow_work_pending+0xc/0x20)
Code: e0813282 e7912282 e3120902 1593301c (e593701c) 
---[ end trace 293ad3070fe69835 ]---
nkskjames commented 8 years ago

This happens on occasionally on my EVT system with latest kernel including jffs2. Looks similar to fail #4 above:

root@barreleye:~# reboot
[  OK  ] Stopped target System Time Synchronized.
[  OK  ] Stopped Dispatch Password Requests to Console Directory Watch.
[  OK  ] Stopped target Swap.
         Stopping SSH Per-Connection Server (9.3.62.94:40890)...
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped target Network.
[  OK  ] Stopped Apply Kernel Variables.
         Stopping Load/Save Random Seed...
         Stopping Network Time Synchronization...
         Stopping SSH Per-Connection Server (9.123.229.28:49692)...
[  OK  ] Stopped target Local File Systems.
         Unmounting /var/volatile...
BUG: Bad page map in process systemd-timesyn  pte:4da32100 pmd:4fb2b831
addr:b6e00000 vm_flags:00000075 anon_vma:  (null) mapping:cf48b874 index:0
file:libpthread-2.22.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:squashfs_readpage
CPU: 0 PID: 850 Comm: systemd-timesyn Not tainted 4.3.6-openbmc-20160222-1 #1
Hardware name: ASpeed SoC
[<c000f28c>] (unwind_backtrace) from [<c000cf0c>] (show_stack+0x10/0x14)
[<c000cf0c>] (show_stack) from [<c00787c8>] (print_bad_pte+0x158/0x190)
[<c00787c8>] (print_bad_pte) from [<c0079cec>] (unmap_single_vma+0x4ec/0x500)
[<c0079cec>] (unmap_single_vma) from [<c007a438>] (unmap_vmas+0x44/0x54)
[<c007a438>] (unmap_vmas) from [<c007e97c>] (exit_mmap+0xc8/0x1dc)
[<c007e97c>] (exit_mmap) from [<c0014770>] (mmput+0x38/0xb8)
[<c0014770>] (mmput) from [<c0017380>] (do_exit+0x2f0/0x7b0)
[<c0017380>] (do_exit) from [<c0018654>] (do_group_exit+0x4c/0xb8)
[<c0018654>] (do_group_exit) from [<c00186d0>] (__wake_up_parent+0x0/0x18)
Disabling lock debugging due to kernel taint
         Unmounting /run/initramfs/rw...
         Stopping SSH Per-Connection Server (9.109.165.198:54558)...
[  OK  ] Removed slice system-getty.slice.
[  OK  ] Stopped target Remote File Systems.
[  OK  ] Stopped target Multi-User System.
         Stopping Phosphor OpenBMC BT to DBUS...
         Stopping D-Bus System Message Bus...
         Stopping Temp placeholder for skeleton function...
         Stopping DBUS introspecting REST server....
         Stopping Phosphor OpenBMC event management daemon...
         Stopping System Logging Service...
[  OK  ] Stopped target Login Prompts.
         Stopping Serial Getty on ttyS4...
         Stopping Phosphor OpenBMC user management daemon...
         Stopping Login Service...
         Stopping Phosphor OpenBMC DBus REST daemon...
[  OK  ] Stopped Forward Password Requests to Wall Directory Watch.
         Stopping SSH Per-Connection Server (9.109.165.198:54737)...
         Unmounting /run/initramfs/ro...
         Stopping Kernel Logging Service...
[  OK  ] Stopped Login Service.
[  OK  ] Stopped DBUS introspecting REST server..
[  OK  ] Stopped Phosphor OpenBMC BT to DBUS.
[  OK  ] Stopped Phosphor OpenBMC event management daemon.
[  OK  ] Stopped Kernel Logging Service.
BUG: Bad rss-counter state mm:ce491640 idx:0 val:1
BUG: Bad rss-counter state mm:ce491640 idx:2 val:-1
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Stopped System Logging Service.
[  OK  ] Stopped Network Time Synchronization.
[  OK  ] Stopped Phosphor OpenBMC user management daemon.
[  OK  ] Stopped Serial Getty on ttyS4.
[  OK  ] Stopped SSH Per-Connection Server (9.3.62.94:40890).
[  OK  ] Stopped SSH Per-Connection Server (9.123.229.28:49692).
[  OK  ] Stopped Temp placeholder for skeleton function.
[  OK  ] Stopped SSH Per-Connection Server (9.109.165.198:54558).
[  OK  ] Stopped SSH Per-Connection Server (9.109.165.198:54737).
[  OK  ] Stopped Load/Save Random Seed.
[  OK  ] Unmounted /var/volatile.
[  OK  ] Unmounted /run/initramfs/rw.
[  OK  ] Unmounted /run/initramfs/ro.
[  OK  ] Stopped Phosphor OpenBMC DBus REST daemon.
[  OK  ] Stopped Phosphor OpenBMC DBus service management daemon.
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Removed slice system-dropbear.slice.
[  OK  ] Closed dropbear.socket.
[  OK  ] Removed slice system-serial\x2dgetty.slice.
[  OK  ] Closed Syslog Socket.
[  OK  ] Removed slice User and Session Slice.
[  OK  ] Reached target Shutdown.
systemd-shutdown[1]: Sending SIGTERM to remaining processes...
systemd-journald[520]: Received SIGTERM from PID 1 (systemd-shutdow).
systemd-shutdown[1]: Sending SIGKILL to remaining processes...
systemd-shutdown[1]: Sending SIGKILL to PID 1321 (sh).
systemd-shutdown[1]: Hardware watchdog 'aspeed_wdt', version 0
systemd-shutdown[1]: Unmounting file systems.
systemd-shutdown[1]: Unmounting /tmp.
systemd-shutdown[1]: All filesystems unmounted.
systemd-shutdown[1]: Deactivating swaps.
systemd-shutdown[1]: All swaps deactivated.
systemd-shutdown[1]: Detaching loop devices.
systemd-shutdown[1]: All loop devices detached.
systemd-shutdown[1]: Detaching DM devices.
systemd-shutdown[1]: All DM devices detached.
systemd-shutdown[1]: Successfully changed into root pivot.
systemd-shutdown[1]: Returning to initrd...
shutdown: reboot --log-level 6 --log-target kmsg --log-color
+ awk /oldroot|mnt/ { print $2 }
+ sort -r
+ umount /oldroot/sys/kernel/debug
+ umount /oldroot/sys/kernel/config
+ umount /oldroot/sys/fs/cgroup/systemd
+ umount /oldroot/sys/fs/cgroup
+ umount /oldroot/sys
+ umount /oldroot/proc
+ umount /oldroot/dev/shm
+ umount /oldroot/dev/pts
+ umount /oldroot/dev
+ umount /oldroot
------------[ cut here ]------------
Kernel BUG at c005ec1c [verbose debug info unavailable]
Internal error: Oops - BUG: 0 [#1] ARM
Modules linked in:
CPU: 0 PID: 2097 Comm: umount Tainted: G    B           4.3.6-openbmc-20160222-1 #1
Hardware name: ASpeed SoC
task: ce88a3c0 ti: cde50000 task.ti: cde50000
PC is at __delete_from_page_cache+0x204/0x2a8
LR is at __delete_from_page_cache+0x11c/0x2a8
pc : [<c005ec1c>]    lr : [<c005eb34>]    psr: 60000093
sp : cde51df0  ip : c0527400  fp : 00000000
r10: 00000000  r9 : cf48b878  r8 : 00000003
r7 : 00000000  r6 : 00000000  r5 : cf48b874  r4 : cffad640
r3 : 00000000  r2 : 000002bc  r1 : c053c334  r0 : 0000592f
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 0005317f  Table: 4de9c000  DAC: 00000051
Process umount (pid: 2097, stack limit = 0xcde50190)
Stack: (0xcde51df0 to 0xcde52000)
1de0:                                     00000000 00000000 cf48a9b0 cf48a9c8
1e00: cde51ea8 cffad640 00000000 00000013 cf48b874 00000000 00000000 00000000
1e20: cffad640 c005ecfc cffad640 cf48b874 00000000 c0069430 00001000 00000000
1e40: 00000000 00000fff 00000000 00000000 ffffffff 00000000 cf48b874 c00695d4
1e60: cde51e68 00000000 00000000 00000001 00000002 00000003 00000004 00000005
1e80: 00000006 00000007 00000008 00000009 0000000a 0000000b 0000000c 0000000d
1ea0: 0000000e 00000000 cffad640 cffad6a0 cffad6c0 cffad6e0 cffad700 cffad720
1ec0: cffad740 cffad760 cffad780 cffad7a0 cffad7c0 cffad7e0 cffad800 cffad820
1ee0: c000a424 ffffffff ffffffff cde50000 cdcb8800 c000a424 cde50000 00000000
1f00: bef19ebc c0069998 ffffffff ffffffff 00000000 cf48b7b8 c037d180 c00a1060
1f20: cde50000 cde51f38 cec760b8 c00a1128 cdcb89fc c00a1d64 cf48b6dc cec7613c
1f40: 00000000 cdcb8800 c037d180 c053062c 00000034 c008e5d0 00000000 cf4039e0
1f60: 00000081 c008e830 cdcb8800 c050e548 c053062c c008e930 ffff0001 cdc8e9e0
1f80: cdc8463c c00a42d4 ce88a3c0 c002a97c c000a424 cde50000 cde51fb0 c000cba8
1fa0: 000bc1a0 000bc290 000bc1c0 c000a2d0 00000000 00000000 00000000 000bc1a0
1fc0: 000bc1a0 000bc290 000bc1c0 00000034 00000000 000bc2c0 00000000 bef19ebc
1fe0: 000aa3a8 bef19cc4 0004ebc4 4e0b407c 60000010 000bc1c0 00000000 00000000
[<c005ec1c>] (__delete_from_page_cache) from [<c005ecfc>] (delete_from_page_cache+0x3c/0x5c)
[<c005ecfc>] (delete_from_page_cache) from [<c0069430>] (truncate_inode_page+0x98/0xa4)
[<c0069430>] (truncate_inode_page) from [<c00695d4>] (truncate_inode_pages_range+0x168/0x518)
[<c00695d4>] (truncate_inode_pages_range) from [<c0069998>] (truncate_inode_pages+0x14/0x1c)
[<c0069998>] (truncate_inode_pages) from [<c00a1060>] (evict+0x90/0x128)
[<c00a1060>] (evict) from [<c00a1128>] (dispose_list+0x30/0x3c)
[<c00a1128>] (dispose_list) from [<c00a1d64>] (evict_inodes+0xb4/0xbc)
[<c00a1d64>] (evict_inodes) from [<c008e5d0>] (generic_shutdown_super+0x40/0xc0)
[<c008e5d0>] (generic_shutdown_super) from [<c008e830>] (kill_block_super+0x18/0x68)
[<c008e830>] (kill_block_super) from [<c008e930>] (deactivate_locked_super+0x44/0x74)
[<c008e930>] (deactivate_locked_super) from [<c00a42d4>] (cleanup_mnt+0x4c/0x70)
[<c00a42d4>] (cleanup_mnt) from [<c002a97c>] (task_work_run+0x6c/0x80)
[<c002a97c>] (task_work_run) from [<c000cba8>] (do_work_pending+0xa4/0xc0)
[<c000cba8>] (do_work_pending) from [<c000a2d0>] (slow_work_pending+0xc/0x20)
Code: e121f002 e594300c e3530000 ba000000 (e7f001f2) 
---[ end trace c053299951a20b73 ]---
Segmentation fault
+ umount /mnt
+ set +x
update: reboot --log-level 6 --log-target kmsg --log-color
Updating bmc...
Erasing block: 8192/8192 (100%) 
Writing kb: 32768/32768 (100%) 
Verifying kb: 32768/32768 (100%) 
Remaining mounts:
tmpfs / tmpfs rw,nosuid,nodev,mode=755 0 0
dev /dev devtmpfs rw,relatime,size=126380k,nr_inodes=31595,mode=755 0 0
proc /proc proc rw,relatime 0 0
sys /sys sysfs rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
nkskjames commented 8 years ago

I think we found some important observations:

  1. The EVT system was plugged directly into lab switch. The DVT system was plugged into a local switch.
  2. If the BMC network is plugged in and we do a cold reboot, then seg fault is likely to occur.
  3. The seg faults does not occur on warm reboots (with jffs2 filesystem)
  4. If we wait at uboot prompt for around 10secs before letting kernel start, then seg fault does not occur
  5. If we unplug BMC network, the seg fault does not occur
  6. I went back to original uboot FB patches and seg fault still occurs
gwshan commented 8 years ago

It's mos likely caused by race condition between uboot and kernel like below:

  1. The uboot has enabled NCSI and there has BDs pointing to valid memory buffer.
  2. Kernel boots and lots of memory slabs (struct kmem_cache) are created. One of the slab cache ("kernfs_node_cache" as being seen in the following example) is using the memory block that uboot reserved to store ingress frames.
  3. Ingress ARP request received and stored to system memory by uboot. It's corrupting the slab cache "kernfs_node_cache".
  4. Kernel tries to allocate the slab object from percpu's hot list and run into crash because of the corrupted slab cache "kernfs_node_cache".

The serial port can't be accessed this moment. I'll figure out a command to stop the activity on MAC before running "bootcmd" command from uboot side after serial port becomes available again.

I added some code to dump the corrupted slab cache and got below dump for "kernfs_node_cache". Obviously, the data here is exactly a ARP request whose source IP address is 9.3.40.155

Corrupted CPU list detected on [kernfs_node_cache]
ff ff ff ff ff ff 5c f3 fc 5f 3f 60 08 06 00 01 
08 00 06 04 00 01 5c f3 fc 5f 3f 60 09 03 28 9b 
00 00 00 00 00 00 09 03 28 96 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 c1 98 e5 ab 
c8 05 44 c0 24 71 80 cf 84 7f 80 cf 01 00 00 00 
50 00 00 00 08 00 00 00 52 12 00 00 b9 12 00 00 
52 12 00 00 7c 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
ea 12 00 00 78 01 00 00 12 02 00 00 06 00 00 00 
08 00 00 00 40 1e 80 cf 
gwshan@gwshan:~$ ping 9.3.40.155
PING 9.3.40.155 (9.3.40.155) 56(84) bytes of data.
64 bytes from 9.3.40.155: icmp_seq=1 ttl=50 time=192 ms
64 bytes from 9.3.40.155: icmp_seq=2 ttl=50 time=193 ms
gwshan commented 8 years ago

Writing 0x0 to MACCR (offset: 0x50) should stop the MAC from receiving any frames. It's just a workaround. The final solution is to change uboot to do that before loading and booting kernel image. I'm not sure who is responsible for that? If nobody is in charge of that, I can do that by providing more info:

(A) The source code of uboot (B) Command to flush uboot image

nkskjames commented 8 years ago

(A) We have our own fork of uboot: https://github.com/openbmc/u-boot/tree/v2013.07-aspeed-openbmc

So can we use the "workaround" until the final fix? If so, can you do a pull request for that?

shenki commented 8 years ago

These have been committed to our branch at https://github.com/openbmc/u-boot/tree/v2013.07-aspeed-openbmc as of https://github.com/openbmc/u-boot/commit/fecb84ab7f3bddc4039b5de31f57440e1cfff650