liyi-ibm / linux

Linux kernel source tree

memory exception #8

Open liyi-ibm opened 5 years ago

liyi-ibm commented 5 years ago

We observed a kernel BUG triggered from mm.h, followed by NMI backtrace logs.

The kernel BUG comes from the following code:

    /*
     * Drop a ref, return true if the refcount fell to zero (the page has no
     * users)
     */
    static inline int put_page_testzero(struct page *page)
    {
        VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
        return page_ref_dec_and_test(page);
    }

Jul 14 13:23:46 tdw-9-10-129-180 kernel: page:c00a00080493c700 count:0
mapcount:0 mapping:          (null) index:0x0
Jul 14 13:23:46 tdw-9-10-129-180 kernel: flags: 0x83ffff000000000()
Jul 14 13:23:46 tdw-9-10-129-180 kernel: raw: 083ffff000000000
0000000000000000 0000000000000000 00000000ffffffff
Jul 14 13:23:46 tdw-9-10-129-180 kernel: raw: c00a000800599f20
c000203993d97c68 0000000000000000 0000000000000000
Jul 14 13:23:46 tdw-9-10-129-180 kernel: page dumped because:
VM_BUG_ON_PAGE(page_ref_count(page) == 0)
Jul 14 13:23:46 tdw-9-10-129-180 kernel: ------------[ cut here ]------------
Jul 14 13:23:46 tdw-9-10-129-180 kernel: kernel BUG at
./include/linux/mm.h:473!
Jul 14 13:23:46 tdw-9-10-129-180 kernel: Oops: Exception in kernel mode,
sig: 5 [#3]
Jul 14 13:23:46 tdw-9-10-129-180 kernel: LE SMP NR_CPUS=1024 NUMA PowerNV
Jul 14 13:23:46 tdw-9-10-129-180 kernel: Modules linked in: i2c_dev joydev
at24 ses enclosure ipmi_powernv ipmi_devintf opal_prd i2c_opal
ipmi_msghandler ofpart powernv_flash mtd nfsd auth_rpcgss oid_registry
nfs_acl lockd grace binfmt_misc sunrpc usb_storage ast i2c_algo_bit
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm
i2c_core ixgbe mpt3sas mdio ptp pps_core raid_class scsi_transport_sas
Jul 14 13:23:46 tdw-9-10-129-180 kernel: CPU: 94 PID: 51520 Comm: java
Tainted: G      D W       4.14.49-1 #1
Jul 14 13:23:46 tdw-9-10-129-180 kernel: task: c000202004f03f00 task.stack:
c000202c8fa0c000
Jul 14 13:23:46 tdw-9-10-129-180 kernel: NIP:  c000000000067d8c LR:
c000000000067d88 CTR: 000000003003e014
Jul 14 13:23:46 tdw-9-10-129-180 kernel: REGS: c000202c8fa0f550 TRAP: 0700  
Tainted: G      D W        (4.14.49-1)
Jul 14 13:23:46 tdw-9-10-129-180 kernel: MSR:  900000000282b033
<SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 42022222  XER: 20040000
Jul 14 13:23:46 tdw-9-10-129-180 kernel: CFAR: c0000000002c425c SOFTE: 1
GPR00: c000000000067d88 c000202c8fa0f7d0 c0000000013d3000 000000000000003e
GPR04: 0000000000000000 c000000000099f64 900000000280b033 0000000035c30058
GPR08: 0000000000000001 0000000000000007 0000000000000006 9000000002803003
GPR12: 0000000000002000 c000000007d20a00 c000200403231450 00000000ebe00000
GPR16: 0000000000040100 0000000000000000 e07fffffffffefff 0000000000000004
GPR20: c00a00080d633a80 0000000000000000 c00a000804b72080 c0002033dda82af8
GPR24: 8603ea8c352000c0 c000202c8fa0f8c0 c0002018e2940064 c0002018e2940000
GPR28: c0002033dda82af8 c00a00080046f000 c00000000157aef0 c00a00080493c700
Jul 14 13:23:46 tdw-9-10-129-180 kernel: NIP [c000000000067d8c]
pte_fragment_free+0xcc/0xd0
Jul 14 13:23:46 tdw-9-10-129-180 kernel: LR [c000000000067d88]
pte_fragment_free+0xc8/0xd0
Jul 14 13:23:46 tdw-9-10-129-180 kernel: Call Trace:
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0f7d0]
[c000000000067d88] pte_fragment_free+0xc8/0xd0 (unreliable)
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0f800]
[c000000000331358] zap_huge_pmd+0x148/0x510
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0f8a0]
[c0000000002cbc90] unmap_page_range+0xe80/0x10a0
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0f9e0]
[c0000000002cc3a4] unmap_vmas+0x84/0xf0
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0fa30]
[c0000000002da7d8] exit_mmap+0xf8/0x210
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0faf0]
[c0000000000f7ee8] mmput+0xb8/0x1f0
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0fb20]
[c000000000101ff8] do_exit+0x358/0xcd0
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0fbe0]
[c000000000102a44] do_group_exit+0x64/0x100
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0fc20]
[c000000000112360] get_signal+0x210/0x700
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0fd10]
[c00000000001c330] do_signal+0x80/0x2e0
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0fe00]
[c00000000001c724] do_notify_resume+0xd4/0x100
Jul 14 13:23:46 tdw-9-10-129-180 kernel: [c000202c8fa0fe30]
[c00000000000c044] ret_from_except_lite+0x70/0x74
Jul 14 13:23:46 tdw-9-10-129-180 kernel: Instruction dump:
Jul 14 13:23:46 tdw-9-10-129-180 kernel: 60420000 f89f0008 7fe3fb78 38800008
482464bd 60000000 4bffffbc 3c82ff91
Jul 14 13:23:46 tdw-9-10-129-180 kernel: 7fe3fb78 388406c8 4825c615 60000000
<0fe00000> 3c4c0137 3842b270 7c0802a6
Jul 14 13:23:46 tdw-9-10-129-180 kernel: ---[ end trace 7d5a3e9aae7be6c5 ]---

NMI back trace log:

Jul 14 14:31:06 tdw-9-10-129-180 kernel: NMI backtrace for cpu 87
Jul 14 14:31:06 tdw-9-10-129-180 kernel: CPU: 87 PID: 42216 Comm: java Tainted: G D W 4.14.49-1 #1
Jul 14 14:31:06 tdw-9-10-129-180 kernel: Call Trace:
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd06990] [c000000000acb91c] dump_stack+0xb0/0xf4 (unreliable)
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd069d0] [c000000000ad4a44] nmi_cpu_backtrace+0x1a4/0x210
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd06a60] [c000000000ad4c8c] nmi_trigger_cpumask_backtrace+0x1dc/0x220
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd06b00] [c00000000002e558] arch_trigger_cpumask_backtrace+0x28/0x40
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd06b20] [c00000000018a934] rcu_dump_cpu_stacks+0xfc/0x158
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd06b70] [c000000000189d78] rcu_check_callbacks+0x898/0xaa0
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd06ca0] [c0000000001952b4] update_process_times+0x44/0x90
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd06cd0] [c0000000001abecc] tick_sched_handle.isra.13+0x4c/0x80
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd06cf0] [c0000000001abf60] tick_sched_timer+0x60/0xc0
Jul 14 14:31:06 tdw-9-10-129-180 kernel: [c00000178cd06d30] [c000000000195eb8] __hrtimer_run_queues+0xf8/0x330

liyi-ibm commented 5 years ago

Fixed by patch: https://github.com/liyi-ibm/linux/commit/856c18cbbb4e6d7f3db4afe9d5ecd88064f8ae61

liyi-ibm commented 5 years ago

The above patch has not yet been applied to 4.14.49-6, so the bug is still reported occasionally, as shown below:

[Thu Jul 25 15:49:25 2019] page:c00a000803809bc0 count:0 mapcount:-127 mapping:          (null) index:0x0
[Thu Jul 25 15:49:25 2019] flags: 0x83ffff000000000()
[Thu Jul 25 15:49:25 2019] raw: 083ffff000000000 0000000000000000 0000000000000000 00000000ffffff80
[Thu Jul 25 15:49:25 2019] raw: c00a0008000b99a0 c00a00080013e5a0 0000000000000000 0000000000000000
[Thu Jul 25 15:49:25 2019] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[Thu Jul 25 15:49:25 2019] ------------[ cut here ]------------
[Thu Jul 25 15:49:25 2019] kernel BUG at ./include/linux/mm.h:473!
[Thu Jul 25 15:49:25 2019] Oops: Exception in kernel mode, sig: 5 [#1]
[Thu Jul 25 15:49:25 2019] LE SMP NR_CPUS=1024 NUMA PowerNV
[Thu Jul 25 15:49:25 2019] Modules linked in: i2c_dev joydev at24 ofpart ipmi_powernv ipmi_devintf powernv_flash ipmi_msghandler mtd opal_prd i2c_opal nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc binfmt_misc usb_storage ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ixgbe ttm mpt3sas drm mdio ptp pps_core raid_class scsi_transport_sas i2c_core
[Thu Jul 25 15:49:25 2019] CPU: 113 PID: 26411 Comm: java Not tainted 4.14.49-6.ppc64le #1
[Thu Jul 25 15:49:25 2019] task: c0002001a1040000 task.stack: c0000000d6560000
[Thu Jul 25 15:49:25 2019] NIP:  c000000000068e8c LR: c000000000068e88 CTR: 000000003003ea30
[Thu Jul 25 15:49:25 2019] REGS: c0000000d65635f0 TRAP: 0700   Not tainted  (4.14.49-6.ppc64le)
[Thu Jul 25 15:49:25 2019] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 42024422  XER: 20040000
[Thu Jul 25 15:49:25 2019] CFAR: c0000000002d017c SOFTE: 1 
GPR00: c000000000068e88 c0000000d6563870 c0000000013f4000 000000000000003e 
GPR04: 0000000000000000 c00000000009cc84 900000000280b033 0000000035ce8058 
GPR08: 0000000000000001 0000000000000007 0000000000000006 9000000002803003 
GPR12: 0000000000004000 c000000007d8db00 0000000000000001 c0000000d6563a50 
GPR16: 00007fff80000000 c000000059260548 00007fff7fffffff 8000200e026f0005 
GPR20: 0000000000000000 0000000000000001 00007fff38e00000 c00000000159aee8 
GPR24: 00007fffab0bffff 60000000000000e0 c00000000159aef0 c000000000000000 
GPR28: 00007fffab0c0000 c000200e026f0000 c0000000d6563a50 c00a000803809bc0 
[Thu Jul 25 15:49:25 2019] NIP [c000000000068e8c] pte_fragment_free+0xcc/0xd0
[Thu Jul 25 15:49:25 2019] LR [c000000000068e88] pte_fragment_free+0xc8/0xd0
[Thu Jul 25 15:49:25 2019] Call Trace:
[Thu Jul 25 15:49:25 2019] [c0000000d6563870] [c000000000068e88] pte_fragment_free+0xc8/0xd0 (unreliable)
[Thu Jul 25 15:49:25 2019] [c0000000d65638a0] [c0000000002d5dd8] tlb_remove_table+0x108/0x140
[Thu Jul 25 15:49:25 2019] [c0000000d65638e0] [c000000000068ec4] pgtable_free_tlb+0x34/0x50
[Thu Jul 25 15:49:25 2019] [c0000000d6563900] [c0000000002d627c] free_pgd_range+0x39c/0x850
[Thu Jul 25 15:49:25 2019] [c0000000d65639d0] [c0000000002d6884] free_pgtables+0x154/0x200
[Thu Jul 25 15:49:25 2019] [c0000000d6563a30] [c0000000002e6804] exit_mmap+0xc4/0x1f0
[Thu Jul 25 15:49:25 2019] [c0000000d6563af0] [c0000000000fb208] mmput+0xb8/0x1f0
[Thu Jul 25 15:49:25 2019] [c0000000d6563b20] [c0000000001053b8] do_exit+0x358/0xcd0
[Thu Jul 25 15:49:25 2019] [c0000000d6563be0] [c000000000105e04] do_group_exit+0x64/0x100
[Thu Jul 25 15:49:25 2019] [c0000000d6563c20] [c000000000115b68] get_signal+0x218/0x730
[Thu Jul 25 15:49:25 2019] [c0000000d6563d10] [c00000000001cbc0] do_signal+0x80/0x2e0
[Thu Jul 25 15:49:25 2019] [c0000000d6563e00] [c00000000001cfb4] do_notify_resume+0xd4/0x100
[Thu Jul 25 15:49:25 2019] [c0000000d6563e30] [c00000000000c644] ret_from_except_lite+0x70/0x74
[Thu Jul 25 15:49:25 2019] Instruction dump:
[Thu Jul 25 15:49:25 2019] 60420000 f89f0008 7fe3fb78 38800008 4825101d 60000000 4bffffbc 3c82ff90 
[Thu Jul 25 15:49:25 2019] 7fe3fb78 388403c8 48267435 60000000 <0fe00000> 3c4c0139 3842b170 7c0802a6 
[Thu Jul 25 15:49:25 2019] ---[ end trace 1ada8642a1cd292e ]---

[Thu Jul 25 15:49:27 2019] Fixing recursive fault but reboot is needed!