cpu lockup when cpu idle

liyi-ibm commented 5 years ago

@vaidy Is it a CPU hard lockup?
[Tue Dec 11 23:30:19 2018] Watchdog CPU:97 Hard LOCKUP
[Tue Dec 11 23:30:19 2018] Modules linked in: i2c_dev ses enclosure scsi_transport_sas ipmi_powernv at24 ipmi_devintf ofpart ipmi_msghandler powernv_flash i2c_opal mtd opal_prd nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc xfs libcrc32c joydev ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e i2c_core aacraid ptp pps_core
[Tue Dec 11 23:30:19 2018] CPU: 97 PID: 0 Comm: swapper/97 Not tainted 4.14.49-4.ppc64le #1
[Tue Dec 11 23:30:19 2018] task: c000201cb740c180 task.stack: c000201cbd158000
[Tue Dec 11 23:30:19 2018] NIP:  c00000000000b464 LR: c000000000016224 CTR: c000000000141910
[Tue Dec 11 23:30:19 2018] REGS: c000201cbd15b6e0 TRAP: 0e81   Not tainted  (4.14.49-4.ppc64le)
[Tue Dec 11 23:30:19 2018] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42004284  XER: 20040000
[Tue Dec 11 23:30:19 2018] CFAR: c000000000138d90 SOFTE: 1
GPR00: c000000000af9e38 c000201cbd15b960 c0000000013f3b00 0000000028000000
GPR04: 0000000000000000 0000000000263ba4 0000000000000001 0000000000000000
GPR08: 0000000000000000 0000000000000378 c0002009d766a800 c000201cc6f45910
GPR12: 0000000000000000 c00000000fd82b00
[Tue Dec 11 23:30:19 2018] NIP [c00000000000b464] replay_interrupt_return+0x0/0x4
[Tue Dec 11 23:30:19 2018] LR [c000000000016224] arch_local_irq_restore.part.12+0x84/0xb0
[Tue Dec 11 23:30:19 2018] Call Trace:
[Tue Dec 11 23:30:19 2018] [c000201cbd15b960] [c000000000140864] vtime_account_irq_enter+0x64/0x80 (unreliable)
[Tue Dec 11 23:30:19 2018] [c000201cbd15b980] [c000000000af9e38] __do_softirq+0x138/0x424
[Tue Dec 11 23:30:19 2018] [c000201cbd15ba70] [c000000000105d28] irq_exit+0x138/0x150
[Tue Dec 11 23:30:19 2018] [c000201cbd15ba90] [c0000000000245ac] timer_interrupt+0xac/0xe0
[Tue Dec 11 23:30:19 2018] [c000201cbd15bac0] [c0000000000094f0] decrementer_common+0x180/0x190
[Tue Dec 11 23:30:19 2018] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
   LR = arch_local_irq_restore.part.12+0x84/0xb0
[Tue Dec 11 23:30:19 2018] [c000201cbd15bdb0] [c000201cbd15be30] 0xc000201cbd15be30 (unreliable)
[Tue Dec 11 23:30:19 2018] [c000201cbd15bdd0] [c0000000008dd678] cpuidle_enter_state+0x128/0x410
[Tue Dec 11 23:30:19 2018] [c000201cbd15be30] [c00000000015df9c] call_cpuidle+0x4c/0x90
[Tue Dec 11 23:30:19 2018] [c000201cbd15be50] [c00000000015e3b0] do_idle+0x2c0/0x370
[Tue Dec 11 23:30:19 2018] [c000201cbd15bec0] [c00000000015e64c] cpu_startup_entry+0x3c/0x50
[Tue Dec 11 23:30:19 2018] [c000201cbd15bef0] [c00000000004938c] start_secondary+0x4ec/0x530
[Tue Dec 11 23:30:19 2018] [c000201cbd15bf90] [c00000000000b86c] start_secondary_prolog+0x10/0x14
[Tue Dec 11 23:30:19 2018] Instruction dump:
[Tue Dec 11 23:30:19 2018] 7d8000a6 e9628008 7d200026 618c8000 2c030900 4182df28 2c030500 4182ee70
[Tue Dec 11 23:30:19 2018] 2c030a00 4182ffa4 2c030e60 4182eb40 <4e800020> 7c781b78 48000419 48000431
[Tue Dec 11 23:30:19 2018] Watchdog CPU:97 became unstuck

liyi-ibm commented 5 years ago

N: That's actually a known bug somewhere in the timer subsystem. I don't think we have a root cause, because it's difficult to reproduce or trace. The watchdog timer gets lost when the CPU goes idle. It seems to be pretty harmless, so we can ignore it for now.

V: This does not look to be related to cgroup cfs balancing. I did not know about the new scenario that you mentioned. I will work on this as an independent issue.

N: No, it's not; it's some bug we haven't been able to track down. It happens without cgroups at all. It doesn't appear to be too harmful, though, and doesn't seem to cause real lockups.

liyi-ibm commented 5 years ago

V: The cfs settings can help in solving the lockup caused by the cfs scheduler. You have not hit that yet, but you will if you run long enough.

A hard lockup message only means that a CPU spent a very long time (10 seconds) doing something without running any workload. We are solving each scenario in which a CPU spends a lot of time this way, so the hard lockup message is a way to observe what the CPU was doing (memory allocation work, the CPU scheduler, etc.). The message you got above is a missed timer: the CPU was really idle, and the trace does not point to a problem where it was stuck doing some kernel work. Hence it can be ignored. You should continue your tests with different loads and utilization levels to be sure that we can hit the problem and also solve it with different settings.
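To make the "really idle" distinction easier to apply when scanning many reports, here is a rough Python sketch that flags a hard lockup report as likely idle when the task is `swapper/N` with PID 0 and the call trace goes through the idle path. The function name `classify_lockup`, the regular expressions, and the list of idle frames are assumptions made for illustration, based only on the trace in this issue; this is a heuristic, not something the kernel defines.

```python
import re

# Frames taken as evidence of the idle path (assumption based on the trace above).
IDLE_FRAMES = ("cpuidle_enter_state", "do_idle")

LOCKUP_RE = re.compile(r"Watchdog CPU:(\d+) Hard LOCKUP")
COMM_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
# Matches "[stack addr] [code addr] symbol+0xoff/0xsize" call trace entries.
FRAME_RE = re.compile(r"\[[0-9a-f]{16}\] \[[0-9a-f]{16}\] ([\w.]+)\+0x")


def classify_lockup(report: str) -> dict:
    """Heuristically decide whether a hard lockup report came from an idle CPU."""
    lockup = LOCKUP_RE.search(report)
    comm = COMM_RE.search(report)
    frames = FRAME_RE.findall(report)

    # The idle task shows up as PID 0 with a "swapper/<cpu>" command name.
    is_idle_task = (
        comm is not None
        and comm.group(2) == "0"
        and comm.group(3).startswith("swapper/")
    )
    in_idle_path = any(frame in IDLE_FRAMES for frame in frames)

    return {
        "cpu": int(lockup.group(1)) if lockup else None,
        "comm": comm.group(3) if comm else None,
        "likely_idle": is_idle_task and in_idle_path,
    }


# A few lines copied from the report in this issue.
SAMPLE = """\
[Tue Dec 11 23:30:19 2018] Watchdog CPU:97 Hard LOCKUP
[Tue Dec 11 23:30:19 2018] CPU: 97 PID: 0 Comm: swapper/97 Not tainted 4.14.49-4.ppc64le #1
[Tue Dec 11 23:30:19 2018] [c000201cbd15bdd0] [c0000000008dd678] cpuidle_enter_state+0x128/0x410
[Tue Dec 11 23:30:19 2018] [c000201cbd15be50] [c00000000015e3b0] do_idle+0x2c0/0x370
"""

if __name__ == "__main__":
    print(classify_lockup(SAMPLE))
```

A report flagged as likely idle would match the missed-timer case described above; anything else is still worth reading by hand to see what kernel work the CPU was doing.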

liyi-ibm / linux

cpu lockup when cpu idle #13