ut-osa / ingens

Coordinated and Efficient Huge Page Management with Ingens
42 stars 16 forks source link

Soft lockup #1

Open WeixiZhu94 opened 5 years ago

WeixiZhu94 commented 5 years ago

I constantly get the kernel stuck when profiling benchmarks. Below is the dmesg:

[ 1371.302325] general protection fault: 0000 [#1] SMP [ 1371.302358] Modules linked in: snd_hda_codec_hdmi(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) i915(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) x86_pkg_temp_thermal(E) coretemp(E) crct10dif_pclmul(E) crc32_pclmul(E) cryptd(E) snd_timer(E) snd(E) soundcore(E) joydev(E) shpchp(E) video(E) acpi_pad(E) binfmt_misc(E) sch_fq_codel(E) autofs4(E) hid_generic(E) i40e(E) e1000e(E) igb(E) i2c_algo_bit(E) vxlan(E) dca(E) ip6_udp_tunnel(E) udp_tunnel(E) usbhid(E) i2c_hid(E) ahci(E) ptp(E) hid(E) libahci(E) nvme(E) pps_core(E) i2c_core(E) [ 1371.302689] CPU: 0 PID: 68 Comm: khugepaged Tainted: G W E 4.3.0-ingens #1 [ 1371.302724] Hardware name: Supermicro SYS-5039A-IL/X11SAE, BIOS 2.1 09/12/2017 [ 1371.302757] task: ffff8808693b1e00 ti: ffff8808693b8000 task.ti: ffff8808693b8000 [ 1371.302790] RIP: 0010:[] [] collect_mm_slot+0x47/0x90 [ 1371.302832] RSP: 0018:ffff8808693bbdc8 EFLAGS: 00010246 [ 1371.302856] RAX: dead000000000200 RBX: ffff8808681e1800 RCX: 0000000000000001 [ 1371.302887] RDX: dead000000000100 RSI: ffff8808072c8270 RDI: ffff8808072c8270 [ 1371.302917] RBP: ffff8808693bbdd0 R08: 0000000000000000 R09: ffff88086d803500 [ 1371.302948] R10: 0000000000000001 R11: ffffea004ba90e00 R12: ffffffff82400ee0 [ 1371.302979] R13: ffffffff82400ee0 R14: ffff880868303300 R15: 0000000000000000 [ 1371.303010] FS: 0000000000000000(0000) GS:ffff88086e400000(0000) knlGS:0000000000000000 [ 1371.303046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1371.303071] CR2: 0000557bb0f54d80 CR3: 0000000003011000 CR4: 00000000003406f0 [ 1371.303102] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1371.303133] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1371.303163] Stack: [ 1371.303174] 0000000000000001 ffff8808693bbec8 ffffffff812216c0 ffff8808693b1e00 [ 1371.303213] ffff8808693b1e00 ffff8808693bbe90 0000000804d8d067 ffff8808072c8270 [ 1371.303251] 00000fff01062673 ffffea00440dc000 ffff8808681e1868 ffffea00482431b0 [ 1371.303290] Call Trace: [ 1371.303304] [] khugepaged+0x880/0x1750 [ 1371.303331] [] ? prepare_to_wait_event+0xf0/0xf0 [ 1371.304174] [] ? maybe_pmd_mkwrite+0x30/0x30 [ 1371.305015] [] kthread+0xc9/0xe0 [ 1371.305856] [] ? kthread_park+0x60/0x60 [ 1371.306713] [] ret_from_fork+0x3f/0x70 [ 1371.307561] [] ? kthread_park+0x60/0x60 [ 1371.308408] Code: 74 1e 48 8b 17 48 85 d2 48 89 10 74 04 48 89 42 08 48 c7 07 00 00 00 00 48 c7 47 08 00 00 00 00 48 8b 57 10 48 8b 47 18 48 89 fe <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 47 10 [ 1371.310705] RIP [] collect_mm_slot+0x47/0x90 [ 1371.311125] RSP [ 1371.312173] ---[ end trace bee778246de6fdee ]--- [ 1399.873535] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [java:11825] [ 1399.874146] Modules linked in: snd_hda_codec_hdmi(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) i915(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) x86_pkg_temp_thermal(E) coretemp(E) crct10dif_pclmul(E) crc32_pclmul(E) cryptd(E) snd_timer(E) snd(E) soundcore(E) joydev(E) shpchp(E) video(E) acpi_pad(E) binfmt_misc(E) sch_fq_codel(E) autofs4(E) hid_generic(E) i40e(E) e1000e(E) igb(E) i2c_algo_bit(E) vxlan(E) dca(E) ip6_udp_tunnel(E) udp_tunnel(E) usbhid(E) i2c_hid(E) ahci(E) ptp(E) hid(E) libahci(E) nvme(E) pps_core(E) i2c_core(E) [ 1399.878527] CPU: 4 PID: 11825 Comm: java Tainted: G D W E 4.3.0-ingens #1 [ 1399.879156] Hardware name: Supermicro SYS-5039A-IL/X11SAE, BIOS 2.1 09/12/2017 [ 1399.879794] task: ffff880805f29e00 ti: ffff8808041e8000 task.ti: ffff8808041e8000 [ 1399.880438] RIP: 0010:[] [] native_queued_spin_lock_slowpath+0x15e/0x170 [ 1399.881745] RSP: 0000:ffff8808041ebdb0 EFLAGS: 00000202 [ 1399.882399] RAX: 0000000000000101 RBX: ffff880863cdfc60 RCX: 0000000000000001 [ 1399.883058] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffffff823eefc4 [ 1399.883712] RBP: ffff8808041ebdb0 R08: 0000000000000101 R09: ffff880863cdfc60 [ 1399.884367] R10: 0000000000000055 R11: ffff880000000000 R12: ffff880868363800 [ 1399.885030] R13: ffffea00478a6280 R14: 00007f0d5d000000 R15: ffff880806d7fcc0 [ 1399.885687] FS: 00007f0d73945700(0000) GS:ffff88086e500000(0000) knlGS:0000000000000000 [ 1399.886348] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1399.887008] CR2: 00007f0d5d000000 CR3: 000000080636b000 CR4: 00000000003406e0 [ 1399.887673] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1399.888328] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1399.888979] Stack: [ 1399.889618] ffff8808041ebdc0 ffffffff811b9ee3 ffff8808041ebdd0 ffffffff818ce0a0 [ 1399.890288] ffff8808041ebe00 ffffffff81224028 ffff880868363800 ffff8808595de000 [ 1399.890963] ffff880868363800 ffffea00478a6280 ffff8808041ebec0 ffffffff811ee0eb [ 1399.891635] Call Trace: [ 1399.892298] [] queued_spin_lock_slowpath+0xb/0xf [ 1399.892977] [] _raw_spin_lock+0x20/0x30 [ 1399.893641] [] khugepaged_enter+0x58/0x160 [ 1399.894290] [] handle_mm_fault+0x1bdb/0x1c70 [ 1399.894926] [] __do_page_fault+0x19a/0x430 [ 1399.895551] [] do_page_fault+0x30/0x80 [ 1399.896171] [] page_fault+0x28/0x30 [ 1399.896786] Code: 01 48 8b 01 48 85 c0 75 0a f3 90 48 8b 01 48 85 c0 74 f6 c7 40 08 01 00 00 00 e9 61 ff ff ff 83 fa 01 75 07 e9 c2 fe ff ff f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 66 90 0f 1f 44 [ 1427.868777] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [java:11825] [ 1427.869417] Modules linked in: snd_hda_codec_hdmi(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) i915(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) x86_pkg_temp_thermal(E) coretemp(E) crct10dif_pclmul(E) crc32_pclmul(E) cryptd(E) snd_timer(E) snd(E) soundcore(E) joydev(E) shpchp(E) video(E) acpi_pad(E) binfmt_misc(E) sch_fq_codel(E) autofs4(E) hid_generic(E) i40e(E) e1000e(E) igb(E) i2c_algo_bit(E) vxlan(E) dca(E) ip6_udp_tunnel(E) udp_tunnel(E) usbhid(E) i2c_hid(E) ahci(E) ptp(E) hid(E) libahci(E) nvme(E) pps_core(E) i2c_core(E) [ 1427.873932] CPU: 4 PID: 11825 Comm: java Tainted: G D W EL 4.3.0-ingens #1 [ 1427.874570] Hardware name: Supermicro SYS-5039A-IL/X11SAE, BIOS 2.1 09/12/2017 [ 1427.875203] task: ffff880805f29e00 ti: ffff8808041e8000 task.ti: ffff8808041e8000 [ 1427.875839] RIP: 0010:[] [] native_queued_spin_lock_slowpath+0x15e/0x170 [ 1427.877136] RSP: 0000:ffff8808041ebdb0 EFLAGS: 00000202 [ 1427.877793] RAX: 00000000001c0101 RBX: ffff880863cdfc60 RCX: 0000000000000001 [ 1427.878456] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffffff823eefc4 [ 1427.879120] RBP: ffff8808041ebdb0 R08: 0000000000000101 R09: ffff880863cdfc60 [ 1427.879774] R10: 0000000000000055 R11: ffff880000000000 R12: ffff880868363800 [ 1427.880422] R13: ffffea00478a6280 R14: 00007f0d5d000000 R15: ffff880806d7fcc0 [ 1427.881074] FS: 00007f0d73945700(0000) GS:ffff88086e500000(0000) knlGS:0000000000000000 [ 1427.881732] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1427.882396] CR2: 00007f0d5d000000 CR3: 000000080636b000 CR4: 00000000003406e0 [ 1427.883063] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1427.883719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1427.884372] Stack: [ 1427.885023] ffff8808041ebdc0 ffffffff811b9ee3 ffff8808041ebdd0 ffffffff818ce0a0 [ 1427.885691] ffff8808041ebe00 ffffffff81224028 ffff880868363800 ffff8808595de000 [ 1427.886350] ffff880868363800 ffffea00478a6280 ffff8808041ebec0 ffffffff811ee0eb [ 1427.887010] Call Trace: [ 1427.887661] [] queued_spin_lock_slowpath+0xb/0xf [ 1427.888326] [] _raw_spin_lock+0x20/0x30 [ 1427.889001] [] khugepaged_enter+0x58/0x160 [ 1427.889669] [] handle_mm_fault+0x1bdb/0x1c70 [ 1427.890338] [] __do_page_fault+0x19a/0x430 [ 1427.890992] [] do_page_fault+0x30/0x80 [ 1427.891628] [] page_fault+0x28/0x30 [ 1427.892246] Code: 01 48 8b 01 48 85 c0 75 0a f3 90 48 8b 01 48 85 c0 74 f6 c7 40 08 01 00 00 00 e9 61 ff ff ff 83 fa 01 75 07 e9 c2 fe ff ff f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 66 90 0f 1f 44

WeixiZhu94 commented 5 years ago

Another bug:

After running several benchmarks, this soft lock-up bug appears randomly when shutting down the machine:

A stop job is running for Network Time Synchronization (26s / 1min 32s) NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! A stop job is running for Raise network interfaces (54s / 1min 30s) NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! A stop job is running for ifup for em1 ...

Please Check the screenshot attached:

bug

yuhc commented 5 years ago

Hi Weixi,

What Java benchmark are you running? It looks like the spinlock is held quite for a while by some other thread.

Because this project isn't not in active maintenance and I don't have a spare machine right now to install and test ingens and your benchmark, it may take some time for me to find the problem.

WeixiZhu94 commented 5 years ago

It is SpecJVM 2008. I tested eclipse and jython. Ingens was tuned to use util-threshold of 0 and I enabled the proactive memory compaction.

You can try this command: /usr/bin/time -v java -jar ../dacapo-9.12-MR1-bach.jar \ eclipse --iterations=10

Or /usr/bin/time -v java -jar ../dacapo-9.12-MR1-bach.jar \ jython --iterations=10

From my experience, the bug happens quite randomly… I installed Ingens on Ubuntu server, 14.04 and 18.03. Bugs can happen on both cases.

-Weixi