openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.45k stars 1.73k forks source link

System lockup when rsync'ing files between zpools #7867

Closed ccoager closed 5 years ago

ccoager commented 6 years ago

System information

Distribution Name | Ubuntu Distribution Version | 18.04.1 Linux Kernel | 4.15.0-33-generic Architecture | x86_64 ZFS Version | 0.7.5-1ubuntu16.3, 0.7.9, 0.7.10 SPL Version | 0.7.5-1ubuntu2, 0.7.9, 0.7.10

Motherboard | M5A88-M CPU | AMD FX 4100 RAM | 16GB

Describe the problem you're observing

System is hard locking up when transferring files between zpools. The source zpool is about 3tb with some large files (20-30GB).

Describe how to reproduce the problem

rsync -iav /source /destination

Include any warning/errors/backtraces from the system logs

There are no messages in dmesg, nothing logged to syslog, nothing on the console and sysrq does not respond. EDIT: See comment below for syslog dump.

Some history about this server. I've had it for a number of years. It was previously running Ubuntu 16.04 which was stable. The issue started happening once I upgraded to Ubuntu 18.04. The zpools were upgraded after the distribution was upgraded. The issue occurs after about an hour of transferring files which is reproducible. If I do not rsync files between zpools the server remains stable and does not lockup.

Source zpool: pool: MJ1311YNG21HAA state: ONLINE scan: none requested config:

    NAME              STATE     READ WRITE CKSUM
    MJ1311YNG21HAA    ONLINE       0     0     0
      MJ1311YNG21HAA  ONLINE       0     0     0

errors: No known data errors

Destination zpool: pool: tank2018 state: ONLINE scan: scrub repaired 36K in 42h53m with 0 errors on Mon Aug 20 06:11:52 2018 config:

    NAME                                              STATE     READ WRITE CKSUM
    tank2018                                          ONLINE       0     0     0
      raidz1-0                                        ONLINE       0     0     0
        scsi-3600605b00c6e4e70ff000021021db69f-part1  ONLINE       0     0     0
        scsi-3600605b00c6e4e70ff000021021db450-part1  ONLINE       0     0     0
        scsi-3600605b00c6e4e70ff000021021db49d-part1  ONLINE       0     0     0
        scsi-3600605b00c6e4e70ff000021021db4ca-part1  ONLINE       0     0     0
        scsi-3600605b00c6e4e70ff000021021db4f3-part1  ONLINE       0     0     0
        scsi-3600605b00c6e4e70ff000021021db520-part1  ONLINE       0     0     0
        scsi-3600605b00c6e4e70ff000021021db54b-part1  ONLINE       0     0     0
        scsi-3600605b00c6e4e70ff000021021db579-part1  ONLINE       0     0     0

errors: No known data errors

rincebrain commented 6 years ago

This could be a whole bunch of things, really.

If you can, what'd be really helpful is if you could boot the machine without X and with automatic blanking on the console screen turned off, run the rsync until it hangs, and then in theory it should print some nasty message to the console screen when it locks up that badly.

(I'm assuming you already dug around in journalctl / /var/log/[...] to confirm it didn't successfully save anything from said mashup?)

ccoager commented 6 years ago

X is not running and screen blanking is already disabled. The screen still goes blank when it locks up, and of course, unresponsive. Yeah, I've looked through the relevant logs, syslog and kern, but nothing gets logged when it locks up.

skalteis commented 6 years ago

You could try pressing Alt-SysRq-l (lowercase L) to maybe still get a stack trace for all CPUs. [hopefully Ubuntu compiles its kernel with magic sysrq on...]

ccoager commented 6 years ago

Sysrq is enabled and does work on this server. However, when it is locked up like this, none of the sysrq commands work.

rincebrain commented 6 years ago

@ccoager does the monitor have an LED to say "no signal" versus "on but all dark"? Because the former sounds like what's happening, and that's exciting.

It might make sense to try setting up kdump (the Linux kernel core dump tool) and seeing if that can successfully trigger and save when the world goes to hell. Otherwise I'd be suggesting finding a serial port to output the debug info to.

ccoager commented 6 years ago

@rincebrain The monitor does not say "no signal". The color changes from a dark black to a lighter shade with no text on it.

I'll look into kdump and the serial port.

ccoager commented 6 years ago

@rincebrain Doesn't look like kdump is writing to disk, probably because of #6476.

rincebrain commented 6 years ago

@ccoager Doesn't #6476 suggest a simple workaround of editing an initramfs config file?

ccoager commented 6 years ago

@rincebrain I've done the necessary from all the documentation I've read. When I trigger a crash via sysrq, it doesn't write to disk unfortunately.

rincebrain commented 6 years ago

@ccoager did you regenerate the kdump initramfs after doing it?

ccoager commented 6 years ago

@rincebrain Yes

ccoager commented 6 years ago

Just tested 0.7.9, server locked up on this version as well. I noticed the free memory depletes as time goes on. Same issue as #3677?

rincebrain commented 6 years ago

@ccoager that'd be...pretty astonishing, the memory code has churned a fair bit in the interim.

What does /proc/spl/kstat/zfs/arcstats say when you start the rsync, and then a few minutes into it, and so on until fire?

ccoager commented 6 years ago

I captured a lot of logs for top output, free and arcstats at 10 minute intervals until the server crashed. stats.log.gz

ccoager commented 6 years ago

@rincebrain I booted up the sysresccd-5.2.0_zfs_0.7.9.iso LiveCD and retested. Same behavior as Ubuntu 18.04, server locks up, console goes blank, network goes offline and sysrq doesn't respond.

ccoager commented 6 years ago

@rincebrain Tried spl_taskq_thread_dynamic = 0, still crashed.

ccoager commented 6 years ago

Replaced the motherboard and CPU and tested the RAM for 30+ hours. Still crashing.

ccoager commented 6 years ago

@rincebrain Flamegraph for perf data captured right before crash. kernel.zip

ccoager commented 6 years ago

@rincebrain Flamegraph after reboot kernel.afterreboot.svg.zip

ccoager commented 6 years ago

@rincebrain Flamegraph after reboot kernel.zip

ccoager commented 6 years ago

@rincebrain Finally captured a crash in syslog. crash_09-13-2018.txt

creshal commented 6 years ago

We're seeing the issue on two different machines as well, reproducible on ZFS 0.7.9 and .10. So far I haven't ben able to actually collect a crash dump; even with nmi_watchdog enabled kdump doesn't seem to trigger.

ccoager commented 6 years ago

@creshal It took literally about 50 crashes before I actually got one to write to syslog before dying. Keep trying and checking! Can you also provide some details on your machine? Hardware? OS distro?

ccoager commented 6 years ago

Last few kernel messages before latest crash:

Sep 16 15:28:42 gaia kernel: [13534.319674] INFO: task nagios3:23609 blocked for more than 120 seconds. Sep 16 15:28:42 gaia kernel: [13534.319830] Tainted: P B OEL 4.15.18 #2 Sep 16 15:28:42 gaia kernel: [13534.319947] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 16 15:28:42 gaia kernel: [13534.320173] nagios3 D 0 23609 23034 0x00000120 Sep 16 15:28:42 gaia kernel: [13534.320179] Call Trace: Sep 16 15:28:42 gaia kernel: [13534.320188] schedule+0x469/0x1180 Sep 16 15:28:42 gaia kernel: [13534.320194] ? firmware_map_remove+0x187/0x187 Sep 16 15:28:42 gaia kernel: [13534.320484] ? zfs_get_done+0xc0/0xc0 [zfs] Sep 16 15:28:42 gaia kernel: [13534.320490] ? atomic_t_wait+0x90/0x90 Sep 16 15:28:42 gaia kernel: [13534.320494] schedule+0x54/0x130 Sep 16 15:28:42 gaia kernel: [13534.320499] bit_wait+0x14/0x90 Sep 16 15:28:42 gaia kernel: [13534.320504] wait_on_bit+0x71/0xe0 Sep 16 15:28:42 gaia kernel: [13534.320508] ? atomic_t_wait+0x90/0x90 Sep 16 15:28:42 gaia kernel: [13534.320513] ? atomic_t_wait+0x90/0x90 Sep 16 15:28:42 gaia kernel: [13534.320518] inode_wait_for_writeback+0x13a/0x1b0 Sep 16 15:28:42 gaia kernel: [13534.320523] ? inode_switch_wbs_rcu_fn+0x130/0x130 Sep 16 15:28:42 gaia kernel: [13534.320529] ? wake_up_atomic_t+0x50/0x50 Sep 16 15:28:42 gaia kernel: [13534.320534] ? dentry_unlink_inode+0x27e/0x330 Sep 16 15:28:42 gaia kernel: [13534.320540] inode_wait_for_writeback+0x26/0x40 Sep 16 15:28:42 gaia kernel: [13534.320545] evict+0x20a/0x470 Sep 16 15:28:42 gaia kernel: [13534.320550] ? destroy_inode+0x100/0x100 Sep 16 15:28:42 gaia kernel: [13534.320555] iput+0x270/0x4e0 Sep 16 15:28:42 gaia kernel: [13534.320561] do_unlinkat+0x2d9/0x600 Sep 16 15:28:42 gaia kernel: [13534.320566] ? SyS_rmdir+0x20/0x20 Sep 16 15:28:42 gaia kernel: [13534.320571] ? digsig_verify+0xc40/0xc40 Sep 16 15:28:42 gaia kernel: [13534.320576] ? kasan_kmalloc+0xad/0xe0 Sep 16 15:28:42 gaia kernel: [13534.320582] ? getname_flags+0x78/0x310 Sep 16 15:28:42 gaia kernel: [13534.320587] ? getname_flags+0x118/0x310 Sep 16 15:28:42 gaia kernel: [13534.320592] ? SyS_unlinkat+0x50/0x50 Sep 16 15:28:42 gaia kernel: [13534.320597] SyS_unlink+0x1f/0x30 Sep 16 15:28:42 gaia kernel: [13534.320602] do_syscall_64+0x101/0x2c0 Sep 16 15:28:42 gaia kernel: [13534.320607] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Sep 16 15:28:42 gaia kernel: [13534.320611] RIP: 0033:0x7f00353b0a97 Sep 16 15:28:42 gaia kernel: [13534.320613] RSP: 002b:00007ffef1d306e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000057 Sep 16 15:28:42 gaia kernel: [13534.320618] RAX: ffffffffffffffda RBX: 00007ffef1d308d0 RCX: 00007f00353b0a97 Sep 16 15:28:42 gaia kernel: [13534.320621] RDX: 000055d112e67660 RSI: 00007f003567cb38 RDI: 00007ffef1d308d0 Sep 16 15:28:42 gaia kernel: [13534.320623] RBP: 000055d112e638a0 R08: 000055d112e638a0 R09: 0000000000000000 Sep 16 15:28:42 gaia kernel: [13534.320625] R10: 3030323b30303531 R11: 0000000000000246 R12: 00007ffef1d308d0 Sep 16 15:28:42 gaia kernel: [13534.320628] R13: 0000000000000001 R14: 000055d112e70380 R15: 0000000000000000 Sep 16 15:28:42 gaia kernel: [13534.320960] INFO: task sync:19647 blocked for more than 120 seconds. Sep 16 15:28:42 gaia kernel: [13534.321106] Tainted: P B OEL 4.15.18 #2 Sep 16 15:28:42 gaia kernel: [13534.321225] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 16 15:28:42 gaia kernel: [13534.321391] sync D 0 19647 23033 0x00000000 Sep 16 15:28:42 gaia kernel: [13534.321395] Call Trace: Sep 16 15:28:42 gaia kernel: [13534.321401] __schedule+0x469/0x1180 Sep 16 15:28:42 gaia kernel: [13534.321407] ? firmware_map_remove+0x187/0x187 Sep 16 15:28:42 gaia kernel: [13534.321472] schedule+0x54/0x130 Sep 16 15:28:42 gaia kernel: [13534.321476] wb_wait_for_completion+0xe1/0x130 Sep 16 15:28:42 gaia kernel: [13534.321481] ? blockdumpmark_inode_dirty+0x1c0/0x1c0 Sep 16 15:28:42 gaia kernel: [13534.321485] ? switch_to_asm+0x34/0x70 Sep 16 15:28:42 gaia kernel: [13534.321489] ? wait_woken+0x180/0x180 Sep 16 15:28:42 gaia kernel: [13534.321494] ? schedule+0x499/0x1180 Sep 16 15:28:42 gaia kernel: [13534.321499] sync_inodes_sb+0x157/0x1e0 Sep 16 15:28:42 gaia kernel: [13534.321504] ? try_to_writeback_inodes_sb+0x50/0x50 Sep 16 15:28:42 gaia kernel: [13534.321510] ? rcu_sched_qs.part.47+0x70/0x70 Sep 16 15:28:42 gaia kernel: [13534.321516] ? down_read+0x82/0x100 Sep 16 15:28:42 gaia kernel: [13534.321521] ? down_write_killable+0x100/0x100 Sep 16 15:28:42 gaia kernel: [13534.321525] ? _raw_spin_unlock_bh+0x1e/0x20 Sep 16 15:28:42 gaia kernel: [13534.321529] ? do_sync_work+0x100/0x100 Sep 16 15:28:42 gaia kernel: [13534.321534] sync_inodes_one_sb+0x29/0x40 Sep 16 15:28:42 gaia kernel: [13534.321540] iterate_supers+0x101/0x230 Sep 16 15:28:42 gaia kernel: [13534.321544] ? sync_filesystem+0x90/0x90 Sep 16 15:28:42 gaia kernel: [13534.321549] sys_sync+0x9d/0x140 Sep 16 15:28:42 gaia kernel: [13534.321553] ? sync_filesystem+0x90/0x90 Sep 16 15:28:42 gaia kernel: [13534.321559] ? mm_fault_error+0x250/0x250 Sep 16 15:28:42 gaia kernel: [13534.321564] do_syscall_64+0x101/0x2c0 Sep 16 15:28:42 gaia kernel: [13534.321570] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Sep 16 15:28:42 gaia kernel: [13534.321573] RIP: 0033:0x7fdedfbaf287 Sep 16 15:28:42 gaia kernel: [13534.321575] RSP: 002b:00007fff8c7d9ed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 Sep 16 15:28:42 gaia kernel: [13534.321580] RAX: ffffffffffffffda RBX: 00007fff8c7da008 RCX: 00007fdedfbaf287 Sep 16 15:28:42 gaia kernel: [13534.321582] RDX: 00007fdedfe86001 RSI: 0000000000000000 RDI: 00007fdedfc4d4b0 Sep 16 15:28:42 gaia kernel: [13534.321584] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 Sep 16 15:28:42 gaia kernel: [13534.321587] R10: 0000557d3a27d010 R11: 0000000000000246 R12: 0000000000000001 Sep 16 15:28:42 gaia kernel: [13534.321589] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Sep 16 15:28:42 gaia kernel: [13534.321595] INFO: task cleanup:19672 blocked for more than 120 seconds. Sep 16 15:28:42 gaia kernel: [13534.321740] Tainted: P B OEL 4.15.18 #2 Sep 16 15:28:42 gaia kernel: [13534.321860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 16 15:28:42 gaia kernel: [13534.322038] cleanup D 0 19672 17838 0x00000120 Sep 16 15:28:42 gaia kernel: [13534.322042] Call Trace: Sep 16 15:28:42 gaia kernel: [13534.322048] schedule+0x469/0x1180 Sep 16 15:28:42 gaia kernel: [13534.322054] ? firmware_map_remove+0x187/0x187 Sep 16 15:28:42 gaia kernel: [13534.322060] ? aa_path_perm+0x1bb/0x270 Sep 16 15:28:42 gaia kernel: [13534.322065] schedule+0x54/0x130 Sep 16 15:28:42 gaia kernel: [13534.322091] cv_wait_common+0x1f6/0x240 [spl] Sep 16 15:28:42 gaia kernel: [13534.322117] ? cv_destroy+0x230/0x230 [spl] Sep 16 15:28:42 gaia kernel: [13534.322121] ? wait_woken+0x180/0x180 Sep 16 15:28:42 gaia kernel: [13534.322147] cv_wait+0x15/0x20 [spl] Sep 16 15:28:42 gaia kernel: [13534.322371] zil_commit.part.6+0x134/0x12d0 [zfs] Sep 16 15:28:42 gaia kernel: [13534.322396] ? tsd_hash_search+0xd4/0x170 [spl] Sep 16 15:28:42 gaia kernel: [13534.322483] ? tsd_set+0x4da/0xc60 [spl] Sep 16 15:28:42 gaia kernel: [13534.322489] ? rcu_sched_qs.part.47+0x70/0x70 Sep 16 15:28:42 gaia kernel: [13534.322762] ? zil_claim_log_record+0x90/0x90 [zfs] Sep 16 15:28:42 gaia kernel: [13534.322847] ? _cond_resched+0x1c/0x70 Sep 16 15:28:42 gaia kernel: [13534.322852] ? mutex_lock+0x12/0x40 Sep 16 15:28:42 gaia kernel: [13534.323067] ? rrw_enter_read_impl+0x220/0x2a0 [zfs] Sep 16 15:28:42 gaia kernel: [13534.323285] zil_commit+0x37/0x50 [zfs] Sep 16 15:28:42 gaia kernel: [13534.323512] zfs_fsync+0x125/0x230 [zfs] Sep 16 15:28:42 gaia kernel: [13534.323743] zpl_fsync+0x10c/0x230 [zfs] Sep 16 15:28:42 gaia kernel: [13534.323753] vfs_fsync_range+0xab/0x1d0 Sep 16 15:28:42 gaia kernel: [13534.323976] ? zpl_readpage+0x260/0x260 [zfs] Sep 16 15:28:42 gaia kernel: [13534.323985] do_fsync+0x3d/0x70 Sep 16 15:28:42 gaia kernel: [13534.323997] ? SyS_syncfs+0xa0/0xa0 Sep 16 15:28:42 gaia kernel: [13534.324009] SyS_fsync+0x10/0x20 Sep 16 15:28:42 gaia kernel: [13534.324019] do_syscall_64+0x101/0x2c0 Sep 16 15:28:42 gaia kernel: [13534.324029] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Sep 16 15:28:42 gaia kernel: [13534.324039] RIP: 0033:0x7f6981040b10 Sep 16 15:28:42 gaia kernel: [13534.324047] RSP: 002b:00007ffd411d2588 EFLAGS: 00000246 ORIG_RAX: 000000000000004a Sep 16 15:28:42 gaia kernel: [13534.324065] RAX: ffffffffffffffda RBX: 000055794cfc2550 RCX: 00007f6981040b10 Sep 16 15:28:42 gaia kernel: [13534.324073] RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 000000000000000f Sep 16 15:28:42 gaia kernel: [13534.324084] RBP: 0000000000000001 R08: 0000000000000000 R09: 000000000000000f Sep 16 15:28:42 gaia kernel: [13534.324095] R10: 000000000000009a R11: 0000000000000246 R12: 000055794cfc10d0 Sep 16 15:28:42 gaia kernel: [13534.324105] R13: 000055794cfc1650 R14: 0000000000000001 R15: 000055794b53a3f0 Sep 16 15:30:43 gaia kernel: [13655.171349] INFO: task nagios3:23609 blocked for more than 120 seconds. Sep 16 15:30:43 gaia kernel: [13655.171507] Tainted: P B OEL 4.15.18 #2 Sep 16 15:30:43 gaia kernel: [13655.171621] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 16 15:30:43 gaia kernel: [13655.171786] nagios3 D 0 23609 23034 0x00000120 Sep 16 15:30:43 gaia kernel: [13655.171792] Call Trace: Sep 16 15:30:43 gaia kernel: [13655.171802] schedule+0x469/0x1180 Sep 16 15:30:43 gaia kernel: [13655.171808] ? firmware_map_remove+0x187/0x187 Sep 16 15:30:43 gaia kernel: [13655.172031] ? zfs_get_done+0xc0/0xc0 [zfs] Sep 16 15:30:43 gaia kernel: [13655.172058] ? atomic_t_wait+0x90/0x90 Sep 16 15:30:43 gaia kernel: [13655.172062] schedule+0x54/0x130 Sep 16 15:30:43 gaia kernel: [13655.172067] bit_wait+0x14/0x90 Sep 16 15:30:43 gaia kernel: [13655.172073] wait_on_bit+0x71/0xe0 Sep 16 15:30:43 gaia kernel: [13655.172077] ? atomic_t_wait+0x90/0x90 Sep 16 15:30:43 gaia kernel: [13655.172081] ? atomic_t_wait+0x90/0x90 Sep 16 15:30:43 gaia kernel: [13655.172086] inode_wait_for_writeback+0x13a/0x1b0 Sep 16 15:30:43 gaia kernel: [13655.172090] ? inode_switch_wbs_rcu_fn+0x130/0x130 Sep 16 15:30:43 gaia kernel: [13655.172097] ? wake_up_atomic_t+0x50/0x50 Sep 16 15:30:43 gaia kernel: [13655.172102] ? dentry_unlink_inode+0x27e/0x330 Sep 16 15:30:43 gaia kernel: [13655.172107] inode_wait_for_writeback+0x26/0x40 Sep 16 15:30:43 gaia kernel: [13655.172111] evict+0x20a/0x470 Sep 16 15:30:43 gaia kernel: [13655.172117] ? destroy_inode+0x100/0x100 Sep 16 15:30:43 gaia kernel: [13655.172122] iput+0x270/0x4e0 Sep 16 15:30:43 gaia kernel: [13655.172128] do_unlinkat+0x2d9/0x600 Sep 16 15:30:43 gaia kernel: [13655.172132] ? SyS_rmdir+0x20/0x20 Sep 16 15:30:43 gaia kernel: [13655.172137] ? digsig_verify+0xc40/0xc40 Sep 16 15:30:43 gaia kernel: [13655.172142] ? kasan_kmalloc+0xad/0xe0 Sep 16 15:30:43 gaia kernel: [13655.172147] ? getname_flags+0x78/0x310 Sep 16 15:30:43 gaia kernel: [13655.172152] ? getname_flags+0x118/0x310 Sep 16 15:30:43 gaia kernel: [13655.172156] ? SyS_unlinkat+0x50/0x50 Sep 16 15:30:43 gaia kernel: [13655.172161] SyS_unlink+0x1f/0x30 Sep 16 15:30:43 gaia kernel: [13655.172166] do_syscall_64+0x101/0x2c0 Sep 16 15:30:43 gaia kernel: [13655.172171] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Sep 16 15:30:43 gaia kernel: [13655.172175] RIP: 0033:0x7f00353b0a97 Sep 16 15:30:43 gaia kernel: [13655.172177] RSP: 002b:00007ffef1d306e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000057 Sep 16 15:30:43 gaia kernel: [13655.172183] RAX: ffffffffffffffda RBX: 00007ffef1d308d0 RCX: 00007f00353b0a97 Sep 16 15:30:43 gaia kernel: [13655.172186] RDX: 000055d112e67660 RSI: 00007f003567cb38 RDI: 00007ffef1d308d0 Sep 16 15:30:43 gaia kernel: [13655.172189] RBP: 000055d112e638a0 R08: 000055d112e638a0 R09: 0000000000000000 Sep 16 15:30:43 gaia kernel: [13655.172191] R10: 3030323b30303531 R11: 0000000000000246 R12: 00007ffef1d308d0 Sep 16 15:30:43 gaia kernel: [13655.172194] R13: 0000000000000001 R14: 000055d112e70380 R15: 0000000000000000 Sep 16 15:30:43 gaia kernel: [13655.180900] INFO: task sync:19647 blocked for more than 120 seconds. Sep 16 15:30:43 gaia kernel: [13655.181045] Tainted: P B OEL 4.15.18 #2 Sep 16 15:30:43 gaia kernel: [13655.181165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 16 15:30:43 gaia kernel: [13655.181388] sync D 0 19647 23033 0x00000000 Sep 16 15:30:43 gaia kernel: [13655.181394] Call Trace: Sep 16 15:30:43 gaia kernel: [13655.181402] schedule+0x469/0x1180 Sep 16 15:30:43 gaia kernel: [13655.181408] ? firmware_map_remove+0x187/0x187 Sep 16 15:30:43 gaia kernel: [13655.181414] schedule+0x54/0x130 Sep 16 15:30:43 gaia kernel: [13655.181419] wb_wait_for_completion+0xe1/0x130 Sep 16 15:30:43 gaia kernel: [13655.181424] ? block_dump_mark_inode_dirty+0x1c0/0x1c0 Sep 16 15:30:43 gaia kernel: [13655.181428] ? switch_to_asm+0x34/0x70 Sep 16 15:30:43 gaia kernel: [13655.181433] ? wait_woken+0x180/0x180 Sep 16 15:30:43 gaia kernel: [13655.181437] ? schedule+0x499/0x1180 Sep 16 15:30:43 gaia kernel: [13655.181442] sync_inodes_sb+0x157/0x1e0 Sep 16 15:30:43 gaia kernel: [13655.181447] ? try_to_writeback_inodes_sb+0x50/0x50 Sep 16 15:30:43 gaia kernel: [13655.181453] ? rcu_sched_qs.part.47+0x70/0x70 Sep 16 15:30:43 gaia kernel: [13655.181459] ? down_read+0x82/0x100 Sep 16 15:30:43 gaia kernel: [13655.181464] ? down_write_killable+0x100/0x100 Sep 16 15:30:43 gaia kernel: [13655.181467] ? _raw_spin_unlock_bh+0x1e/0x20 Sep 16 15:30:43 gaia kernel: [13655.181472] ? do_sync_work+0x100/0x100 Sep 16 15:30:43 gaia kernel: [13655.181477] sync_inodes_one_sb+0x29/0x40 Sep 16 15:30:43 gaia kernel: [13655.181483] iterate_supers+0x101/0x230 Sep 16 15:30:43 gaia kernel: [13655.181487] ? sync_filesystem+0x90/0x90 Sep 16 15:30:43 gaia kernel: [13655.181492] sys_sync+0x9d/0x140 Sep 16 15:30:43 gaia kernel: [13655.181497] ? sync_filesystem+0x90/0x90 Sep 16 15:30:43 gaia kernel: [13655.181502] ? mm_fault_error+0x250/0x250 Sep 16 15:30:43 gaia kernel: [13655.181508] do_syscall_64+0x101/0x2c0 Sep 16 15:30:43 gaia kernel: [13655.181514] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Sep 16 15:30:43 gaia kernel: [13655.181517] RIP: 0033:0x7fdedfbaf287 Sep 16 15:30:43 gaia kernel: [13655.181520] RSP: 002b:00007fff8c7d9ed8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2 Sep 16 15:30:43 gaia kernel: [13655.181524] RAX: ffffffffffffffda RBX: 00007fff8c7da008 RCX: 00007fdedfbaf287 Sep 16 15:30:43 gaia kernel: [13655.181527] RDX: 00007fdedfe86001 RSI: 0000000000000000 RDI: 00007fdedfc4d4b0 Sep 16 15:30:43 gaia kernel: [13655.181529] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 Sep 16 15:30:43 gaia kernel: [13655.181532] R10: 0000557d3a27d010 R11: 0000000000000246 R12: 0000000000000001 Sep 16 15:30:43 gaia kernel: [13655.181534] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Sep 16 15:30:43 gaia kernel: [13655.181540] INFO: task cleanup:19672 blocked for more than 120 seconds. Sep 16 15:30:43 gaia kernel: [13655.181690] Tainted: P B OEL 4.15.18 #2 Sep 16 15:30:43 gaia kernel: [13655.181816] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 16 15:30:43 gaia kernel: [13655.181987] cleanup D 0 19672 17838 0x00000120 Sep 16 15:30:43 gaia kernel: [13655.181991] Call Trace: Sep 16 15:30:43 gaia kernel: [13655.181997] schedule+0x469/0x1180 Sep 16 15:30:43 gaia kernel: [13655.182003] ? firmware_map_remove+0x187/0x187 Sep 16 15:30:43 gaia kernel: [13655.182009] ? aa_path_perm+0x1bb/0x270 Sep 16 15:30:43 gaia kernel: [13655.182014] schedule+0x54/0x130 Sep 16 15:30:43 gaia kernel: [13655.182040] cv_wait_common+0x1f6/0x240 [spl] Sep 16 15:30:43 gaia kernel: [13655.182066] ? cv_destroy+0x230/0x230 [spl] Sep 16 15:30:43 gaia kernel: [13655.182070] ? wait_woken+0x180/0x180 Sep 16 15:30:43 gaia kernel: [13655.182097] cv_wait+0x15/0x20 [spl] Sep 16 15:30:43 gaia kernel: [13655.182329] zil_commit.part.6+0x134/0x12d0 [zfs] Sep 16 15:30:43 gaia kernel: [13655.182356] ? tsd_hash_search+0xd4/0x170 [spl] Sep 16 15:30:43 gaia kernel: [13655.182381] ? tsd_set+0x4da/0xc60 [spl] Sep 16 15:30:43 gaia kernel: [13655.182387] ? rcu_sched_qs.part.47+0x70/0x70 Sep 16 15:30:43 gaia kernel: [13655.182606] ? zil_claim_log_record+0x90/0x90 [zfs] Sep 16 15:30:43 gaia kernel: [13655.182612] ? _cond_resched+0x1c/0x70 Sep 16 15:30:43 gaia kernel: [13655.182616] ? mutex_lock+0x12/0x40 Sep 16 15:30:43 gaia kernel: [13655.182826] ? rrw_enter_read_impl+0x220/0x2a0 [zfs] Sep 16 15:30:43 gaia kernel: [13655.183045] zil_commit+0x37/0x50 [zfs] Sep 16 15:30:43 gaia kernel: [13655.183266] zfs_fsync+0x125/0x230 [zfs] Sep 16 15:30:43 gaia kernel: [13655.183487] zpl_fsync+0x10c/0x230 [zfs] Sep 16 15:30:43 gaia kernel: [13655.183493] vfs_fsync_range+0xab/0x1d0 Sep 16 15:30:43 gaia kernel: [13655.183715] ? zpl_readpage+0x260/0x260 [zfs] Sep 16 15:30:43 gaia kernel: [13655.183720] do_fsync+0x3d/0x70 Sep 16 15:30:43 gaia kernel: [13655.183725] ? SyS_syncfs+0xa0/0xa0 Sep 16 15:30:43 gaia kernel: [13655.183730] SyS_fsync+0x10/0x20 Sep 16 15:30:43 gaia kernel: [13655.183734] do_syscall_64+0x101/0x2c0 Sep 16 15:30:43 gaia kernel: [13655.183740] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Sep 16 15:30:43 gaia kernel: [13655.183743] RIP: 0033:0x7f6981040b10 Sep 16 15:30:43 gaia kernel: [13655.183745] RSP: 002b:00007ffd411d2588 EFLAGS: 00000246 ORIG_RAX: 000000000000004a Sep 16 15:30:43 gaia kernel: [13655.183749] RAX: ffffffffffffffda RBX: 000055794cfc2550 RCX: 00007f6981040b10 Sep 16 15:30:43 gaia kernel: [13655.183752] RDX: 0000000000000000 RSI: 00000000000001c0 RDI: 000000000000000f Sep 16 15:30:43 gaia kernel: [13655.183754] RBP: 0000000000000001 R08: 0000000000000000 R09: 000000000000000f Sep 16 15:30:43 gaia kernel: [13655.183756] R10: 000000000000009a R11: 0000000000000246 R12: 000055794cfc10d0 Sep 16 15:30:43 gaia kernel: [13655.183758] R13: 000055794cfc1650 R14: 0000000000000001 R15: 000055794b53a3f0 Sep 16 15:32:43 gaia kernel: [13775.991171] INFO: task alertmanager:18365 blocked for more than 120 seconds. Sep 16 15:32:43 gaia kernel: [13775.991336] Tainted: P B OEL 4.15.18 #2 Sep 16 15:32:43 gaia kernel: [13775.991455] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 16 15:32:43 gaia kernel: [13775.991627] alertmanager D 0 18365 16770 0x00000120 Sep 16 15:32:43 gaia kernel: [13775.991632] Call Trace: Sep 16 15:32:43 gaia kernel: [13775.991641] schedule+0x469/0x1180 Sep 16 15:32:43 gaia kernel: [13775.991647] ? firmware_map_remove+0x187/0x187 Sep 16 15:32:43 gaia kernel: [13775.991653] schedule+0x54/0x130 Sep 16 15:32:43 gaia kernel: [13775.991680] cv_wait_common+0x1f6/0x240 [spl] Sep 16 15:32:43 gaia kernel: [13775.991704] ? cv_destroy+0x230/0x230 [spl] Sep 16 15:32:43 gaia kernel: [13775.991710] ? wait_woken+0x180/0x180 Sep 16 15:32:43 gaia kernel: [13775.991737] __cv_wait+0x15/0x20 [spl] Sep 16 15:32:43 gaia kernel: [13775.991964] zil_commit.part.6+0x134/0x12d0 [zfs] Sep 16 15:32:43 gaia kernel: [13775.991990] ? tsd_hash_search+0xd4/0x170 [spl] Sep 16 15:32:43 gaia kernel: [13775.992016] ? tsd_set+0x4da/0xc60 [spl] Sep 16 15:32:43 gaia kernel: [13775.992022] ? rcu_sched_qs.part.47+0x70/0x70 Sep 16 15:32:43 gaia kernel: [13775.992241] ? zil_claim_log_record+0x90/0x90 [zfs] Sep 16 15:32:43 gaia kernel: [13775.992246] ? _cond_resched+0x1c/0x70 Sep 16 15:32:43 gaia kernel: [13775.992251] ? mutex_lock+0x12/0x40 Sep 16 15:32:43 gaia kernel: [13775.992461] ? rrw_enter_read_impl+0x220/0x2a0 [zfs] Sep 16 15:32:43 gaia kernel: [13775.992681] zil_commit+0x37/0x50 [zfs] Sep 16 15:32:43 gaia kernel: [13775.992906] zfs_fsync+0x125/0x230 [zfs] Sep 16 15:32:43 gaia kernel: [13775.993122] zpl_fsync+0x10c/0x230 [zfs] Sep 16 15:32:43 gaia kernel: [13775.993130] vfs_fsync_range+0xab/0x1d0 Sep 16 15:32:43 gaia kernel: [13775.993374] ? zpl_readpage+0x260/0x260 [zfs] Sep 16 15:32:43 gaia kernel: [13775.993380] do_fsync+0x3d/0x70 Sep 16 15:32:43 gaia kernel: [13775.993385] ? SyS_syncfs+0xa0/0xa0 Sep 16 15:32:43 gaia kernel: [13775.993389] SyS_fsync+0x10/0x20 Sep 16 15:32:43 gaia kernel: [13775.993395] do_syscall_64+0x101/0x2c0 Sep 16 15:32:43 gaia kernel: [13775.993400] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Sep 16 15:32:43 gaia kernel: [13775.993403] RIP: 0033:0x4a4864 Sep 16 15:32:43 gaia kernel: [13775.993405] RSP: 002b:000000c420298cf8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a Sep 16 15:32:43 gaia kernel: [13775.993410] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004a4864 Sep 16 15:32:43 gaia kernel: [13775.993412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000007 Sep 16 15:32:43 gaia kernel: [13775.993414] RBP: 000000c420298d40 R08: 0000000000000000 R09: 0000000000000000 Sep 16 15:32:43 gaia kernel: [13775.993416] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 Sep 16 15:32:43 gaia kernel: [13775.993418] R13: 0000000000000018 R14: 0000000000000013 R15: 0000000000000001 Sep 16 15:32:43 gaia kernel: [13775.993423] INFO: task alertmanager:18366 blocked for more than 120 seconds. Sep 16 15:32:43 gaia kernel: [13775.993587] Tainted: P B OEL 4.15.18 #2 Sep 16 15:32:43 gaia kernel: [13775.993702] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 16 15:32:43 gaia kernel: [13775.993879] alertmanager D 0 18366 16770 0x00000120 Sep 16 15:32:43 gaia kernel: [13775.993884] Call Trace: Sep 16 15:32:43 gaia kernel: [13775.993890] schedule+0x469/0x1180 Sep 16 15:32:43 gaia kernel: [13775.993895] ? firmware_map_remove+0x187/0x187 Sep 16 15:32:43 gaia kernel: [13775.993901] schedule+0x54/0x130 Sep 16 15:32:43 gaia kernel: [13775.993926] cv_wait_common+0x1f6/0x240 [spl] Sep 16 15:32:43 gaia kernel: [13775.993952] ? cv_destroy+0x230/0x230 [spl] Sep 16 15:32:43 gaia kernel: [13775.993957] ? wait_woken+0x180/0x180 Sep 16 15:32:43 gaia kernel: [13775.993981] cv_wait+0x15/0x20 [spl] Sep 16 15:32:43 gaia kernel: [13775.994193] zil_commit.part.6+0x134/0x12d0 [zfs] Sep 16 15:32:43 gaia kernel: [13775.994218] ? tsd_hash_search+0xd4/0x170 [spl] Sep 16 15:32:43 gaia kernel: [13775.994254] ? tsd_set+0x4da/0xc60 [spl] Sep 16 15:32:43 gaia kernel: [13775.994260] ? rcu_sched_qs.part.47+0x70/0x70 Sep 16 15:32:43 gaia kernel: [13775.994477] ? zil_claim_log_record+0x90/0x90 [zfs] Sep 16 15:32:43 gaia kernel: [13775.994483] ? _cond_resched+0x1c/0x70 Sep 16 15:32:43 gaia kernel: [13775.994487] ? mutex_lock+0x12/0x40 Sep 16 15:32:43 gaia kernel: [13775.994701] ? rrw_enter_read_impl+0x220/0x2a0 [zfs] Sep 16 15:32:43 gaia kernel: [13775.994918] zil_commit+0x37/0x50 [zfs] Sep 16 15:32:43 gaia kernel: [13775.995136] zfs_fsync+0x125/0x230 [zfs] Sep 16 15:32:43 gaia kernel: [13775.995357] zpl_fsync+0x10c/0x230 [zfs] Sep 16 15:32:43 gaia kernel: [13775.995363] vfs_fsync_range+0xab/0x1d0 Sep 16 15:32:43 gaia kernel: [13775.995581] ? zpl_readpage+0x260/0x260 [zfs] Sep 16 15:32:43 gaia kernel: [13775.995587] do_fsync+0x3d/0x70 Sep 16 15:32:43 gaia kernel: [13775.995592] ? SyS_syncfs+0xa0/0xa0 Sep 16 15:32:43 gaia kernel: [13775.995596] SyS_fsync+0x10/0x20 Sep 16 15:32:43 gaia kernel: [13775.995601] do_syscall_64+0x101/0x2c0 Sep 16 15:32:43 gaia kernel: [13775.995606] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Sep 16 15:32:43 gaia kernel: [13775.995608] RIP: 0033:0x4a4864 Sep 16 15:32:43 gaia kernel: [13775.995610] RSP: 002b:000000c420299c80 EFLAGS: 00000246 ORIG_RAX: 000000000000004a Sep 16 15:32:43 gaia kernel: [13775.995615] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004a4864 Sep 16 15:32:43 gaia kernel: [13775.995617] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006 Sep 16 15:32:43 gaia kernel: [13775.995619] RBP: 000000c420299cc8 R08: 0000000000000000 R09: 0000000000000000 Sep 16 15:32:43 gaia kernel: [13775.995621] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 Sep 16 15:32:43 gaia kernel: [13775.995623] R13: 0000000000000018 R14: 0000000000000013 R15: 0000000000000001

rincebrain commented 6 years ago

@creshal You might need to set some of these sysctls to nonzero values to get it to panic if an NMI fires. There are caveats to each of them, obviously, since it's triggering panic on different NMI types, but you might find them useful if your goal is to get kdump to fire.

kernel.panic_on_io_nmi = 0
kernel.panic_on_unrecovered_nmi = 0
kernel.unknown_nmi_panic = 0

There are also ones for panic on oops and so on.

creshal commented 6 years ago

Machine details:

Software data
Distribution Name Debian
Distribution Version 9.5
Linux Kernel 4.9.110-3+deb9u4
Architecture x86_64
ZFS Version 0.7.9, 0.7.10
SPL Version 0.7.9, 0.7.10
Hardware data
Motherboard Dell PowerEdge R210 II base board
CPU Xeon E3-1230 V2
RAM 16GB

@rincebrain I've enabled all those settings, plus a few others:

kernel.hardlockup_panic = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_warn = 1
kernel.softlockup_panic = 1
kernel.unknown_nmi_panic = 1

However, kdump still doesn't trigger. (A simulated panic via sysrq does, so kdump itself should work.) Any other settings I could check?

rincebrain commented 6 years ago

@creshal if it were me, I'd try two things, probably.

1) If you have the serial console set up from the iDRAC, see if that has anything interesting to say when this locks up (and if you don't, try setting it up)

2) If for some reason option 1 can't be made to work, configure the actual serial port on it as a console and have another machine connected to it continuously.

3) There's also the option to try convincing your server not to blank the display if it's left idle and having a display connected continuously (or using the iDRAC upgrade feature of a virtual KVM), but there's no guarantee it'll log enough to help on screen at once, and since it's only an upgrade feature on the iDRAC, you might not have the option anyway.

Also, independent of either of those, what does ipmitool sel list on that machine show, since most forms of lockup log something useful to the BMC event log even if nothing else does.

creshal commented 6 years ago

SEL doesn't log anything about those lockups, and the BMC is unreachable during them. Screen goes blank during the crash too (or whatever this is); the only option I haven't yet tried is a physical serial interface, which might take some time to arrange (remote location).

ccoager commented 6 years ago

I can confirm the same issue on zfs 0.7.11.

creshal commented 6 years ago

Bloody hell. A few reboots later, and now I can't reproduce it any more, even on 0.7.10. :/ Ignore my noise.

ccoager commented 5 years ago

My issue seems to have gone away after upgrading to ZFS 0.7.12 and kernel 4.15.0-42-generic (Ubuntu). You can close this issue.