Open jimmyw opened 2 years ago
What are the non-default settings on the datasets on the pool?
Have you been using send/recv at all?
Did it work previously and only break like this after a recent upgrade, or is it a recent setup where you have no prior data one way or another?
[jimmy@terra ~]$ zpool get all | grep local
nas autoexpand on local
nas ashift 12 local
nas autotrim on local
nas feature@async_destroy enabled local
nas feature@empty_bpobj active local
nas feature@lz4_compress active local
nas feature@multi_vdev_crash_dump enabled local
nas feature@spacemap_histogram active local
nas feature@enabled_txg active local
nas feature@hole_birth active local
nas feature@extensible_dataset active local
nas feature@embedded_data active local
nas feature@bookmarks enabled local
nas feature@filesystem_limits enabled local
nas feature@large_blocks enabled local
nas feature@large_dnode active local
nas feature@sha512 enabled local
nas feature@skein enabled local
nas feature@edonr enabled local
nas feature@userobj_accounting active local
nas feature@encryption enabled local
nas feature@project_quota active local
nas feature@device_removal enabled local
nas feature@obsolete_counts enabled local
nas feature@zpool_checkpoint enabled local
nas feature@spacemap_v2 active local
nas feature@allocation_classes enabled local
nas feature@resilver_defer enabled local
nas feature@bookmark_v2 enabled local
nas feature@redaction_bookmarks enabled local
nas feature@redacted_datasets enabled local
nas feature@bookmark_written enabled local
nas feature@log_spacemap active local
nas feature@livelist active local
nas feature@device_rebuild enabled local
nas feature@zstd_compress active local
nas feature@draid enabled local
No send/recv at all
The system has been unstable for a while; I have been trying to switch kernels and versions without success. The system hangs and complains about a locked core. It worked without any issue a month ago, not sure what changed.
I hoped this stack trace would help in some way..
I see now that the pool is degraded; I had not been seeing issues until now. An SSD with the cache and log devices has failed. Pretty sure this happened after the crashes, but it is probably related. Will try to replace the drive and see if anything changes..
l2arc and slog (cache and log) devices being marked failed shouldn't be actively harmful, though if they were misbehaving while not marked failed who knows.
Specifically, zfs get all | grep -v default was more what I was curious about, though zpool get all is also useful information.
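For anyone reading along: zpool get reports pool-level properties while zfs get reports dataset properties, and filtering out lines whose SOURCE is "default" leaves only what was explicitly set, received, or inherited. A minimal sketch of that filter — the sample properties below are illustrative stand-ins, not output from this system:

```shell
# Stand-in for `zfs get all` output; the real command needs a ZFS pool.
sample_zfs_get() {
  printf '%s\n' \
    'NAME  PROPERTY     VALUE  SOURCE' \
    'nas   compression  zstd   local' \
    'nas   atime        on     default' \
    'nas   xattr        sa     local'
}

# Drop every line containing "default"; the header line survives too.
sample_zfs_get | grep -v default
```

This prints the header plus the two non-default properties (compression and xattr) and drops the atime line.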
Wondering if 5.17.5 or something shipped something exciting. May go try a build of it on a testbed.
Huh, 5.17.5 is running fine with my single NVMe vdev under a LUKS device using the default Arch 5.17.5 kernel .config. We'll see in a few days if it lets the magic smoke out.
Did you use any special settings for LUKS?
I have been experiencing the same issue on a Dell PowerEdge R610 server / Gentoo / Linux 5.17.5 / ZFS 2.1.4. I don't use LUKS. It started happening after upgrading the kernel to 5.17.5 and glibc to glibc-2.35-r4.
r610 /home/dell # zfs get all | grep -v default
NAME                        PROPERTY              VALUE                   SOURCE
r610-sas                    type                  filesystem              -
r610-sas                    creation              Sun Feb 21 19:05 2021   -
r610-sas                    used                  2.16T                   -
r610-sas                    available             1.79T                   -
r610-sas                    referenced            36.5K                   -
r610-sas                    compressratio         1.23x                   -
r610-sas                    mounted               no                      -
r610-sas                    recordsize            1M                      received
r610-sas                    mountpoint            none                    received
r610-sas                    compression           zstd                    received
r610-sas                    atime                 off                     received
r610-sas                    aclinherit            passthrough-x           received
r610-sas                    createtxg             1                       -
r610-sas                    xattr                 sa                      received
r610-sas                    version               5                       -
r610-sas                    utf8only              off                     -
r610-sas                    normalization         none                    -
r610-sas                    casesensitivity       sensitive               -
r610-sas                    guid                  11187035221725804445    -
r610-sas                    usedbysnapshots       0B                      -
r610-sas                    usedbydataset         36.5K                   -
r610-sas                    usedbychildren        2.16T                   -
r610-sas                    usedbyrefreservation  0B                      -
r610-sas                    objsetid              54                      -
r610-sas                    dnodesize             auto                    received
r610-sas                    refcompressratio      1.00x                   -
r610-sas                    written               36.5K                   -
r610-sas                    logicalused           2.68T                   -
r610-sas                    logicalreferenced     12K                     -
r610-sas                    acltype               posix                   received
r610-sas/container/windows  type                  filesystem              -
r610-sas/container/windows  creation              Sat Oct 23  8:56 2021   -
r610-sas/container/windows  used                  28.0G                   -
r610-sas/container/windows  available             1.79T                   -
r610-sas/container/windows  referenced            28.0G                   -
r610-sas/container/windows  compressratio         1.46x                   -
r610-sas/container/windows  mounted               yes                     -
r610-sas/container/windows  recordsize            128K                    local
r610-sas/container/windows  mountpoint            /srv/container/windows  inherited from r610-sas/container
r610-sas/container/windows  compression           zstd                    inherited from r610-sas
r610-sas/container/windows  atime                 off                     inherited from r610-sas
r610-sas/container/windows  aclinherit            passthrough-x           inherited from r610-sas
r610-sas/container/windows  createtxg             3597135                 -
r610-sas/container/windows  xattr                 sa                      inherited from r610-sas
r610-sas/container/windows  version               5                       -
r610-sas/container/windows  utf8only              off                     -
r610-sas/container/windows  normalization         none                    -
r610-sas/container/windows  casesensitivity       sensitive               -
r610-sas/container/windows  guid                  6305482984620260152     -
r610-sas/container/windows  usedbysnapshots       0B                      -
r610-sas/container/windows  usedbydataset         28.0G                   -
r610-sas/container/windows  usedbychildren        0B                      -
r610-sas/container/windows  usedbyrefreservation  0B                      -
r610-sas/container/windows  objsetid              14492                   -
r610-sas/container/windows  dnodesize             auto                    inherited from r610-sas
r610-sas/container/windows  refcompressratio      1.46x                   -
r610-sas/container/windows  written               28.0G                   -
r610-sas/container/windows  logicalused           40.9G                   -
r610-sas/container/windows  logicalreferenced     40.9G                   -
r610-sas/container/windows  acltype               posix                   inherited from r610-sas
r610-sas/system             type                  filesystem              -
r610-sas/system             creation              Sun Feb 21 19:40 2021   -
r610-sas/system             used                  1.48G                   -
r610-sas/system             available             1.79T                   -
r610-sas/system             referenced            1.48G                   -
r610-sas/system             compressratio         4.05x                   -
r610-sas/system             mounted               yes                     -
r610-sas/system             recordsize            1M                      inherited from r610-sas
r610-sas/system             mountpoint            none                    inherited from r610-sas
r610-sas/system             compression           zstd                    inherited from r610-sas
r610-sas/system             atime                 off                     inherited from r610-sas
r610-sas/system             aclinherit            passthrough-x           inherited from r610-sas
r610-sas/system             createtxg             222                     -
r610-sas/system             xattr                 sa                      inherited from r610-sas
r610-sas/system             version               5                       -
r610-sas/system             utf8only              off                     -
r610-sas/system             normalization         none                    -
r610-sas/system             casesensitivity       sensitive               -
r610-sas/system             guid                  13395046143574701671    -
r610-sas/system             usedbysnapshots       0B                      -
r610-sas/system             usedbydataset         1.48G                   -
r610-sas/system             usedbychildren        0B                      -
r610-sas/system             usedbyrefreservation  0B                      -
r610-sas/system             objsetid              3348                    -
r610-sas/system             dnodesize             auto                    inherited from r610-sas
r610-sas/system             refcompressratio      4.05x                   -
r610-sas/system             written               1.48G                   -
r610-sas/system             logicalused           5.57G                   -
r610-sas/system             logicalreferenced     5.57G                   -
r610-sas/system             acltype               posix                   inherited from r610-sas
...and this is the change between -r3 and -r4 of glibc in Gentoo:
diff -Naur glibc-2.35-r3.ebuild glibc-2.35-r4.ebuild
- # We take care of patching our binutils to use both hash styles,
- # and many people like to force gnu hash style only, so disable
- # this overriding check. #347761
- export libc_cv_hashstyle=no
@mapmot, can you please share one or more of the BUG: messages and stacktraces from your logs when this happens?
@rincebrain, here is the log. What triggered it was a git status command, after the server had been mostly idle. The time before that, it happened after an emerge command.
[735471.464489] BUG: kernel NULL pointer dereference, address: 000000000000000b
[735471.464581] #PF: supervisor write access in kernel mode
[735471.464637] #PF: error_code(0x0002) - not-present page
[735471.464693] PGD 0 P4D 0
[735471.464726] Oops: 0002 [#1] SMP NOPTI
[735471.464771] CPU: 23 PID: 108277 Comm: dp_sync_taskq Tainted: P IO 5.17.5-gentoo #2
[735471.464866] Hardware name: Dell Inc. PowerEdge R610/0P8FRD, BIOS 6.6.0 05/22/2018
[735471.464946] RIP: 0010:dbuf_sync_list+0x67/0x250 [zfs]
[735471.465076] Code: 0e e8 5d fe ff ff 49 8b 47 10 48 39 c5 74 5e 49 8b 47 10 49 89 c2 4d 2b 57 08 74 51 49 83 7a 18 00 75 4a 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 4c 89 30 4c 89 68 08 49 8b 42 20 48 85 c0 74
[735471.465238] RSP: 0018:ffff8c04a86d3bd0 EFLAGS: 00010246
[735471.465291] RAX: ffff8c04bf173200 RBX: ffff8c039c054000 RCX: 0000000000000003
[735471.465355] RDX: ffff8c04bf171f10 RSI: 0000000000000286 RDI: ffff8c034b7f9800
[735471.465422] RBP: ffff8c04bf171f10 R08: 0000000000000000 R09: ffff8c04a86d3ae0
[735471.465490] R10: ffff8c04bf173200 R11: dead000000000100 R12: 0000000000000000
[735471.465555] R13: dead000000000122 R14: dead000000000100 R15: ffff8c04bf171f00
[735471.465622] FS: 0000000000000000(0000) GS:ffff8c162fdc0000(0000) knlGS:0000000000000000
[735471.465697] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[735471.465754] CR2: 000000000000000b CR3: 000000005380a006 CR4: 00000000000226e0
[735471.465823] Call Trace:
[735471.465857]
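One detail worth noting in this register dump (an observation, not a diagnosis): R11 and R14 hold dead000000000100 and R13 holds dead000000000122, which are the kernel's LIST_POISON1 and LIST_POISON2 sentinels from include/linux/poison.h — the values list_del() writes into a removed entry's ->next and ->prev. Seeing them live in registers inside dbuf_sync_list is consistent with the code walking a list entry that had already been deleted. A quick way to spot those sentinels in a dump (register lines copied from the oops above):

```shell
# LIST_POISON1 = 0xdead000000000100 (stored in ->next by list_del())
# LIST_POISON2 = 0xdead000000000122 (stored in ->prev by list_del())
# assuming x86_64's default CONFIG_ILLEGAL_POINTER_VALUE of 0xdead000000000000.
regs='R10: ffff8c04bf173200 R11: dead000000000100 R12: 0000000000000000
R13: dead000000000122 R14: dead000000000100 R15: ffff8c04bf171f00'

# Print any register holding a list-poison value.
printf '%s\n' "$regs" | grep -oE 'R[A-Z0-9]+: dead000000000(100|122)'
```

On these lines that flags R11, R13, and R14.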
@rincebrain, another one:
May 02 10:57:39 r610.maze kernel: BUG: kernel NULL pointer dereference, address: 000000000000000b
May 02 10:57:39 r610.maze kernel: #PF: supervisor write access in kernel mode
May 02 10:57:39 r610.maze kernel: #PF: error_code(0x0002) - not-present page
May 02 10:57:39 r610.maze kernel: PGD 0 P4D 0
May 02 10:57:39 r610.maze kernel: Oops: 0002 [#1] SMP NOPTI
May 02 10:57:39 r610.maze kernel: CPU: 20 PID: 1214 Comm: dp_sync_taskq Tainted: P IO 5.17.4-gentoo #1
May 02 10:57:39 r610.maze kernel: Hardware name: Dell Inc. PowerEdge R610/0P8FRD, BIOS 6.6.0 05/22/2018
May 02 10:57:39 r610.maze kernel: RIP: 0010:dbuf_sync_list+0x67/0x250 [zfs]
May 02 10:57:39 r610.maze kernel: Code: 0e e8 5d fe ff ff 49 8b 47 10 48 39 c5 74 5e 49 8b 47 10 49 89 c2 4d 2b 57 08 74 51 49 83 7a 18 00 75 4a 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 4c 89 30 4c 89 68 08 >
May 02 10:57:39 r610.maze kernel: RSP: 0018:ffffa1508d62bbd0 EFLAGS: 00010246
May 02 10:57:39 r610.maze kernel: RAX: ffffa15a93f87800 RBX: ffffa15088cb4180 RCX: 0000000000000003
May 02 10:57:39 r610.maze kernel: RDX: ffffa15bae3a2b10 RSI: ffffffffffffffff RDI: ffffa15087423308
May 02 10:57:39 r610.maze kernel: RBP: ffffa15bae3a2b10 R08: 0000000000000000 R09: ffffffffc0f7e900
May 02 10:57:39 r610.maze kernel: R10: ffffa15a93f87800 R11: 0001514c0001511a R12: 0000000000000000
May 02 10:57:39 r610.maze kernel: R13: dead000000000122 R14: dead000000000100 R15: ffffa15bae3a2b00
May 02 10:57:39 r610.maze kernel: FS: 0000000000000000(0000) GS:ffffa1636fd00000(0000) knlGS:0000000000000000
May 02 10:57:39 r610.maze kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 02 10:57:39 r610.maze kernel: CR2: 000000000000000b CR3: 00000006eba0a003 CR4: 00000000000226e0
May 02 10:57:39 r610.maze kernel: Call Trace:
May 02 10:57:39 r610.maze kernel:
@rincebrain, the last one. This is while trying to gracefully reboot (systemctl reboot). Nothing graceful happened; the watchdog timer reset the system :(
May 02 12:02:42 r610.maze kernel: BUG: kernel NULL pointer dereference, address: 000000000000000b
May 02 12:02:42 r610.maze kernel: #PF: supervisor write access in kernel mode
May 02 12:02:42 r610.maze kernel: #PF: error_code(0x0002) - not-present page
May 02 12:02:42 r610.maze kernel: PGD 0 P4D 0
May 02 12:02:42 r610.maze kernel: Oops: 0002 [#1] SMP NOPTI
May 02 12:02:42 r610.maze kernel: CPU: 8 PID: 4352 Comm: txg_sync Tainted: P IO 5.17.5-gentoo #2
May 02 12:02:42 r610.maze kernel: Hardware name: Dell Inc. PowerEdge R610/0P8FRD, BIOS 6.6.0 05/22/2018
May 02 12:02:42 r610.maze kernel: RIP: 0010:dbuf_sync_list+0x67/0x250 [zfs]
May 02 12:02:42 r610.maze kernel: Code: 0e e8 5d fe ff ff 49 8b 47 10 48 39 c5 74 5e 49 8b 47 10 49 89 c2 4d 2b 57 08 74 51 49 83 7a 18 00 75 4a 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 4c 89 30 4c 89 68 08 >
May 02 12:02:42 r610.maze kernel: RSP: 0018:ffffa1daab2af938 EFLAGS: 00010246
May 02 12:02:42 r610.maze kernel: RAX: ffffa1da8346f400 RBX: ffffa1da0726f680 RCX: 0000000000000003
May 02 12:02:42 r610.maze kernel: RDX: ffffa1da8346c710 RSI: 0000000000000001 RDI: ffffa1da8346c700
May 02 12:02:42 r610.maze kernel: RBP: ffffa1da8346c710 R08: 0000000000000000 R09: ffffa1dab14ceb40
May 02 12:02:42 r610.maze kernel: R10: ffffa1da8346f400 R11: 0000000000000007 R12: 0000000000000001
May 02 12:02:42 r610.maze kernel: R13: dead000000000122 R14: dead000000000100 R15: ffffa1da8346c700
May 02 12:02:42 r610.maze kernel: FS: 0000000000000000(0000) GS:ffffa1ecefa00000(0000) knlGS:0000000000000000
May 02 12:02:42 r610.maze kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 02 12:02:42 r610.maze kernel: CR2: 000000000000000b CR3: 00000001fa20a003 CR4: 00000000000226e0
May 02 12:02:42 r610.maze kernel: Call Trace:
May 02 12:02:42 r610.maze kernel:
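Comparing the three dumps: all of them fault on a supervisor write to the same address (CR2 = 000000000000000b) from the same RIP, dbuf_sync_list+0x67, which points at one bug rather than flaky hardware. A throwaway filter along these lines makes that comparison easy across logs (the sample lines are copied from the traces above):

```shell
# Pair each fault address with the RIP symbol that follows it,
# handling both raw-dmesg and journalctl-style prefixes.
log='[735471.464489] BUG: kernel NULL pointer dereference, address: 000000000000000b
[735471.464946] RIP: 0010:dbuf_sync_list+0x67/0x250 [zfs]
May 02 12:02:42 r610.maze kernel: BUG: kernel NULL pointer dereference, address: 000000000000000b
May 02 12:02:42 r610.maze kernel: RIP: 0010:dbuf_sync_list+0x67/0x250 [zfs]'

printf '%s\n' "$log" | awk '
  /BUG: kernel NULL pointer dereference/ { addr = $NF }
  /RIP: 0010:/ { sub(/.*RIP: 0010:/, ""); print addr, $1 }'
```

For both sample oopses this prints "000000000000000b dbuf_sync_list+0x67/0x250", confirming they are the same crash site.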
What's the storage backing this pool in what configuration?
I have some recent suspicions about one or two places that might be doing things incorrectly that this might align with.
6 x 1 TB SAS drives in raidz1 on Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 in IT/HBA mode.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Server crashes during almost idle operation
Describe how to reproduce the problem
Wait a few days.