ZFS is trying to deactivate a feature that is not active and it very much does not like that.
Do any variations of the send command without -L, -e, or -c avoid the panic? Knowing if turning any of them off makes this work would help in debugging this.
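Something like the following loop would cover the combinations. This is only a sketch with placeholder dataset and snapshot names, assuming you have a scratch target to receive into:

# drop one flag at a time to see which one triggers the panic
for flags in "-RLec" "-RLe" "-RL" "-R"; do
    echo "testing: zfs send $flags"
    zfs send $flags -I tank/src@old tank/src@new | zfs receive -s -F scratch/dst
done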
The error message talks about the large_blocks feature, although it is only enabled and not active on both the sender and the receiver. The -L flag is probably responsible for triggering this. Could it be related to https://github.com/openzfs/zfs/pull/13782 ?
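For reference, the feature state on both pools can be checked like this (pool names are placeholders):

# "enabled" means the feature may be used; "active" means on-disk data already depends on it
zpool get feature@large_blocks sendpool recvpool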
I will not be able to try anything until tomorrow. The machine is at the office, it is currently hitting this issue, and I am unable to ssh in.
I can confirm that this also happens to me: zfs core dump during send/receive.
It happens here if the sender has recordsize 128K and the receiver has recordsize 1M:
# zfs get recordsize zHome/home zstore/data/BACKUP/rakete_home/home
NAME PROPERTY VALUE SOURCE
zHome/home recordsize 128K inherited from zHome
zstore/data/BACKUP/rakete_home/home recordsize 1M inherited from zstore
syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home
NEWEST SNAPSHOT: 2022-12-03--23:23
Sending incremental zHome/home@2022-11-29--08:05 ... 2022-12-03--23:23 (~ 34 KB):
0,00 B 0:00:00 [0,00 B/s] [> ] 0%
cannot receive: failed to read from stream
CRITICAL ERROR: zfs send -L -I 'zHome/home'@'2022-11-29--08:05' 'zHome/home'@'2022-12-03--23:23' | mbuffer -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 35264 | zfs receive -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817.
I tried various permutations of zfs send flags and here are the results:
zfs send -RLec -I snapshot_name : crashes
zfs send -Rec -I snapshot_name : crashes
zfs send -Rc -I snapshot_name : crashes
zfs send -R -I snapshot_name : works
I use zstd-1 compression on all my datasets; compression is inherited from the top dataset. I use the default recordsize for all datasets except for one that uses 8K.
Does zfs send -RLe -I snapshot_name work? What I am really wondering is which flags (just one, I hope) cause crashes, to narrow the scope that needs to be debugged.
I have left the office now and won't be there for a while. Most of us work from home. Next time someone goes to the office I can try it for you.
Just got the same panic using only zfs send -R -I snapshot_name, so trying send -RLe will not shed any light on the issue. Here is the full stack trace:
[Sun Dec 4 16:52:10 2022] VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
[Sun Dec 4 16:52:10 2022] PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
[Sun Dec 4 16:52:10 2022] Showing stack for process 1627
[Sun Dec 4 16:52:10 2022] CPU: 7 PID: 1627 Comm: txg_sync Tainted: P OE 6.0.10-200.fc36.x86_64 #1
[Sun Dec 4 16:52:10 2022] Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD3H/Z87X-UD3H-CF, BIOS 10b 06/12/2014
[Sun Dec 4 16:52:10 2022] Call Trace:
[Sun Dec 4 16:52:10 2022] <TASK>
[Sun Dec 4 16:52:10 2022] dump_stack_lvl+0x44/0x5c
[Sun Dec 4 16:52:10 2022] spl_panic+0xf0/0x108 [spl]
[Sun Dec 4 16:52:10 2022] dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs]
[Sun Dec 4 16:52:10 2022] dsl_dataset_clone_swap_sync_impl+0x88e/0xa70 [zfs]
[Sun Dec 4 16:52:10 2022] ? _raw_spin_unlock+0x15/0x30
[Sun Dec 4 16:52:10 2022] ? dsl_dataset_hold_flags+0xb0/0x230 [zfs]
[Sun Dec 4 16:52:10 2022] dmu_recv_end_sync+0x180/0x560 [zfs]
[Sun Dec 4 16:52:10 2022] ? preempt_count_add+0x6a/0xa0
[Sun Dec 4 16:52:10 2022] dsl_sync_task_sync+0xa5/0xf0 [zfs]
[Sun Dec 4 16:52:10 2022] dsl_pool_sync+0x3d3/0x4e0 [zfs]
[Sun Dec 4 16:52:10 2022] spa_sync+0x555/0xf60 [zfs]
[Sun Dec 4 16:52:10 2022] ? _raw_spin_unlock+0x15/0x30
[Sun Dec 4 16:52:10 2022] ? spa_txg_history_init_io+0xf3/0x110 [zfs]
[Sun Dec 4 16:52:10 2022] txg_sync_thread+0x227/0x3e0 [zfs]
[Sun Dec 4 16:52:10 2022] ? txg_fini+0x260/0x260 [zfs]
[Sun Dec 4 16:52:10 2022] ? __thread_exit+0x20/0x20 [spl]
[Sun Dec 4 16:52:10 2022] thread_generic_wrapper+0x57/0x70 [spl]
[Sun Dec 4 16:52:10 2022] kthread+0xe6/0x110
[Sun Dec 4 16:52:10 2022] ? kthread_complete_and_exit+0x20/0x20
[Sun Dec 4 16:52:10 2022] ret_from_fork+0x1f/0x30
[Sun Dec 4 16:52:10 2022] </TASK>
I was afraid that might happen. Thanks for confirming it.
On Sunday, 2022-12-04 at 05:38 -0800, George Amanakis wrote:
As far as I can see #13782 is not included in 2.1.7. Is it possible to retry with current master or cherry-pick that commit on 2.1.7 and try again?
This patch is already part of zfs 2.1.7. See the bottom of this page: https://github.com/openzfs/zfs/compare/zfs-2.1.6...zfs-2.1.7
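You can also check which release tags contain the commit directly with git:

# lists every tag whose history includes the commit
git tag --contains c8d2ab05e1e6bf21185723fabdebbe5bb3374381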
Is this reproducible on 2.1.6?
No, I've been running 2.1.6 with zfs send -RLec since its release. The script has been running unchanged for about 5 years and I have never experienced this issue. It happened instantly once I updated to zfs 2.1.7.
Could you try reverting c8d2ab05e1e6bf21185723fabdebbe5bb3374381 to see if that makes it stop? I do not see anything else that changed in 2.1.7 that could possibly be related to this.
Thanks, I will try reverting https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381. It might be a couple of days before I can try it.
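Roughly this, assuming a plain source build (distro packaging steps may differ):

git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.1.7
git revert c8d2ab05e1e6bf21185723fabdebbe5bb3374381
sh autogen.sh && ./configure && make -j$(nproc)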
I've reverted https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381 and tried the zfs send again. Unfortunately my pool seems to be in a weird state and fails with lots of errors about snapshots not existing. After the failure, on the recv end the received dataset has been renamed to xxxrecv-xxxxx-1. I tried renaming the dataset back to raid3/raid2, but each time I did the zfs send it was renamed to raid3/raid2recv-xxxxx-1 again.
I am running a pool scrub and will try again once it has finished.
The sending pool is raid2 and it is received into raid3/raid2 on the receiving end.
All the snapshots mentioned in the errors below exist on the receiving end; they've just been renamed to raid3/raid2recv-65018-1.
Here is an extract of some of the errors.
cannot open 'raid3/raid2/git-migration@20221204-10.53.25-Sun-backup': dataset does not exist
cannot open 'raid3/raid2/git-migration': dataset does not exist
cannot receive: specified fs (raid3/raid2) does not exist
attempting destroy raid3/raid2
failed - trying rename raid3/raid2 to raid3/raid2recv-65018-1
local fs raid3/raid2/home does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/git does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/databases does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/products does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
I did a full test again, but reverting the patch did not help completely. From what I found, reverting this patch prevents a complete freeze of the pool, but I still see coredumps during send/receive.
This is the patch I am reverting:
From 1af2632e439156f5edd404b35e5c9e444b607733 Mon Sep 17 00:00:00 2001
From: George Amanakis <gamanakis@gmail.com>
Date: Tue, 30 Aug 2022 22:15:56 +0200
Subject: [PATCH] Fix setting the large_block feature after receiving a
snapshot
We are not allowed to dirty a filesystem when done receiving
a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
not be set on that filesystem since the filesystem is not on
dp_dirty_datasets, and a subsequent encrypted raw send will fail.
Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature
needs to be activated and do so if appropriate.
Signed-off-by: George Amanakis <gamanakis@gmail.com>
---
module/zfs/dsl_dataset.c | 15 ++++
tests/runfiles/common.run | 2 +-
tests/zfs-tests/tests/Makefile.am | 1 +
.../rsend/send_raw_large_blocks.ksh | 78 +++++++++++++++++++
4 files changed, 95 insertions(+), 1 deletion(-)
create mode 100755 tests/zfs-tests/tests/functional/rsend/send_raw_large_blocks.ksh
diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index c7577fc584a..4da4effca60 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1760,6 +1760,21 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
}
}
+ /*
+ * We are not allowed to dirty a filesystem when done receiving
+ * a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
+ * not be set and a subsequent encrypted raw send will fail. Hence
+ * activate this feature if needed here.
+ */
+ for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
+ if (zfeature_active(f, ds->ds_feature_activation[f]) &&
+ !(zfeature_active(f, ds->ds_feature[f]))) {
+ dsl_dataset_activate_feature(dsobj, f,
+ ds->ds_feature_activation[f], tx);
+ ds->ds_feature[f] = ds->ds_feature_activation[f];
+ }
+ }
+
ASSERT3U(ds->ds_prev != 0, ==,
dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);
if (ds->ds_prev) {
With stock zfs 2.1.7 (incl. this patch) my pool completely freezes when doing send / receive:
syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home
NEWEST SNAPSHOT: 2022-12-09--10:47
Sending incremental zHome/home@2022-12-09--10:32 ... 2022-12-09--10:47 (~ 4 KB):
0,00 B 0:00:00 [0,00 B/s] [> ] 0%
cannot receive: failed to read from stream
CRITICAL ERROR: zfs send -L -I 'zHome/home'@'2022-12-09--10:32' 'zHome/home'@'2022-12-09--10:47' | mbuffer -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 4720 | zfs receive -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817.
gives the following in the log:
Dez 09 10:35:47 rakete kernel: INFO: task txg_sync:3044 blocked for more than 122 seconds.
Dez 09 10:35:47 rakete kernel: Tainted: P OE 6.0.11-zen1-1-zen #1
Dez 09 10:35:47 rakete kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dez 09 10:35:47 rakete kernel: task:txg_sync state:D stack: 0 pid: 3044 ppid: 2 flags:0x00004000
Dez 09 10:35:47 rakete kernel: Call Trace:
Dez 09 10:35:47 rakete kernel: <TASK>
Dez 09 10:35:47 rakete kernel: __schedule+0xb43/0x1350
Dez 09 10:35:47 rakete kernel: schedule+0x5e/0xd0
Dez 09 10:35:47 rakete kernel: spl_panic+0x10a/0x10c [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_dataset_clone_swap_sync_impl+0x90b/0xe30 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_dataset_rollback_sync+0x109/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_sync_task_sync+0xac/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: dsl_pool_sync+0x40d/0x5c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: spa_sync+0x56c/0xf90 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: ? spa_txg_history_init_io+0x193/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: txg_sync_thread+0x22b/0x3f0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: ? txg_wait_open+0xf0/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel: ? __thread_exit+0x20/0x20 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel: thread_generic_wrapper+0x5e/0x70 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel: kthread+0xde/0x110
Dez 09 10:35:47 rakete kernel: ? kthread_complete_and_exit+0x20/0x20
Dez 09 10:35:47 rakete kernel: ret_from_fork+0x22/0x30
Dez 09 10:35:47 rakete kernel: </TASK>
I see multiple of those blocked task messages for various tasks. The pool is then frozen. Reboot takes ages and I had to force reboot with REISUB.
When I reverted the patch I did not experience freezes, but the same error and coredumps. Unfortunately I cannot provide a coredump because my system says: Resource limits disable core dumping for process 82335 (zfs). And I do not know why that is.
Anyway, it is also interesting that this error and coredump do not happen every time. Approximately 1 out of 4 send/receive processes executes fine.
And just to reassure you: this is a new bug in 2.1.7. I have been using this backup process with syncoid for many years without any problem. Reverting back to zfs 2.1.6 gives me a stable system again.
Could you do a git bisect with good being 2.1.6 and bad 2.1.7?
I could do that but I need some help. I am not a programmer or git expert, so I need guidance on this.
I have a couple of questions to start with:
1) Do I need to do the bisect for zfs-dkms and zfs-utils or just for zfs-utils? (I am on EndeavourOS)
2) How could I incorporate the git bisect into the PKGBUILD process so that I get a new package for each bisect step? Or should I just compile the source separately and only use the resulting binary?
3) What would be the last good commit to use?
The straightforward way would be to compile manually:
1) git bisect start
2) git bisect good zfs-2.1.6
3) git bisect bad zfs-2.1.7
4) compile
5) load the modules with "./scripts/zfs.sh -v" and set the PATH to the main OpenZFS source directory
6) test, then enter "git bisect good" or "git bisect bad" accordingly
7) go to step 4
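A concrete session would look roughly like this (adjust to your environment):

git bisect start
git bisect good zfs-2.1.6
git bisect bad zfs-2.1.7
# then, for each commit git checks out:
sh autogen.sh && ./configure && make -j$(nproc)
sudo ./scripts/zfs.sh -v        # load the freshly built modules
# reproduce the send/receive, then mark the result:
git bisect good                 # or: git bisect bad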
Step 5 does not work for me. I have the module running and I have active pools. How do I reload the zfs module in a running live environment? And what do you mean by "set the PATH to the main OpenZFS source directory"?
./scripts/zfs.sh -vu should unload the modules. Make sure you have exported all pools.
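For example, from the OpenZFS source tree:

sudo zpool export -a        # export every pool first
sudo ./scripts/zfs.sh -vu   # unload the zfs/spl modules
sudo ./scripts/zfs.sh -v    # load the freshly built ones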
I have done some comprehensive testing after reverting https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381 and can report that I am no longer getting the zfs recv panic in dsl_dataset_deactivate_feature_impl. I tried lots of incremental sends using the full zfs send -RLec and also a full send. That definitely fixed it. @mabod your coredump issue is probably related to another commit.
@peterska Would you be willing to try a possible fix?
@gamanakis Let me review the patch and if it looks ok to me I will test it. Will not be able to test it until Monday Sydney time.
./scripts/zfs.sh -vu should unload the modules. Make sure you have exported all pools.
I exported all pools. I rebooted. I verified with the 'zpool status' command that no pools are imported. But when I execute "./scripts/zfs.sh -vu" it fails, saying "module is still in use". Why is that?
@mabod your coredump issue is probably related to another commit.
Yes, it must be. I checked with the reverted patch again. The coredump does exist, no doubt. Maybe it has to do with encryption. The datasets are encrypted.
@peterska I am thinking that in your case the code inserted in c8d2ab0 runs in open context, which shouldn't be the case. I am trying to reproduce this on 2.1.7 with no success. This is my latest iteration:
log_must zpool create -f pool_lb $TESTDIR/vdev_a

log_must zfs create pool_lb/fs
log_must dd if=/dev/urandom of=$file bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap1

log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap2
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap3
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap4
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap5
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap6
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap7
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap8
log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
log_must zfs snapshot pool_lb/fs@snap9

log_must eval "zfs send -RLec pool_lb/fs@snap5 > $backup"
log_must eval "zfs recv pool_lb/testfs5 < $backup"

log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
log_must zfs snapshot pool_lb/testfs5@snap10
log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
log_must zfs snapshot pool_lb/testfs5@snap20
log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
log_must zfs snapshot pool_lb/testfs5@snap30
log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
log_must zfs snapshot pool_lb/testfs5@snap40
log_must eval "zfs send -RLec -I @snap5 pool_lb/fs@snap9 > $backup"
log_must eval "zfs recv -F pool_lb/testfs5 < $backup"

log_must eval "zfs send -RLec pool_lb/testfs5@snap9 > $raw_backup"
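If you want to try it yourself, a single case like this can be run through the ZFS test suite runner from the source tree (path is illustrative and the invocation may vary by version):

./scripts/zfs-tests.sh -t tests/functional/rsend/send_raw_large_blocks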
The patch I was thinking about is the following:
diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index 71d876cae..da49ca94b 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1753,13 +1753,17 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
* not be set and a subsequent encrypted raw send will fail. Hence
* activate this feature if needed here.
*/
- for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
- if (zfeature_active(f, ds->ds_feature_activation[f]) &&
- !(zfeature_active(f, ds->ds_feature[f]))) {
- dsl_dataset_activate_feature(dsobj, f,
- ds->ds_feature_activation[f], tx);
- ds->ds_feature[f] = ds->ds_feature_activation[f];
+ if (dmu_tx_is_syncing(tx)) {
+ for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
+ if (zfeature_active(f, ds->ds_feature_activation[f]) &&
+ !(zfeature_active(f, ds->ds_feature[f]))) {
+ dsl_dataset_activate_feature(dsobj, f,
+ ds->ds_feature_activation[f], tx);
+ ds->ds_feature[f] = ds->ds_feature_activation[f];
+ }
}
+ } else {
+ cmn_err(CE_NOTE, "not syncing context!");
}
ASSERT3U(ds->ds_prev != 0, ==,
This runs the code inserted in c8d2ab0 only in syncing context; otherwise it dumps a note in dmesg. It would help if you could come up with a script reproducing the bug on a pool I have access to, or if you could run the patch above.
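If the open-context path is being hit, you should then see it right away:

dmesg | grep "not syncing context"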
@gamanakis the patch looks ok to me. I will try it later today and let you know the results.
To help you reproduce the panic, here are the non-default settings of my pool and dataset:
pool:
NAME PROPERTY VALUE SOURCE
raid2 ashift 12 local
raid2 autotrim on local
raid2 feature@async_destroy enabled local
raid2 feature@empty_bpobj active local
raid2 feature@lz4_compress active local
raid2 feature@multi_vdev_crash_dump enabled local
raid2 feature@spacemap_histogram active local
raid2 feature@enabled_txg active local
raid2 feature@hole_birth active local
raid2 feature@extensible_dataset active local
raid2 feature@embedded_data active local
raid2 feature@bookmarks enabled local
raid2 feature@filesystem_limits enabled local
raid2 feature@large_blocks enabled local
raid2 feature@large_dnode enabled local
raid2 feature@sha512 enabled local
raid2 feature@skein enabled local
raid2 feature@edonr enabled local
raid2 feature@userobj_accounting active local
raid2 feature@encryption enabled local
raid2 feature@project_quota active local
raid2 feature@device_removal enabled local
raid2 feature@obsolete_counts enabled local
raid2 feature@zpool_checkpoint enabled local
raid2 feature@spacemap_v2 active local
raid2 feature@allocation_classes enabled local
raid2 feature@resilver_defer enabled local
raid2 feature@bookmark_v2 enabled local
raid2 feature@redaction_bookmarks enabled local
raid2 feature@redacted_datasets enabled local
raid2 feature@bookmark_written enabled local
raid2 feature@log_spacemap active local
raid2 feature@livelist enabled local
raid2 feature@device_rebuild enabled local
raid2 feature@zstd_compress active local
raid2 feature@draid enabled local
dataset:
NAME PROPERTY VALUE SOURCE
raid2 compression zstd-1 local
raid2 atime off local
raid2 xattr sa local
raid2 relatime on local
@gamanakis I finally got a chance to test out your patch today. I could not apply it directly, I guess GitHub changed some spaces/tabs, so I patched it manually. The bottom of this comment shows the surrounding code so I can convince you I applied it correctly.
I ran about 8 zfs send -RLec commands and there were no lockups/panics. I have left the new code running on my server so we can detect any issues as it goes about its daily backup tasks.
for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
if (zfeature_active(f, ds->ds_feature[f])) {
dsl_dataset_activate_feature(dsobj, f,
ds->ds_feature[f], tx);
}
}
/*
* We are not allowed to dirty a filesystem when done receiving
* a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
* not be set and a subsequent encrypted raw send will fail. Hence
* activate this feature if needed here.
*/
if (dmu_tx_is_syncing(tx)) {
for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
if (zfeature_active(f, ds->ds_feature_activation[f]) &&
!(zfeature_active(f, ds->ds_feature[f]))) {
dsl_dataset_activate_feature(dsobj, f,
ds->ds_feature_activation[f], tx);
ds->ds_feature[f] = ds->ds_feature_activation[f];
}
}
} else {
cmn_err(CE_NOTE, "not syncing context!");
}
ASSERT3U(ds->ds_prev != 0, ==,
dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);
I am back to testing this. I finally managed to get it working by reverting the commit https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381. I don't know what went wrong during my first tests, but now I succeeded and I have a working 2.1.7 installation. I tested more than a dozen send/receives and had no hang or core dump.
But the new patch from @gamanakis does not work. With it my zfs freezes again:
Dez 13 07:03:28 rakete kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
Dez 13 07:03:28 rakete kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
Dez 13 07:03:28 rakete kernel: Showing stack for process 2971
Dez 13 07:03:28 rakete kernel: CPU: 15 PID: 2971 Comm: txg_sync Tainted: P W OE 6.1.0-zen1-1-zen #1 7b46ab8a09476e11acb0e7c950>
Dez 13 07:03:28 rakete kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ULTRA/X570 AORUS ULTRA, BIOS F36d 07/20/2022
Dez 13 07:03:28 rakete kernel: Call Trace:
Dez 13 07:03:28 rakete kernel: <TASK>
Dez 13 07:03:28 rakete kernel: dump_stack_lvl+0x48/0x60
Dez 13 07:03:28 rakete kernel: spl_panic+0xf4/0x10c [spl 44dc46be6c1ef3141abc1ca53c67e8b523954926]
Dez 13 07:03:28 rakete kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: dsl_dataset_clone_swap_sync_impl+0x8d4/0xe30 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: dsl_dataset_rollback_sync+0x109/0x1c0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: dsl_sync_task_sync+0xac/0xf0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: dsl_pool_sync+0x40d/0x5c0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: spa_sync+0x56c/0xf90 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: ? spa_txg_history_init_io+0x193/0x1c0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: txg_sync_thread+0x22b/0x3f0 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: ? cpumask_weight.constprop.0+0x20/0x20 [zfs 1a1003b386336cac77bfe2d16ddc05b04b90a0ad]
Dez 13 07:03:28 rakete kernel: ? __thread_exit+0x20/0x20 [spl 44dc46be6c1ef3141abc1ca53c67e8b523954926]
Dez 13 07:03:28 rakete kernel: thread_generic_wrapper+0x5e/0x70 [spl 44dc46be6c1ef3141abc1ca53c67e8b523954926]
Dez 13 07:03:28 rakete kernel: kthread+0xde/0x110
Dez 13 07:03:28 rakete kernel: ? kthread_complete_and_exit+0x20/0x20
Dez 13 07:03:28 rakete kernel: ret_from_fork+0x22/0x30
Dez 13 07:03:28 rakete kernel: </TASK>
I am using syncoid to do the send/receive and the only send options are "-L -I". The only receive options are "-s -F", like here:
zfs send -L -I 'zHome/home'@'2022-12-09--10:32' 'zHome/home'@'2022-12-09--10:47' | mbuffer -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 4720 | zfs receive -s -F 'zstore/data/BACKUP/rakete_home/home'
EDIT: All involved datasets are encrypted with a keyfile
It has gone pretty quiet around this issue. Will a fix be considered for the 2.1.8 release?
@mabod I am troubled by your stack trace, are you sure you applied the patch proposed here correctly?
I am just testing it again. I have built the packages completely fresh with your patch. So far I have done 13 send/receives on various datasets without any issue. I will do more tests during the day.
That's good to hear, I will submit a PR. Let me know how it goes.
In the meantime I have done 37 send/receives with the datasets which had issues before. Everything went fine this time. No issues.
I am having this issue as well; it only happened after updating to 2.1.7. I'm also using syncoid with the L, c, and w flags.
I first experienced this issue a while ago and had to force a reboot (I didn't know the REISUB sequence back then), and it occurred again just now while the draid/raidz2 pool was resilvering.
I'll try the patch posted above and report back.
I compiled and installed fresh packages with the patch applied, but I'm still running into the same issues.
[ 383.813742] VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
[ 383.813745] PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
[ 383.813747] Showing stack for process 65186
[ 383.813748] CPU: 6 PID: 65186 Comm: txg_sync Tainted: P OE 6.1.0-x64v3-xanmod1 #0~20221213.629e4bb9
[ 383.813750] Hardware name: To Be Filled By O.E.M. xxx
[ 383.813751] Call Trace:
[ 383.813753] <TASK>
[ 383.813755] dump_stack_lvl+0x44/0x5c
[ 383.813762] spl_panic+0xcb/0xe3 [spl]
[ 383.813769] ? dbuf_rele_and_unlock+0x4bf/0x730 [zfs]
[ 383.813816] ? fzap_remove+0xac/0x140 [zfs]
[ 383.813868] ? spl_kmem_alloc+0xad/0xf0 [spl]
[ 383.813874] dsl_dataset_deactivate_feature_impl+0xf6/0x100 [zfs]
[ 383.813916] dsl_dataset_clone_swap_sync_impl+0x860/0xa40 [zfs]
[ 383.813956] ? dsl_dataset_hold_flags+0x97/0x230 [zfs]
[ 383.813992] dmu_recv_end_sync+0x179/0x580 [zfs]
[ 383.814031] dsl_sync_task_sync+0xa3/0xf0 [zfs]
[ 383.814074] dsl_pool_sync+0x3c7/0x4f0 [zfs]
[ 383.814113] spa_sync+0x551/0xf60 [zfs]
[ 383.814157] ? mutex_lock+0x9/0x30
[ 383.814160] ? spa_txg_history_init_io+0xfd/0x110 [zfs]
[ 383.814205] txg_sync_thread+0x214/0x3d0 [zfs]
[ 383.814246] ? txg_register_callbacks+0xa0/0xa0 [zfs]
[ 383.814283] ? __thread_exit+0x10/0x10 [spl]
[ 383.814289] thread_generic_wrapper+0x55/0x60 [spl]
[ 383.814294] kthread+0xd5/0x100
[ 383.814296] ? kthread_complete_and_exit+0x20/0x20
[ 383.814298] ret_from_fork+0x22/0x30
[ 383.814301] </TASK>
Your kernel is 6.1.0-x64v3-xanmod1. I assume you have it compiled with -march=x86-64-v3. That might be an issue. Please repeat the test with a stock Arch kernel and compile zfs-utils and zfs-dkms with the default CFLAGS="-march=x86-64 -mtune=generic...". I would also recommend recreating the initramfs so that it contains the new modules.
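On EndeavourOS that would be roughly (a sketch, adjust to your setup):

sudo dkms autoinstall   # rebuild the zfs modules for the installed kernels
sudo mkinitcpio -P      # regenerate the initramfs for all kernel presets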
@mabod I'll try it when I'm available to switch the kernel.
With c8d2ab0 reverted the same zfs send/recv seemed to work properly.
I assume you have it compiled with -march=x86-64-v3. That might be an issue.
Wait, so does that mean that anybody who compiles with -march=x86-64-v3 (or native, or whatever else) should expect to run into issues?
I do not know. I am guessing, and I am trying to verify it with the help of others like you or @singingtelegram. I had issues with send/receive in the past which were caused by -march=znver3 (https://github.com/openzfs/zfs/issues/13605). With this issue here, I believe my first test of @gamanakis's patch failed because I used a custom kernel compiled with -march=x86-64-v3, and zfs-utils / zfs-dkms were compiled with the same CFLAGS. Today I switched back to the default CFLAGS="-march=x86-64 -mtune=generic..." and I could not reproduce the failure anymore. That might be a coincidence. I don't know.
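For reference, this is roughly how the default flags can be forced when building from source (a sketch, not the exact PKGBUILD mechanics):

export CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe"
./configure && make -j$(nproc)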
Happened under Debian with 5.10.0-20-amd64 and 2.1.7-1~bpo11+1.
kernel:[25314.473315] VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
kernel:[25314.473367] PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
Happened under Debian with 5.10.0-20-amd64 and 2.1.7-1~bpo11+1.
I got the same error with 2.1.7-1~bpo11+1. It happened when syncoid backed up a dataset from a zfs 2.0 pool to an encrypted zfs 2.1 pool, after receiving the oldest full snapshot and in the middle of receiving the first incremental. Backing up the same dataset to an unencrypted zfs 2.1 destination is fine.
The same syncoid backup under 2.1.5-1~bpo11+1 also panics, but with a different error message:
kernel: VERIFY3(0 == zap_add(mos, dsl_dir_phys(pds)->dd_child_dir_zapobj, name, sizeof (uint64_t), 1, &ddobj, tx)) failed (0 == 17)
kernel: PANIC at dsl_dir.c:951:dsl_dir_create_sync()
Today I had a couple more freezes, although I am using the pull request from @gamanakis: https://github.com/openzfs/zfs/pull/14304
Dez 24 07:33:46 rakete kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
Dez 24 07:33:46 rakete kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
Dez 24 07:33:46 rakete kernel: Showing stack for process 2997
Dez 24 07:33:46 rakete kernel: CPU: 9 PID: 2997 Comm: txg_sync Tainted: P OE 5.15.85-272-tkg-cfs #1 4306e834cad2e7fa881b8beb0>
Dez 24 07:33:46 rakete kernel: Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ULTRA/X570 AORUS ULTRA, BIOS F36d 07/20/2022
Dez 24 07:33:46 rakete kernel: Call Trace:
Dez 24 07:33:46 rakete kernel: <TASK>
Dez 24 07:33:46 rakete kernel: dump_stack_lvl+0x45/0x5b
Dez 24 07:33:46 rakete kernel: spl_panic+0xe9/0x101 [spl 50a5f53751636295d4130de5f7421dda161ad3f0]
Dez 24 07:33:46 rakete kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs 841fb41d130b667f49823dadcbaf688267a5f2a4]
Dez 24 07:33:46 rakete kernel: dsl_dataset_clone_swap_sync_impl+0x8cb/0xab0 [zfs 841fb41d130b667f49823dadcbaf688267a5f2a4]
Dez 24 07:33:46 rakete kernel: dmu_recv_end_sync+0x18f/0x5a0 [zfs 841fb41d130b667f49823dadcbaf688267a5f2a4]
Dez 24 07:33:46 rakete kernel: dsl_sync_task_sync+0xa8/0xf0 [zfs 841fb41d130b667f49823dadcbaf688267a5f2a4]
Dez 24 07:33:46 rakete kernel: dsl_pool_sync+0x3f7/0x510 [zfs 841fb41d130b667f49823dadcbaf688267a5f2a4]
Dez 24 07:33:46 rakete kernel: spa_sync+0x567/0xfb0 [zfs 841fb41d130b667f49823dadcbaf688267a5f2a4]
Dez 24 07:33:46 rakete kernel: ? spa_txg_history_init_io+0x113/0x120 [zfs 841fb41d130b667f49823dadcbaf688267a5f2a4]
Dez 24 07:33:46 rakete kernel: txg_sync_thread+0x229/0x3f0 [zfs 841fb41d130b667f49823dadcbaf688267a5f2a4]
Dez 24 07:33:46 rakete kernel: ? txg_fini+0x260/0x260 [zfs 841fb41d130b667f49823dadcbaf688267a5f2a4]
Dez 24 07:33:46 rakete kernel: ? __thread_exit+0x20/0x20 [spl 50a5f53751636295d4130de5f7421dda161ad3f0]
Dez 24 07:33:46 rakete kernel: thread_generic_wrapper+0x5a/0x70 [spl 50a5f53751636295d4130de5f7421dda161ad3f0]
Dez 24 07:33:46 rakete kernel: kthread+0x118/0x140
Dez 24 07:33:46 rakete kernel: ? set_kthread_struct+0x50/0x50
Dez 24 07:33:46 rakete kernel: ret_from_fork+0x22/0x30
Dez 24 07:33:46 rakete kernel: </TASK>
In the meantime I tend to believe that I am struggling with two issues here. One issue is addressed by the PR, but the other is probably related to zstd compression. In cases where I am getting freezes, the sending dataset has lz4 compression and the receiving dataset has zstd compression (and both are encrypted). It may be related to https://github.com/openzfs/zfs/issues/12785, although I am not using zstd-19 but regular zstd.
I was sending from a mostly zstd-enabled pool to a pool set to use lz4. Again, with c8d2ab0 reverted the same zfs send/recv seemed to work properly. (I ran into #12001 today when running the same syncoid command, which seems to be unrelated to this issue.)
I had the same freeze again today with all datasets on lz4 compression. So for me it has nothing to do with zstd compression.
@mabod are you using raw send in those encrypted datasets?
No, I do not use raw send, i.e. no "-w" option.
My standard command is:
syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zpool1/src zpool2/dest
I encountered this when receiving a Docker dataset. When it is excluded everything works fine.
Source properties:
$ zfs get all rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly
NAME PROPERTY VALUE SOURCE
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly type snapshot -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly creation E dets 26 15:00 2022 -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly used 0B -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly referenced 40.8M -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly compressratio 1.92x -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly devices off inherited from rpool
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly exec on default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly setuid on default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly createtxg 22726688 -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly xattr sa inherited from rpool
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly version 5 -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly utf8only on -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly normalization formD -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly casesensitivity sensitive -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly nbmand off default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly guid 12515425266654301958 -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly primarycache all default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly secondarycache all default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly defer_destroy off -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly userrefs 0 -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly objsetid 120819 -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly mlslabel none default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly refcompressratio 1.92x -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly written 0 -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly logicalreferenced 77.3M -
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly acltype posix inherited from rpool
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly context none default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly fscontext none default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly defcontext none default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly rootcontext none default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly encryption off default
rpool/var/lib/docker/xxx@autosnap_2022-12-26_13:00:35_hourly com.sun:auto-snapshot false inherited from rpool/var/lib/docker
$ zfs get all rpool/var/lib/docker/xxx
NAME PROPERTY VALUE SOURCE
rpool/var/lib/docker/xxx type filesystem -
rpool/var/lib/docker/xxx creation T juuli 19 16:38 2022 -
rpool/var/lib/docker/xxx used 40.8M -
rpool/var/lib/docker/xxx available 158G -
rpool/var/lib/docker/xxx referenced 40.8M -
rpool/var/lib/docker/xxx compressratio 1.92x -
rpool/var/lib/docker/xxx mounted no -
rpool/var/lib/docker/xxx quota none default
rpool/var/lib/docker/xxx reservation none default
rpool/var/lib/docker/xxx recordsize 128K default
rpool/var/lib/docker/xxx mountpoint legacy local
rpool/var/lib/docker/xxx sharenfs off default
rpool/var/lib/docker/xxx checksum on default
rpool/var/lib/docker/xxx compression lz4 inherited from rpool
rpool/var/lib/docker/xxx atime on default
rpool/var/lib/docker/xxx devices off inherited from rpool
rpool/var/lib/docker/xxx exec on default
rpool/var/lib/docker/xxx setuid on default
rpool/var/lib/docker/xxx readonly off default
rpool/var/lib/docker/xxx zoned off default
rpool/var/lib/docker/xxx snapdir visible inherited from rpool/var/lib
rpool/var/lib/docker/xxx aclmode discard default
rpool/var/lib/docker/xxx aclinherit restricted default
rpool/var/lib/docker/xxx createtxg 18674763 -
rpool/var/lib/docker/xxx canmount on default
rpool/var/lib/docker/xxx xattr sa inherited from rpool
rpool/var/lib/docker/xxx copies 1 default
rpool/var/lib/docker/xxx version 5 -
rpool/var/lib/docker/xxx utf8only on -
rpool/var/lib/docker/xxx normalization formD -
rpool/var/lib/docker/xxx casesensitivity sensitive -
rpool/var/lib/docker/xxx vscan off default
rpool/var/lib/docker/xxx nbmand off default
rpool/var/lib/docker/xxx sharesmb off default
rpool/var/lib/docker/xxx refquota none default
rpool/var/lib/docker/xxx refreservation none default
rpool/var/lib/docker/xxx guid 1370374951832482381 -
rpool/var/lib/docker/xxx primarycache all default
rpool/var/lib/docker/xxx secondarycache all default
rpool/var/lib/docker/xxx usedbysnapshots 1K -
rpool/var/lib/docker/xxx usedbydataset 40.8M -
rpool/var/lib/docker/xxx usedbychildren 0B -
rpool/var/lib/docker/xxx usedbyrefreservation 0B -
rpool/var/lib/docker/xxx logbias latency default
rpool/var/lib/docker/xxx objsetid 67998 -
rpool/var/lib/docker/xxx dedup off default
rpool/var/lib/docker/xxx mlslabel none default
rpool/var/lib/docker/xxx sync standard default
rpool/var/lib/docker/xxx dnodesize auto inherited from rpool
rpool/var/lib/docker/xxx refcompressratio 1.92x -
rpool/var/lib/docker/xxx written 0 -
rpool/var/lib/docker/xxx logicalused 77.3M -
rpool/var/lib/docker/xxx logicalreferenced 77.3M -
rpool/var/lib/docker/xxx volmode default default
rpool/var/lib/docker/xxx filesystem_limit none default
rpool/var/lib/docker/xxx snapshot_limit none default
rpool/var/lib/docker/xxx filesystem_count none default
rpool/var/lib/docker/xxx snapshot_count none default
rpool/var/lib/docker/xxx snapdev hidden default
rpool/var/lib/docker/xxx acltype posix inherited from rpool
rpool/var/lib/docker/xxx context none default
rpool/var/lib/docker/xxx fscontext none default
rpool/var/lib/docker/xxx defcontext none default
rpool/var/lib/docker/xxx rootcontext none default
rpool/var/lib/docker/xxx relatime on inherited from rpool
rpool/var/lib/docker/xxx redundant_metadata all default
rpool/var/lib/docker/xxx overlay on default
rpool/var/lib/docker/xxx encryption off default
rpool/var/lib/docker/xxx keylocation none default
rpool/var/lib/docker/xxx keyformat none default
rpool/var/lib/docker/xxx pbkdf2iters 0 default
rpool/var/lib/docker/xxx special_small_blocks 0 default
rpool/var/lib/docker/xxx com.sun:auto-snapshot false inherited from rpool/var/lib/docker
Destination properties:
$ zfs get all yyy/rpool/var/lib/docker/xxx
NAME PROPERTY VALUE SOURCE
yyy/rpool/var/lib/docker/xxx type filesystem -
yyy/rpool/var/lib/docker/xxx creation L okt 1 18:11 2022 -
yyy/rpool/var/lib/docker/xxx used 64.5M -
yyy/rpool/var/lib/docker/xxx available 2.75T -
yyy/rpool/var/lib/docker/xxx referenced 57.9M -
yyy/rpool/var/lib/docker/xxx compressratio 1.70x -
yyy/rpool/var/lib/docker/xxx mounted yes -
yyy/rpool/var/lib/docker/xxx quota none default
yyy/rpool/var/lib/docker/xxx reservation none default
yyy/rpool/var/lib/docker/xxx recordsize 128K default
yyy/rpool/var/lib/docker/xxx mountpoint /yyy/rpool/var/lib/docker/xxx default
yyy/rpool/var/lib/docker/xxx sharenfs off default
yyy/rpool/var/lib/docker/xxx checksum on default
yyy/rpool/var/lib/docker/xxx compression on inherited from yyy
yyy/rpool/var/lib/docker/xxx atime off inherited from yyy
yyy/rpool/var/lib/docker/xxx devices on default
yyy/rpool/var/lib/docker/xxx exec on default
yyy/rpool/var/lib/docker/xxx setuid on default
yyy/rpool/var/lib/docker/xxx readonly off default
yyy/rpool/var/lib/docker/xxx zoned off default
yyy/rpool/var/lib/docker/xxx snapdir hidden default
yyy/rpool/var/lib/docker/xxx aclmode discard default
yyy/rpool/var/lib/docker/xxx aclinherit restricted default
yyy/rpool/var/lib/docker/xxx createtxg 916772 -
yyy/rpool/var/lib/docker/xxx canmount on default
yyy/rpool/var/lib/docker/xxx xattr sa inherited from yyy
yyy/rpool/var/lib/docker/xxx copies 1 default
yyy/rpool/var/lib/docker/xxx version 5 -
yyy/rpool/var/lib/docker/xxx utf8only on -
yyy/rpool/var/lib/docker/xxx normalization formD -
yyy/rpool/var/lib/docker/xxx casesensitivity sensitive -
yyy/rpool/var/lib/docker/xxx vscan off default
yyy/rpool/var/lib/docker/xxx nbmand off default
yyy/rpool/var/lib/docker/xxx sharesmb off default
yyy/rpool/var/lib/docker/xxx refquota none default
yyy/rpool/var/lib/docker/xxx refreservation none default
yyy/rpool/var/lib/docker/xxx guid 988309690596550009 -
yyy/rpool/var/lib/docker/xxx primarycache all default
yyy/rpool/var/lib/docker/xxx secondarycache all default
yyy/rpool/var/lib/docker/xxx usedbysnapshots 6.57M -
yyy/rpool/var/lib/docker/xxx usedbydataset 57.9M -
yyy/rpool/var/lib/docker/xxx usedbychildren 0B -
yyy/rpool/var/lib/docker/xxx usedbyrefreservation 0B -
yyy/rpool/var/lib/docker/xxx logbias latency default
yyy/rpool/var/lib/docker/xxx objsetid 1259620 -
yyy/rpool/var/lib/docker/xxx dedup off default
yyy/rpool/var/lib/docker/xxx mlslabel none default
yyy/rpool/var/lib/docker/xxx sync standard default
yyy/rpool/var/lib/docker/xxx dnodesize legacy default
yyy/rpool/var/lib/docker/xxx refcompressratio 1.75x -
yyy/rpool/var/lib/docker/xxx written 6.59M -
yyy/rpool/var/lib/docker/xxx logicalused 84.1M -
yyy/rpool/var/lib/docker/xxx logicalreferenced 80.8M -
yyy/rpool/var/lib/docker/xxx volmode default default
yyy/rpool/var/lib/docker/xxx filesystem_limit none default
yyy/rpool/var/lib/docker/xxx snapshot_limit none default
yyy/rpool/var/lib/docker/xxx filesystem_count none default
yyy/rpool/var/lib/docker/xxx snapshot_count none default
yyy/rpool/var/lib/docker/xxx snapdev hidden default
yyy/rpool/var/lib/docker/xxx acltype off default
yyy/rpool/var/lib/docker/xxx context none default
yyy/rpool/var/lib/docker/xxx fscontext none default
yyy/rpool/var/lib/docker/xxx defcontext none default
yyy/rpool/var/lib/docker/xxx rootcontext none default
yyy/rpool/var/lib/docker/xxx relatime off default
yyy/rpool/var/lib/docker/xxx redundant_metadata all default
yyy/rpool/var/lib/docker/xxx overlay on default
yyy/rpool/var/lib/docker/xxx encryption aes-256-gcm -
yyy/rpool/var/lib/docker/xxx keylocation none default
yyy/rpool/var/lib/docker/xxx keyformat passphrase -
yyy/rpool/var/lib/docker/xxx pbkdf2iters 350000 -
yyy/rpool/var/lib/docker/xxx encryptionroot yyy -
yyy/rpool/var/lib/docker/xxx keystatus available -
yyy/rpool/var/lib/docker/xxx special_small_blocks 0 default
There are some differences like dnodesize, acltype, and of course encryption.
Just to summarize my findings so far:
The last couple of days I was running 2.1.7 with the PR from @gamanakis: https://github.com/openzfs/zfs/pull/14304. At first it looked really promising, with no crashes, but then I had another freeze 2 days ago.
Then I assumed that zstd compression could be causing the issue: I have lz4 on the sender side and zstd on the receiver side, so I moved the entire receiver side to lz4. But then I got another freeze, so zstd is not the issue.
Maybe it is encryption related; sender and receiver are encrypted with a keyfile. But I do not know.
Now I am back to square one and I have stock 2.1.7 running with the large-block commit reverted: https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381
So far I have no freezes.
@mabod I think it would be helpful to know which feature triggers the bug. I assume the code in https://github.com/openzfs/zfs/commit/c8d2ab05e1e6bf21185723fabdebbe5bb3374381 activates a feature flag, which ZFS then tries to deactivate before it is actually committed in the ZAP.
Would you be able to run the following patch on top of the original 2.1.7 release? Then take a look at dmesg for lines like: "NOTICE: feature activated: 11"
diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index 57a58f88c..fcdacd591 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1770,6 +1770,7 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
if (zfeature_active(f, ds->ds_feature_activation[f]) &&
!(zfeature_active(f, ds->ds_feature[f]))) {
+ cmn_err(CE_NOTE, "feature activated: %d", f);
dsl_dataset_activate_feature(dsobj, f,
ds->ds_feature_activation[f], tx);
ds->ds_feature[f] =
    ds->ds_feature_activation[f];
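After reproducing the panic, the feature index should show up like this:

dmesg | grep "feature activated"

The number printed corresponds to the order of the spa_feature_t enum in include/zfeature_common.h, which would tell us which feature is involved.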
Describe the problem you're observing
When using zfs send to make a backup on a remote machine, the receiver throws a PANIC in one of the zfs functions and the file system deadlocks.