openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Trouble with send/recv, dedup dataset, compressed sends #10237

Open dioni21 opened 4 years ago

dioni21 commented 4 years ago

System information

Type Version/Name
Distribution Name Fedora
Distribution Version Fedora release 31 (Thirty One)
Linux Kernel 5.5.17-200.fc31.x86_64
Architecture x86_64
ZFS Version 0.8.0-732_ga84c92f93 (master a84c92f93364116b5e7f4685eb6d665526251a51)
SPL Version 0.8.0-732_ga84c92f93

Everything compiled with configuration:

configure --enable-silent-rules --enable-dependency-tracking --enable-asan --enable-debuginfo --enable-debug --enable-debug-kmem --enable-debug-kmem-tracking --enable-code-coverage

Describe the problem you're observing

While doing a zfs send/recv, I got a kernel panic message, and now almost everything zpool-related is frozen. The zfs recv process is blocked in D state.

Describe how to reproduce the problem

Not sure yet. Will try again later.

Pool was created with the following command:

zpool create new -f \
  -O atime=off \
  -O compression=on \
  -O dedup=edonr,verify \
  -O dnodesize=auto \
  -O checksum=edonr \
  -O relatime=off \
  -O secondarycache=metadata \
  -O sharenfs=sec=sys,crossmnt,no_subtree_check,no_root_squash,rw=@192.168.0.0/16 \
  -O xattr=sa \
  -o listsnapshots=on \
  -o ashift=13 \
  -o comment='Jonny 10T Seagate' \
  -o autotrim=off \
  -o feature@hole_birth=disabled \
  -o feature@project_quota=disabled \
  -o feature@redaction_bookmarks=disabled \
  -o feature@redacted_datasets=disabled \
  mirror /dev/disk/by-id/ata-XXX-part1 /dev/disk/by-id/ata-YYY-part1 \
  special /dev/ssd480/special \
  log /dev/ssd480/zil \
  cache /dev/ssd480/cache

zfs create \
  -o secondarycache=all \
  -o special_small_blocks=16k \
  -o copies=2 \
  new/jonny

And some other datasets that were not yet initialized.

Command to transfer data:

zfs send --embed --compressed tank@snap | mbuffer -s 128k -b 128 | zfs receive -F -s -v new
zfs send --embed --compressed tank/jonny@snap | mbuffer -s 128k -b 128 | zfs receive -F -s -v new/jonny

The first transfer ran without any problems, but I think it has no deduped data at all.

While doing the second, I decided to stop it and change the script, but the zfs recv never returned from the ^C.

BTW: tank is an old pool on a failing disk. The config difference is basically checksum=on and dedup=verify, so I assume the "new" stuff is edonr.

Will reboot now to see if I can solve this.

Include any warning/errors/backtraces from the system logs

Kernel logs from dmesg:

[Tue Apr 21 19:36:30 2020] VERIFY3(dabd->abd_size == sabd->abd_size) failed (4096 == 19456)
[Tue Apr 21 19:36:30 2020] PANIC at abd.c:1368:abd_cmp()
[Tue Apr 21 19:36:30 2020] Showing stack for process 94594
[Tue Apr 21 19:36:30 2020] CPU: 5 PID: 94594 Comm: z_wr_iss Tainted: P        W  OE     5.5.17-200.fc31.x86_64 #1
[Tue Apr 21 19:36:30 2020] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-D3P, BIOS FD 02/26/2016
[Tue Apr 21 19:36:30 2020] Call Trace:
[Tue Apr 21 19:36:30 2020]  dump_stack+0x66/0x90
[Tue Apr 21 19:36:30 2020]  spl_panic+0xd4/0x102 [spl]
[Tue Apr 21 19:36:30 2020]  ? __switch_to_asm+0x34/0x70
[Tue Apr 21 19:36:30 2020]  ? preempt_schedule_common+0x14/0x20
[Tue Apr 21 19:36:30 2020]  ? spl_kmem_cache_free+0xfd/0x1d0 [spl]
[Tue Apr 21 19:36:30 2020]  ? kmem_cache_free+0x289/0x2b0
[Tue Apr 21 19:36:30 2020]  ? spl_kmem_cache_free+0x105/0x1d0 [spl]
[Tue Apr 21 19:36:30 2020]  abd_cmp+0x3f/0x60 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_collision+0x47b/0x4e0 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_write+0x1fb/0x520 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_execute+0xe6/0x220 [zfs]
[Tue Apr 21 19:36:30 2020]  taskq_thread+0x237/0x4a0 [spl]
[Tue Apr 21 19:36:30 2020]  ? wake_up_q+0xa0/0xa0
[Tue Apr 21 19:36:30 2020]  ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
[Tue Apr 21 19:36:30 2020]  kthread+0xf9/0x130
[Tue Apr 21 19:36:30 2020]  ? kasan_check_write.constprop.0+0x10/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? kthread_park+0x90/0x90
[Tue Apr 21 19:36:30 2020]  ret_from_fork+0x22/0x40
[Tue Apr 21 19:36:30 2020] VERIFY3(dabd->abd_size == sabd->abd_size) failed (8192 == 131072)
[Tue Apr 21 19:36:30 2020] PANIC at abd.c:1368:abd_cmp()
[Tue Apr 21 19:36:30 2020] Showing stack for process 94592
[Tue Apr 21 19:36:30 2020] CPU: 0 PID: 94592 Comm: z_wr_iss Tainted: P        W  OE     5.5.17-200.fc31.x86_64 #1
[Tue Apr 21 19:36:30 2020] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-D3P, BIOS FD 02/26/2016
[Tue Apr 21 19:36:30 2020] Call Trace:
[Tue Apr 21 19:36:30 2020]  dump_stack+0x66/0x90
[Tue Apr 21 19:36:30 2020]  spl_panic+0xd4/0x102 [spl]
[Tue Apr 21 19:36:30 2020]  ? __wake_up_common_lock+0x8a/0xc0
[Tue Apr 21 19:36:30 2020]  ? atomic_read+0x9/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? waitqueue_active+0x9/0x20 [spl]
[Tue Apr 21 19:36:30 2020]  ? cv_destroy_wakeup+0x45/0xa0 [spl]
[Tue Apr 21 19:36:30 2020]  ? atomic_read+0x9/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? waitqueue_active+0x9/0x20 [spl]
[Tue Apr 21 19:36:30 2020]  ? __cv_destroy+0xc9/0x250 [spl]
[Tue Apr 21 19:36:30 2020]  ? spl_kmem_cache_free+0x105/0x1d0 [spl]
[Tue Apr 21 19:36:30 2020]  abd_cmp+0x3f/0x60 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_collision+0x47b/0x4e0 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_write+0x1fb/0x520 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_execute+0xe6/0x220 [zfs]
[Tue Apr 21 19:36:30 2020]  taskq_thread+0x237/0x4a0 [spl]
[Tue Apr 21 19:36:30 2020]  ? wake_up_q+0xa0/0xa0
[Tue Apr 21 19:36:30 2020]  ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
[Tue Apr 21 19:36:30 2020]  kthread+0xf9/0x130
[Tue Apr 21 19:36:30 2020]  ? kasan_check_write.constprop.0+0x10/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? kthread_park+0x90/0x90
[Tue Apr 21 19:36:30 2020]  ret_from_fork+0x22/0x40
[Tue Apr 21 19:36:30 2020] VERIFY3(dabd->abd_size == sabd->abd_size) failed (8192 == 131072)
[Tue Apr 21 19:36:30 2020] PANIC at abd.c:1368:abd_cmp()
[Tue Apr 21 19:36:30 2020] Showing stack for process 94593
[Tue Apr 21 19:36:30 2020] CPU: 7 PID: 94593 Comm: z_wr_iss Tainted: P        W  OE     5.5.17-200.fc31.x86_64 #1
[Tue Apr 21 19:36:30 2020] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-D3P, BIOS FD 02/26/2016
[Tue Apr 21 19:36:30 2020] Call Trace:
[Tue Apr 21 19:36:30 2020]  dump_stack+0x66/0x90
[Tue Apr 21 19:36:30 2020]  spl_panic+0xd4/0x102 [spl]
[Tue Apr 21 19:36:30 2020]  ? __wake_up_common_lock+0x8a/0xc0
[Tue Apr 21 19:36:30 2020]  ? atomic_read+0x9/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? waitqueue_active+0x9/0x20 [spl]
[Tue Apr 21 19:36:30 2020]  ? cv_destroy_wakeup+0x45/0xa0 [spl]
[Tue Apr 21 19:36:30 2020]  ? atomic_read+0x9/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? waitqueue_active+0x9/0x20 [spl]
[Tue Apr 21 19:36:30 2020]  ? __cv_destroy+0xc9/0x250 [spl]
[Tue Apr 21 19:36:30 2020]  ? spl_kmem_cache_free+0x105/0x1d0 [spl]
[Tue Apr 21 19:36:30 2020]  abd_cmp+0x3f/0x60 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_collision+0x47b/0x4e0 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_write+0x1fb/0x520 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_execute+0xe6/0x220 [zfs]
[Tue Apr 21 19:36:30 2020]  taskq_thread+0x237/0x4a0 [spl]
[Tue Apr 21 19:36:30 2020]  ? wake_up_q+0xa0/0xa0
[Tue Apr 21 19:36:30 2020]  ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
[Tue Apr 21 19:36:30 2020]  kthread+0xf9/0x130
[Tue Apr 21 19:36:30 2020]  ? kasan_check_write.constprop.0+0x10/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? kthread_park+0x90/0x90
[Tue Apr 21 19:36:30 2020]  ret_from_fork+0x22/0x40
[Tue Apr 21 19:36:30 2020] VERIFY3(dabd->abd_size == sabd->abd_size) failed (4096 == 10240)
[Tue Apr 21 19:36:30 2020] PANIC at abd.c:1368:abd_cmp()
[Tue Apr 21 19:36:30 2020] Showing stack for process 94590
[Tue Apr 21 19:36:30 2020] CPU: 6 PID: 94590 Comm: z_wr_iss Tainted: P        W  OE     5.5.17-200.fc31.x86_64 #1
[Tue Apr 21 19:36:30 2020] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-D3P, BIOS FD 02/26/2016
[Tue Apr 21 19:36:30 2020] Call Trace:
[Tue Apr 21 19:36:30 2020]  dump_stack+0x66/0x90
[Tue Apr 21 19:36:30 2020]  spl_panic+0xd4/0x102 [spl]
[Tue Apr 21 19:36:30 2020]  ? spl_kmem_cache_free+0xfd/0x1d0 [spl]
[Tue Apr 21 19:36:30 2020]  ? kmem_cache_free+0x289/0x2b0
[Tue Apr 21 19:36:30 2020]  ? spl_kmem_cache_free+0x105/0x1d0 [spl]
[Tue Apr 21 19:36:30 2020]  abd_cmp+0x3f/0x60 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_collision+0x47b/0x4e0 [zfs]
[Tue Apr 21 19:36:30 2020]  ? _cond_resched+0x15/0x30
[Tue Apr 21 19:36:30 2020]  zio_ddt_write+0x1fb/0x520 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_execute+0xe6/0x220 [zfs]
[Tue Apr 21 19:36:30 2020]  taskq_thread+0x237/0x4a0 [spl]
[Tue Apr 21 19:36:30 2020]  ? wake_up_q+0xa0/0xa0
[Tue Apr 21 19:36:30 2020]  ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
[Tue Apr 21 19:36:30 2020]  kthread+0xf9/0x130
[Tue Apr 21 19:36:30 2020]  ? kasan_check_write.constprop.0+0x10/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? kthread_park+0x90/0x90
[Tue Apr 21 19:36:30 2020]  ret_from_fork+0x22/0x40
[Tue Apr 21 19:36:30 2020] VERIFY3(dabd->abd_size == sabd->abd_size) failed (8192 == 131072)
[Tue Apr 21 19:36:30 2020] PANIC at abd.c:1368:abd_cmp()
[Tue Apr 21 19:36:30 2020] Showing stack for process 94591
[Tue Apr 21 19:36:30 2020] CPU: 7 PID: 94591 Comm: z_wr_iss Tainted: P        W  OE     5.5.17-200.fc31.x86_64 #1
[Tue Apr 21 19:36:30 2020] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-D3P, BIOS FD 02/26/2016
[Tue Apr 21 19:36:30 2020] Call Trace:
[Tue Apr 21 19:36:30 2020]  dump_stack+0x66/0x90
[Tue Apr 21 19:36:30 2020]  spl_panic+0xd4/0x102 [spl]
[Tue Apr 21 19:36:30 2020]  ? __wake_up_common_lock+0x8a/0xc0
[Tue Apr 21 19:36:30 2020]  ? atomic_read+0x9/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? waitqueue_active+0x9/0x20 [spl]
[Tue Apr 21 19:36:30 2020]  ? cv_destroy_wakeup+0x45/0xa0 [spl]
[Tue Apr 21 19:36:30 2020]  ? atomic_read+0x9/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? waitqueue_active+0x9/0x20 [spl]
[Tue Apr 21 19:36:30 2020]  ? __cv_destroy+0xc9/0x250 [spl]
[Tue Apr 21 19:36:30 2020]  ? spl_kmem_cache_free+0x105/0x1d0 [spl]
[Tue Apr 21 19:36:30 2020]  abd_cmp+0x3f/0x60 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_collision+0x47b/0x4e0 [zfs]
[Tue Apr 21 19:36:30 2020]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_write+0x1fb/0x520 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_execute+0xe6/0x220 [zfs]
[Tue Apr 21 19:36:30 2020]  taskq_thread+0x237/0x4a0 [spl]
[Tue Apr 21 19:36:30 2020]  ? wake_up_q+0xa0/0xa0
[Tue Apr 21 19:36:30 2020]  ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
[Tue Apr 21 19:36:30 2020]  kthread+0xf9/0x130
[Tue Apr 21 19:36:30 2020]  ? kasan_check_write.constprop.0+0x10/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? kthread_park+0x90/0x90
[Tue Apr 21 19:36:30 2020]  ret_from_fork+0x22/0x40
[Tue Apr 21 19:36:30 2020] VERIFY3(dabd->abd_size == sabd->abd_size) failed (4096 == 131072)
[Tue Apr 21 19:36:30 2020] PANIC at abd.c:1368:abd_cmp()
[Tue Apr 21 19:36:30 2020] Showing stack for process 94595
[Tue Apr 21 19:36:30 2020] CPU: 1 PID: 94595 Comm: z_wr_iss Tainted: P        W  OE     5.5.17-200.fc31.x86_64 #1
[Tue Apr 21 19:36:30 2020] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-D3P, BIOS FD 02/26/2016
[Tue Apr 21 19:36:30 2020] Call Trace:
[Tue Apr 21 19:36:30 2020]  dump_stack+0x66/0x90
[Tue Apr 21 19:36:30 2020]  spl_panic+0xd4/0x102 [spl]
[Tue Apr 21 19:36:30 2020]  ? spl_kmem_cache_free+0xfd/0x1d0 [spl]
[Tue Apr 21 19:36:30 2020]  ? kmem_cache_free+0x289/0x2b0
[Tue Apr 21 19:36:30 2020]  ? spl_kmem_cache_free+0x105/0x1d0 [spl]
[Tue Apr 21 19:36:30 2020]  abd_cmp+0x3f/0x60 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_ddt_collision+0x47b/0x4e0 [zfs]
[Tue Apr 21 19:36:30 2020]  ? _cond_resched+0x15/0x30
[Tue Apr 21 19:36:30 2020]  zio_ddt_write+0x1fb/0x520 [zfs]
[Tue Apr 21 19:36:30 2020]  zio_execute+0xe6/0x220 [zfs]
[Tue Apr 21 19:36:30 2020]  taskq_thread+0x237/0x4a0 [spl]
[Tue Apr 21 19:36:30 2020]  ? wake_up_q+0xa0/0xa0
[Tue Apr 21 19:36:30 2020]  ? zio_execute_stack_check.constprop.0+0x10/0x10 [zfs]
[Tue Apr 21 19:36:30 2020]  kthread+0xf9/0x130
[Tue Apr 21 19:36:30 2020]  ? kasan_check_write.constprop.0+0x10/0x10 [spl]
[Tue Apr 21 19:36:30 2020]  ? kthread_park+0x90/0x90
[Tue Apr 21 19:36:30 2020]  ret_from_fork+0x22/0x40
ahrens commented 4 years ago

Here's an idea of how this could be happening:

arc_write() does (annotated):

    zio = zio_write(...,
        abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)), //abd
        HDR_GET_LSIZE(hdr), //lsize
        arc_buf_size(buf), //psize

If this is a dedup,verify write, this zio will go through zio_ddt_collision().

The problem comes when we are writing already-compressed data, due to receiving from zfs send --compressed. In this case, the abd that arc_write() creates is of size HDR_GET_LSIZE(hdr), the logical (uncompressed) size, but the buffer is actually only arc_buf_size(), the smaller, "physical", already-compressed size. So the abd references past the end of the buffer, into potentially unmapped virtual addresses, but for the most part that doesn't matter because we don't read past the psize (arc_buf_size()).
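
To make the size relationships concrete, here is a minimal standalone C sketch of the situation described above. The abd_t and abd_get_from_buf_sketch() below are hypothetical, stripped-down stand-ins, not ZFS source: the point is that the abd records the logical size while the allocation behind it is only the compressed size.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical, stripped-down stand-in for ZFS's abd_t. */
    typedef struct {
        void   *buf;       /* borrowed buffer, not owned by the abd */
        size_t  abd_size;  /* size the abd claims to cover */
    } abd_t;

    /* Mirrors the shape of abd_get_from_buf(): it trusts the caller's
     * size argument and never checks the real length of buf. */
    static abd_t *abd_get_from_buf_sketch(void *buf, size_t size) {
        abd_t *abd = malloc(sizeof (*abd));
        abd->buf = buf;
        abd->abd_size = size;
        return abd;
    }

    int main(void) {
        size_t lsize = 131072; /* HDR_GET_LSIZE(hdr): uncompressed size   */
        size_t psize = 8192;   /* arc_buf_size(buf): actual buffer length */

        void *b_data = malloc(psize);  /* already-compressed arc buffer */
        abd_t *abd = abd_get_from_buf_sketch(b_data, lsize);

        /* abd_size exceeds the allocation by lsize - psize bytes, so any
         * consumer that honors abd_size reads past the end of b_data. */
        printf("abd_size=%zu, valid bytes=%zu, overhang=%zu\n",
            abd->abd_size, psize, abd->abd_size - psize);

        free(abd);
        free(b_data);
        return 0;
    }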

However, zio_ddt_collision() has this code:

        zio_t *lio = dde->dde_lead_zio[p];
...
            return (lio->io_orig_size != zio->io_orig_size ||
                abd_cmp(zio->io_orig_abd, lio->io_orig_abd) != 0);

What's happening here is that we are doing a dedup,verify write, and we've found a hit in the DDT, so we are going to verify that the data referenced by the entry in the dedup table (the DDE) is the same as the data we are writing. Typically we will read the data off disk and then compare it to what we are writing. But in this case, we are still in the middle of writing out the block that the DDE will reference (in the currently-syncing txg, while we are writing in the next (open) txg). So rather than reading the data off disk, we compare with the data that the earlier zio is writing.

This code assumes that if the two zios' io_orig_size (i.e. psize) values are the same, then their abds will be the same size. In this case of arc_write() using a wrong-size abd for an already-compressed write, typically both the old and new zios would have gone through the same code paths, so both of them would have io_orig_size be the compressed size and io_orig_abd->abd_size be the uncompressed size. This would lead to incorrectly thinking that there is a collision (same hash but different buffer contents), because we are comparing past the end of the actual (compressed) buffer.
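
Continuing the same simplified model (plain standalone C, not ZFS source; the buffer names are hypothetical), this sketch shows why comparing abd_size bytes instead of psize bytes can report a false collision even when the compressed payloads are identical: the bytes past psize are unrelated garbage.

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    int main(void) {
        size_t psize = 8192;   /* valid, compressed bytes in each buffer */
        size_t lsize = 131072; /* size recorded in each zio's abd        */

        /* Two writes of the same block: identical up to psize. */
        unsigned char *zio_buf = calloc(1, lsize);
        unsigned char *lio_buf = calloc(1, lsize);
        memset(zio_buf, 0xAB, psize);
        memset(lio_buf, 0xAB, psize);
        lio_buf[psize + 64] = 0x01;  /* stale garbage past the payload */

        /* Limiting the comparison to psize is correct... */
        int psize_mismatch = memcmp(zio_buf, lio_buf, psize) != 0;
        /* ...while comparing the full claimed abd_size is not. */
        int lsize_mismatch = memcmp(zio_buf, lio_buf, lsize) != 0;

        printf("compare psize (%zu) bytes: %s\n", psize,
            psize_mismatch ? "mismatch" : "match");
        printf("compare lsize (%zu) bytes: %s  <- spurious collision\n",
            lsize, lsize_mismatch ? "mismatch" : "match");

        free(zio_buf);
        free(lio_buf);
        return 0;
    }

Note that in the reporter's debug build the size assertion in abd_cmp() fires before any comparison happens, which is why the logs show VERIFY3(dabd->abd_size == sabd->abd_size) failing rather than a silent false collision.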

I'm not yet sure how the later zio ends up with the correct (smaller, compressed) io_orig_abd->abd_size.

FYI, the earlier zio (dde_lead_zio) is created by zio_ddt_write (paraphrasing):

        dde->dde_lead_zio[p] = zio_write(...,
            zio->io_orig_abd, //abd
            zio->io_orig_size, //lsize
            zio->io_orig_size, //psize

So it preserves the incorrect abd size. It does seem odd to me that it sets the lsize of the new zio to be the same as the psize, but maybe that doesn't matter.

Is there any chance you could get a crash dump that we could use to check whether this hypothesis is correct, and to see if we can explain the difference in abd sizes of the two zios?

Incidentally, what is your use case for dedup=edonr,verify, as opposed to using a stronger hash function and not verifying? Dedup verification is not widely used and is a feature that we would potentially consider deprecating and eventually removing, so I'd like to understand if there are workloads for which it is much better than the alternatives.

dioni21 commented 4 years ago

Rebooted, tried again, and stopped with the same panic.

Not sure if this could be useful to debug:

new/jonny receive_resume_token 1-111f3c648d-f8-789c636064000310a500c4ec50360710e72765a526973030842941d460c8a7a515a79630407910793624f9a4ca92d46220ad71bc441c9bfe92fcf4d2cc140686f2aa6d7e895c2fbc0c90e439c1f27989b9a9403a312f5b3f2b3f2fafd2212db3a8b884010a24e03643f4a4e626a5a6e46783ede146124fcecf2d284a2d2e064ac101007bdd20a1

dioni21 commented 4 years ago

@ahrens I did not read your answer when I sent my last comment. And now I just did two whole new formats, one with checksum=skein dedup=skein,verify and another with checksum=on dedup=verify, and both got the same problem.

I will try again without --compressed

BTW: The panic message was in dmesg, but the system did not crash! Can I force a crash dump in this state? How? By setting the zpool failmode to panic and repeating the actions? Would it still be useful?

Why dedup=verify? Because I like dedup, and I am very paranoid about my data. :joy:

Hash functions are very good at identifying errors and forgery, but nothing compares to a full byte-by-byte verify.

I just recently reread the manual, found out about those new checksum methods, and wanted to try them.

dioni21 commented 4 years ago

Last one for today: as per @ahrens's suggestion, I disabled the zfs send options --embed --compressed, and now I don't get the panic, even using dedup=edonr,verify checksum=edonr.

Will leave system transferring 512GB of data and sleep for now.

But if I can help by generating a crash dump, I may restart from zero once again.

dioni21 commented 4 years ago

Everything went well now. Command used:

zfs send tank/jonny@snap | mbuffer -s 128k -b 128 | zfs receive -F -s -v new/jonny

Result:

summary: 536 GiByte in 1h 36min 14.9sec - average of 95.1 MiB/s

NAME               USED  AVAIL     REFER  MOUNTPOINT
new/jonny         1.01T  7.80T     1.01T  /new/jonny
new/jonny@snap    1.03M      -     1.01T  -
tank/jonny        1.01T  1.66T     1.01T  /tank/jonny
tank/jonny@snap   1.22G      -     1.01T  -

Will transfer the other datasets and see if I can finish them. Thanks for your help, @ahrens

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

scineram commented 3 years ago

Seems the dedup bug is still there.

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.