openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs send segfault in fletcher_4_native for encrypted replication (-Rw) sends #13620

Open implr opened 2 years ago

implr commented 2 years ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | gentoo |
| Distribution Version | ~amd64 |
| Kernel Version | 5.18.8-gentoo (also fails on 5.17.9) |
| Architecture | amd64 |
| OpenZFS Version | zfs-2.1.5-r2-gentoo |

zfs send -Rw dataset@snap consistently crashes before writing any output. I first noticed it through this dmesg message:

[  489.458480] traps: zfs[16083] general protection fault ip:7f0bd178b940 sp:7ffe5c380440 error:0 in libzfs.so.4.1.0[7f0bd174f000+44000]

Full backtrace:

# gdb /sbin/zfs
GNU gdb (Gentoo 12.1 vanilla) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/zfs...
Reading symbols from /usr/lib/debug//sbin/zfs.debug...
(gdb) r send  -Rw  zslow/crypt@tape3-220703 > /dev/null
Starting program: /sbin/zfs send  -Rw  zslow/crypt@tape3-220703 > /dev/null
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
fletcher_4_native (buf=0x555555596310, size=3164, ctx_template=<optimized out>, zcp=0x7fffffffbdb0) at ../../module/zcommon/zfs_fletcher.c:482
482                             fletcher_4_scalar_native((fletcher_4_ctx_t *)zcp,
(gdb) bt
#0  fletcher_4_native (buf=0x555555596310, size=3164, ctx_template=<optimized out>, zcp=0x7fffffffbdb0) at ../../module/zcommon/zfs_fletcher.c:482
#1  0x00007ffff7f5c9f3 in fletcher_4_incremental_impl (zcp=<optimized out>, size=3164, buf=<optimized out>, native=<optimized out>) at ../../module/zcommon/zfs_fletcher.c:565
#2  fletcher_4_incremental_native (buf=buf@entry=0x555555596310, size=size@entry=3164, data=data@entry=0x7fffffffbe80) at ../../module/zcommon/zfs_fletcher.c:584
#3  0x00007ffff7f471d1 in dump_record (outfd=0, zc=0x7fffffffbe80, payload_len=3164, payload=0x555555596310, drr=0x7fffffffbea0) at libzfs_sendrecv.c:106
#4  send_prelim_records (zhp=zhp@entry=0x555555584950, from=from@entry=0x0, fd=fd@entry=1, gather_props=<optimized out>, recursive=<optimized out>, verbose=verbose@entry=B_FALSE, dryrun=<optimized out>, raw=<optimized out>, replicate=<optimized out>, skipmissing=<optimized out>, backup=<optimized out>, 
    holds=<optimized out>, props=<optimized out>, doall=<optimized out>, fssp=<optimized out>, fsavlp=<optimized out>) at libzfs_sendrecv.c:2087
#5  0x00007ffff7f4c8db in zfs_send (zhp=zhp@entry=0x55555557f690, fromsnap=fromsnap@entry=0x0, tosnap=tosnap@entry=0x55555557f5fc "tape3-220703", flags=flags@entry=0x7fffffffd090, outfd=outfd@entry=1, filter_func=filter_func@entry=0x0, cb_arg=<optimized out>, debugnvp=<optimized out>) at libzfs_sendrecv.c:2179
#6  0x0000555555562dec in zfs_do_send (argc=<optimized out>, argv=<optimized out>) at zfs_main.c:4725
#7  0x000055555555b37d in main (argc=4, argv=<optimized out>) at zfs_main.c:8711
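
For readers unfamiliar with the function at the top of the backtrace, here is a minimal sketch of a scalar Fletcher-4 checksum. It is not the OpenZFS source: the names `cksum4_t` and `fletcher4_scalar` are hypothetical, and the sketch only assumes the usual Fletcher-4 definition (four 64-bit running sums over 32-bit words, in native byte order). The `size=3164` seen in the trace corresponds to 791 such words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a four-word checksum accumulator. */
typedef struct {
	uint64_t word[4];
} cksum4_t;

/*
 * Scalar Fletcher-4 sketch: read the buffer as native-endian 32-bit
 * words and keep four running sums, each one feeding the next.
 * The sums start from the values already in *ck, so calling this
 * repeatedly on consecutive chunks continues the same checksum.
 */
static void
fletcher4_scalar(const uint32_t *buf, size_t size, cksum4_t *ck)
{
	/* Assumption of this sketch: whole 32-bit words only. */
	assert(size % sizeof (uint32_t) == 0);

	const uint32_t *end = buf + size / sizeof (uint32_t);
	uint64_t a = ck->word[0];
	uint64_t b = ck->word[1];
	uint64_t c = ck->word[2];
	uint64_t d = ck->word[3];

	for (; buf < end; buf++) {
		a += *buf;
		b += a;
		c += b;
		d += c;
	}

	ck->word[0] = a;
	ck->word[1] = b;
	ck->word[2] = c;
	ck->word[3] = d;
}
```

Because the accumulator is carried in and out through `ck`, checksumming a stream in pieces gives the same result as checksumming it in one call, which is the pattern suggested by `fletcher_4_incremental_native` in frame #2.
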

rincebrain commented 2 years ago

I think this is just #13605, which in that person's case was a problem when they used overridden CFLAGS for -march - in particular, I suspect what's happening is that it's compiling the non-SIMD version of the code to a SIMD version, but then something unsafe is ensuing because it's using it somewhere that that would not be safe.

implr commented 2 years ago

That is possible; I missed that issue. I'm building with -O2 -march=native -pipe -ggdb, which would be equivalent to -march=znver2 in my case. I've been using those flags for over a year, though, so it's caused either by my recent upgrade to GCC 12 or by something in ZFS.

I'll try without -march.

implr commented 2 years ago

Seems to work fine now without -march=native. Filed https://bugs.gentoo.org/856373 with Gentoo.

KungFuJesus commented 2 years ago

> I think this is just #13605, which in that person's case was a problem when they used overridden CFLAGS for -march - in particular, I suspect what's happening is that it's compiling the non-SIMD version of the code to a SIMD version, but then something unsafe is ensuing because it's using it somewhere that that would not be safe.

Hmm, I would imagine userspace auto-vectorization is fair game, no? This seems like it could be a compiler bug, or possibly some undefined behavior?
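
To illustrate one kind of undefined behavior that can surface only once the compiler vectorizes scalar code, here is a minimal sketch. It is speculation, not a diagnosis of this issue: the types `plain_cksum_t` and `wide_ctx_t` are hypothetical and not copied from OpenZFS, and `__attribute__((aligned))` is a GCC/Clang extension. The idea is that casting a pointer to a more strictly aligned type lets the compiler assume that alignment and emit aligned vector loads or stores, which fault on an address that does not satisfy it. For reference, the `zcp` value in frame #0 (0x7fffffffbdb0) is a multiple of 16 but not of 32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plain checksum: naturally aligned like uint64_t. */
typedef struct {
	uint64_t word[4];
} plain_cksum_t;

/*
 * Hypothetical context union: the same storage viewed either as a
 * plain checksum or as an over-aligned vector-sized block. The
 * aligned member raises the alignment of the whole union.
 */
typedef union {
	plain_cksum_t scalar;
	struct {
		uint64_t v[4] __attribute__((aligned(32)));
	} wide;
} wide_ctx_t;

/*
 * Return nonzero if addr is a multiple of align.
 * align must be a nonzero power of two.
 */
static int
is_aligned(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((addr & (align - 1)) == 0);
}

/*
 * Return nonzero if a plain_cksum_t pointer could be reinterpreted
 * as a wide_ctx_t pointer without violating the alignment the
 * compiler is entitled to assume for wide_ctx_t.
 */
static int
cast_is_alignment_safe(const plain_cksum_t *ck)
{
	return (is_aligned((uint64_t)(uintptr_t)ck, _Alignof(wide_ctx_t)));
}
```

Under these assumptions, a stack-allocated `plain_cksum_t` passes the check only by luck, so code that casts such a pointer can work for years and then break after a compiler upgrade or a change in `-march`.
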

rincebrain commented 2 years ago

In theory, yes.

I have a few guesses about what broke, but haven't looked into it yet.

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.