openzfs / zfs


ztest crashes with default zfs_abd_scatter_min_size #12793

Status: Open. rincebrain opened this issue 2 years ago.

rincebrain commented 2 years ago

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  11
Kernel Version        not relevant
Architecture          x86_64
OpenZFS Version       ded851b2

Describe the problem you're observing

ztest crashes an awful lot.

Most of the crashes, IME, look something like:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7737537 in __GI_abort () at abort.c:79
#2  0x00007ffff7ae6923 in vpanic (fmt=0x7ffff7d74540 "Got SIGSEGV at address: 0x%lx\n", adx=adx@entry=0x7fffffffce58) at kernel.c:612
#3  0x00007ffff7ae69bb in panic (fmt=fmt@entry=0x7ffff7d74540 "Got SIGSEGV at address: 0x%lx\n") at kernel.c:621
#4  0x00007ffff7afaeb6 in arc_buf_sigsegv (sig=<optimized out>, si=<optimized out>, unused=<optimized out>) at ../../module/zfs/arc.c:1515
#5  <signal handler called>
#6  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:437
#7  0x00007ffff7aef9be in abd_copy_to_buf_off_cb (buf=<optimized out>, size=size@entry=4096, private=private@entry=0x7fffffffd5a8) at ../../module/zfs/abd.c:828
#8  0x00007ffff7af1044 in abd_iterate_func (private=0x7fffffffd5a8, func=0x7ffff7aef9a0 <abd_copy_to_buf_off_cb>, size=4096, off=<optimized out>, abd=0x7fffd8021c90) at ../../module/zfs/abd.c:805
#9  abd_iterate_func (abd=0x7fffd8021c90, off=<optimized out>, size=<optimized out>, func=0x7ffff7aef9a0 <abd_copy_to_buf_off_cb>, private=0x7fffffffd5a8) at ../../module/zfs/abd.c:780
#10 0x00007ffff7af1278 in abd_copy_to_buf_off (buf=<optimized out>, abd=<optimized out>, off=off@entry=0, size=<optimized out>) at ../../module/zfs/abd.c:842
#11 0x00007ffff7b0239e in abd_copy_to_buf (size=<optimized out>, abd=<optimized out>, buf=<optimized out>) at ../../include/sys/abd.h:159
#12 arc_buf_fill (buf=0x555555b35a90, spa=spa@entry=0x5555556569a0, zb=zb@entry=0x7fffffffd6a0, flags=flags@entry=0) at ../../module/zfs/arc.c:2067
#13 0x00007ffff7b0327d in arc_untransform (buf=<optimized out>, spa=0x5555556569a0, zb=zb@entry=0x7fffffffd6a0, in_place=in_place@entry=B_FALSE) at ../../module/zfs/arc.c:2171
#14 0x00007ffff7b366f6 in dmu_objset_own_impl (ds=ds@entry=0x5555556af3b0, type=type@entry=DMU_OST_ANY, readonly=readonly@entry=B_TRUE, decrypt=decrypt@entry=B_TRUE, osp=osp@entry=0x7fffffffd8c8, tag=<optimized out>)
    at ../../module/zfs/dmu_objset.c:774
#15 0x00007ffff7b3abed in dmu_objset_own_impl (tag=0x555555573cf0 <__func__.9>, osp=0x7fffffffd8c8, decrypt=B_TRUE, readonly=B_TRUE, type=DMU_OST_ANY, ds=0x5555556af3b0) at ../../module/zfs/dmu_objset.c:757
#16 dmu_objset_own (name=name@entry=0x5555565cbfa0 "ztest/ds_4", type=type@entry=DMU_OST_ANY, readonly=readonly@entry=B_TRUE, decrypt=decrypt@entry=B_TRUE, tag=tag@entry=0x555555573cf0 <__func__.9>, osp=osp@entry=0x7fffffffd8c8)
    at ../../module/zfs/dmu_objset.c:808
#17 0x0000555555563fcb in ztest_dmu_objset_own (name=name@entry=0x5555565cbfa0 "ztest/ds_4", type=type@entry=DMU_OST_ANY, readonly=readonly@entry=B_TRUE, tag=tag@entry=0x555555573cf0 <__func__.9>, osp=osp@entry=0x7fffffffd8c8,
    decrypt=B_TRUE) at ztest.c:1602
#18 0x0000555555567d9c in ztest_replay_zil_cb (name=name@entry=0x5555565cbfa0 "ztest/ds_4", arg=arg@entry=0x0) at ztest.c:7255
#19 0x00007ffff7b35d0d in dmu_objset_find_impl (spa=spa@entry=0x5555556569a0, name=name@entry=0x5555565cbfa0 "ztest/ds_4", func=func@entry=0x555555567d70 <ztest_replay_zil_cb>, arg=arg@entry=0x0, flags=flags@entry=2)
    at ../../module/zfs/dmu_objset.c:2951
#20 0x00007ffff7b35e40 in dmu_objset_find_impl (spa=0x5555556569a0, name=name@entry=0x55555557a960 <ztest_opts> "ztest", func=func@entry=0x555555567d70 <ztest_replay_zil_cb>, arg=arg@entry=0x0, flags=flags@entry=2)
    at ../../module/zfs/dmu_objset.c:2894
#21 0x00007ffff7b3b303 in dmu_objset_find (name=name@entry=0x55555557a960 <ztest_opts> "ztest", func=func@entry=0x555555567d70 <ztest_replay_zil_cb>, arg=arg@entry=0x0, flags=flags@entry=2) at ../../module/zfs/dmu_objset.c:2967
#22 0x000055555555cdc4 in ztest_run (zs=0x7ffff7ffb738) at ztest.c:7563
#23 main (argc=<optimized out>, argv=<optimized out>) at ztest.c:8062

After a round of bisecting, I ended up at 87c25d56, which I would not have guessed, but here we are. And lo, if you extend ztest to set zfs_abd_scatter_min_size to 4097 on x86_64, it goes from crashing practically always to crashing never so far.
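
For reference, the workaround amounts to a one-liner. A hypothetical sketch (assuming zfs_abd_scatter_min_size is visible to ztest as an extern; if it's static to abd_os.c in the userland build, a small setter would be needed instead):

/* Hypothetical workaround sketch: raise the scatter threshold past the
 * 4 KiB page size so every allocation up to PAGESIZE gets a linear abd
 * instead of a scatter abd. Assumes the libzpool tunable is exported;
 * the type here should match its declaration in abd_os.c. */
extern unsigned int zfs_abd_scatter_min_size;

static void
ztest_force_linear_abds(void)
{
        zfs_abd_scatter_min_size = 4097;    /* one byte past a 4 KiB page */
}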

If we ask valgrind, it first complains a lot about uninitialized values being read in the crypto code, but if you zero those, the remaining complaints eventually come down to:

==129414== Thread 109:
==129414== Conditional jump or move depends on uninitialised value(s)
==129414==    at 0x483EEEE: bcmp (vg_replace_strmem.c:1111)
==129414==    by 0x48ABA33: abd_cmp_buf_off_cb (abd.c:852)
==129414==    by 0x48AD043: abd_iterate_func (abd.c:805)
==129414==    by 0x48AD043: abd_iterate_func (abd.c:780)
==129414==    by 0x48AD304: abd_cmp_buf_off (abd.c:866)
==129414==    by 0x48AD377: abd_cmp_buf (abd.h:165)
==129414==    by 0x48AD377: abd_return_buf (abd.c:673)
==129414==    by 0x48C0F42: arc_read_done (arc.c:5692)
==129414==    by 0x4A606A3: zio_done (zio.c:4835)
==129414==    by 0x4A54968: __zio_execute (zio.c:2209)
==129414==    by 0x4A54968: zio_execute (zio.c:2122)
==129414==    by 0x48A3961: taskq_thread (taskq.c:237)
==129414==    by 0x4EFFEA6: start_thread (pthread_create.c:477)
==129414==    by 0x5018DEE: clone (clone.S:95)
==129414==  Uninitialised value was created by a heap allocation
==129414==    at 0x483AEB8: memalign (vg_replace_malloc.c:906)
==129414==    by 0x483AFCE: posix_memalign (vg_replace_malloc.c:1070)
==129414==    by 0x48AF0BF: umem_alloc_aligned (umem.h:105)
==129414==    by 0x48AF0BF: abd_alloc_chunks (abd_os.c:579)
==129414==    by 0x48ABF8C: abd_alloc (abd.c:192)
==129414==    by 0x48BC838: arc_hdr_alloc_abd (arc.c:3191)
==129414==    by 0x48C24DD: arc_read (arc.c:6188)
==129414==    by 0x4A4C5BF: zil_read_log_block (zil.c:241)
==129414==    by 0x4A4C5BF: zil_parse (zil.c:398)
==129414==    by 0x4A4D659: zil_check_log_chain (zil.c:975)
==129414==    by 0x48F225A: dmu_objset_find_dp_impl (dmu_objset.c:2725)
==129414==    by 0x48F2625: dmu_objset_find_dp_cb (dmu_objset.c:2758)
==129414==    by 0x48A3961: taskq_thread (taskq.c:237)
==129414==    by 0x4EFFEA6: start_thread (pthread_create.c:477)
==129414==
==129414==
==129414== Process terminating with default action of signal 6 (SIGABRT): dumping core
==129414==    at 0x4F56CE1: raise (raise.c:51)
==129414==    by 0x4F40536: abort (abort.c:79)
==129414==    by 0x48A2922: vpanic (kernel.c:612)
==129414==    by 0x48A29BA: panic (kernel.c:621)
==129414==    by 0x48B6EB5: arc_buf_sigsegv (arc.c:1515)
==129414==    by 0x4F0B13F: ??? (in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
==129414==    by 0x483F7F2: memmove (vg_replace_strmem.c:1270)
==129414==    by 0x48AB9BD: abd_copy_to_buf_off_cb (abd.c:828)
==129414==    by 0x48AD043: abd_iterate_func (abd.c:805)
==129414==    by 0x48AD043: abd_iterate_func (abd.c:780)
==129414==    by 0x48AD277: abd_copy_to_buf_off (abd.c:842)
==129414==    by 0x48BE39D: abd_copy_to_buf (abd.h:159)
==129414==    by 0x48BE39D: arc_buf_fill (arc.c:2067)
==129414==    by 0x48BF27C: arc_untransform (arc.c:2171)

Obviously we could just...make ztest do that for now, but that seems problematic, and it's not presently clear to me whether the logical flaw is in the umem implementations of things or somewhere else. (Will continue looking, of course.)

Describe how to reproduce the problem

Above.

Include any warning/errors/backtraces from the system logs

Above.

rincebrain commented 2 years ago

Oh good, past me dug into this already. #12214

rincebrain commented 2 years ago

Well, I dug into why it crashes using ZFS_DEBUG="watch", but it turns out the reason it crashes there is not the reason it dies so often in the CI. (The reason it crashes there is that it winds up with a buffer that's still only PROT_READ not also PROT_WRITE, which is a neat trick, since it's marked compressed, so I think it should never be frozen...still looking into where the "right" place to adjust that is, since nominally this shouldn't ever arise...)
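
(For anyone who hasn't used ZFS_DEBUG="watch": in userland debug builds the ARC write-protects frozen buffers with mprotect(2), so a stray write faults and arc_buf_sigsegv() turns the SIGSEGV into the panic in the backtrace above. A simplified sketch of the mechanism, loosely after arc_buf_watch()/arc_buf_unwatch() in module/zfs/arc.c:)

#include <stdlib.h>
#include <sys/mman.h>

/* Simplified sketch of the "watch" mechanism: a frozen buffer is made
 * read-only, so any unexpected write raises SIGSEGV and the handler
 * panics with the faulting address. */
static void
watch_buf(void *buf, size_t len)
{
        if (mprotect(buf, len, PROT_READ) != 0)
                abort();
}

/* A thawed buffer must be made writable again before legitimate
 * writes; the crash described above is a buffer left at PROT_READ. */
static void
unwatch_buf(void *buf, size_t len)
{
        if (mprotect(buf, len, PROT_READ | PROT_WRITE) != 0)
                abort();
}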

Time to go down a rabbit hole again.

behlendorf commented 2 years ago

Based on the valgrind output it looks to me like the flaw here is in arc_read_done(). In the event that the read fails, the abd pages may contain uninitialized data, but we unconditionally run the BP_PROTECTED(bp) block which accesses those pages. It seems to me we should only be doing this when io_error == 0.
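
Roughly, as a sketch (the condition names follow arc_read_done(), where the macro is spelled BP_IS_PROTECTED(); everything around it is elided):

/* Sketch of the suggested guard in arc_read_done(): only run the
 * protected-BP handling, which reads the abd contents, when the read
 * actually succeeded; on error those pages may be uninitialized. */
if (zio->io_error == 0 && BP_IS_PROTECTED(bp)) {
        /* ... decode salt/IV/MAC from the BP and abd, as before ... */
}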

This only causes a crash in ztest since in the kernel those pages will be mapped and contain something. In user space we'll be doing the I/O with vdev_file_io_start() / vdev_file_io_done() which rely on umem / posix_memalign() which won't initialize anything (and valgrind rightly complains about that). There's probably a reasonable argument to be made that we should be zero filling the pages on a read error just to be safe.
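
That part is easy to confirm in isolation; a minimal standalone test (not from the issue) shows the indeterminate contents valgrind is tracking:

#include <stdlib.h>
#include <string.h>

/* Standalone illustration (not from the issue): posix_memalign()
 * returns indeterminate bytes, so reading them before writing is
 * exactly what valgrind flags in the abd_cmp_buf() report above.
 * Zero filling, as suggested for the read-error path, makes the
 * contents defined. */
int
main(void)
{
        void *buf = NULL;

        if (posix_memalign(&buf, 4096, 4096) != 0)
                return (1);
        /* buf's contents are indeterminate here. */
        memset(buf, 0, 4096);   /* now defined */
        free(buf);
        return (0);
}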

I'm still not quite sure why setting zfs_abd_scatter_min_size to 4097 prevents this issue. Perhaps when allocating anything less than a page (for a linear abd) posix_memalign() internally does initialize the buffer.

rincebrain commented 2 years ago

Huh, so you think the initial take on it was correct?

ISTR that PR not stopping the crash, though, which was why I closed it...


behlendorf commented 2 years ago

It sure does look like a real problem. But based on your testing perhaps not the root cause of this segfault...

rincebrain commented 2 years ago

I'll go rerun the experiment just to be sure. As I said in #12214, I don't think the change is incorrect, so merging it would be fine if you agree it's a real issue worth avoiding; it just didn't appear to solve this when I most recently tested.

From a quick glance at the recent zloop failures, the common failure modes looked like a ztest interaction with indirect vdevs that might be specific to the vdev properties PR, here, and a weird write failure I don't immediately understand, here; neither seems to stem from this.

Now I'm trying to recall if I got here because I wanted to use ZFS_DEBUG="watch" for something else and hit this, or hit a crash not using it and then got this backtrace using it...hm.

(...maybe I should script perusing the zloop test logs...)

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.