openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Kernel bug when flushing memory caches for hugepages from Linux 6.3.1 to 6.7.4 #15140

Closed RodoMa92 closed 1 week ago

RodoMa92 commented 1 year ago

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version updated to latest
Kernel Version 6.6.9-arch1-1
Architecture X86_64
OpenZFS Version zfs-kmod-2.2.2-1

Describe the problem you're observing

While executing the prepare script for a QEMU virtual machine, if I'm on any kernel version from 6.3.1 up to the latest 6.4.7, the script crashes with the following stack trace (this log is from a crash on 6.3.9, but I have tested both extremes of that range and the error is almost identical to the one below):

[ 2682.534320] bash (54689): drop_caches: 3
[ 2682.624207] ------------[ cut here ]------------
[ 2682.624211] kernel BUG at mm/migrate.c:662!
[ 2682.624219] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 2682.624223] CPU: 2 PID: 54689 Comm: bash Tainted: P           OE      6.3.9-arch1-1 #1 124dc55df4f5272ccb409f39ef4872fc2b3376a2
[ 2682.624226] Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 5102 05/31/2023
[ 2682.624228] RIP: 0010:migrate_folio_extra+0x6c/0x70
[ 2682.624234] Code: de 48 89 ef e8 35 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 e7 6d 9d 00 e8 22 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 d4 6d 9d 00 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f
[ 2682.624236] RSP: 0018:ffffb4685b5038f8 EFLAGS: 00010282
[ 2682.624238] RAX: 02ffff0000008025 RBX: ffffd9f684f02740 RCX: 0000000000000002
[ 2682.624240] RDX: ffffd9f684f02740 RSI: ffffd9f68d958dc0 RDI: ffff99d8d1cfe728
[ 2682.624241] RBP: ffff99d8d1cfe728 R08: 0000000000000000 R09: 0000000000000000
[ 2682.624242] R10: ffffd9f68d958dc8 R11: 0000000004020000 R12: ffffd9f68d958dc0
[ 2682.624243] R13: 0000000000000002 R14: ffffd9f684f02740 R15: ffffb4685b5039b8
[ 2682.624245] FS:  00007f78b8182740(0000) GS:ffff99de9ea80000(0000) knlGS:0000000000000000
[ 2682.624246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2682.624248] CR2: 00007fe9a0001960 CR3: 000000011e406000 CR4: 00000000003506e0
[ 2682.624249] Call Trace:
[ 2682.624251]  <TASK>
[ 2682.624253]  ? die+0x36/0x90
[ 2682.624258]  ? do_trap+0xda/0x100
[ 2682.624261]  ? migrate_folio_extra+0x6c/0x70
[ 2682.624263]  ? do_error_trap+0x6a/0x90
[ 2682.624266]  ? migrate_folio_extra+0x6c/0x70
[ 2682.624268]  ? exc_invalid_op+0x50/0x70
[ 2682.624271]  ? migrate_folio_extra+0x6c/0x70
[ 2682.624273]  ? asm_exc_invalid_op+0x1a/0x20
[ 2682.624278]  ? migrate_folio_extra+0x6c/0x70
[ 2682.624280]  move_to_new_folio+0x136/0x150
[ 2682.624283]  migrate_pages_batch+0x913/0xd30
[ 2682.624285]  ? __pfx_compaction_free+0x10/0x10
[ 2682.624289]  ? __pfx_remove_migration_pte+0x10/0x10
[ 2682.624292]  migrate_pages+0xc61/0xde0
[ 2682.624295]  ? __pfx_compaction_alloc+0x10/0x10
[ 2682.624296]  ? __pfx_compaction_free+0x10/0x10
[ 2682.624300]  compact_zone+0x865/0xda0
[ 2682.624303]  compact_node+0x88/0xc0
[ 2682.624306]  sysctl_compaction_handler+0x46/0x80
[ 2682.624308]  proc_sys_call_handler+0x1bd/0x2e0
[ 2682.624312]  vfs_write+0x239/0x3f0
[ 2682.624316]  ksys_write+0x6f/0xf0
[ 2682.624317]  do_syscall_64+0x60/0x90
[ 2682.624322]  ? syscall_exit_to_user_mode+0x1b/0x40
[ 2682.624324]  ? do_syscall_64+0x6c/0x90
[ 2682.624327]  ? syscall_exit_to_user_mode+0x1b/0x40
[ 2682.624329]  ? exc_page_fault+0x7c/0x180
[ 2682.624330]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 2682.624333] RIP: 0033:0x7f78b82f5bc4
[ 2682.624355] Code: 15 99 11 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 80 3d 3d 99 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
[ 2682.624356] RSP: 002b:00007ffd9d25ed18 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2682.624358] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f78b82f5bc4
[ 2682.624359] RDX: 0000000000000002 RSI: 000055c97c5f05c0 RDI: 0000000000000001
[ 2682.624360] RBP: 000055c97c5f05c0 R08: 0000000000000073 R09: 0000000000000001
[ 2682.624362] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[ 2682.624363] R13: 00007f78b83d86a0 R14: 0000000000000002 R15: 00007f78b83d3ca0
[ 2682.624365]  </TASK>
[ 2682.624366] Modules linked in: vhost_net vhost vhost_iotlb tap tun snd_seq_dummy snd_hrtimer snd_seq xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter bridge stp llc intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd snd_hda_codec_realtek snd_hda_codec_generic kvm snd_hda_codec_hdmi snd_usb_audio btusb btrtl snd_hda_intel btbcm snd_intel_dspcfg crct10dif_pclmul btintel crc32_pclmul snd_intel_sdw_acpi btmtk vfat polyval_clmulni snd_usbmidi_lib polyval_generic fat snd_hda_codec ext4 gf128mul snd_rawmidi eeepc_wmi bluetooth ghash_clmulni_intel snd_hda_core sha512_ssse3 asus_wmi snd_seq_device aesni_intel mc ledtrig_audio snd_hwdep crc32c_generic crypto_simd snd_pcm sparse_keymap crc32c_intel igb ecdh_generic platform_profile sp5100_tco cryptd snd_timer mbcache rapl rfkill wmi_bmof pcspkr dca asus_wmi_sensors snd i2c_piix4 zenpower(OE) ccp
[ 2682.624417]  jbd2 crc16 soundcore gpio_amdpt gpio_generic mousedev acpi_cpufreq joydev mac_hid dm_multipath i2c_dev crypto_user loop fuse dm_mod bpf_preload ip_tables x_tables usbhid zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) nouveau nvme nvme_core xhci_pci nvme_common xhci_pci_renesas vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd amdgpu i2c_algo_bit drm_ttm_helper ttm mxm_wmi video wmi drm_buddy gpu_sched drm_display_helper cec
[ 2682.624456] ---[ end trace 0000000000000000 ]---
[ 2682.624457] RIP: 0010:migrate_folio_extra+0x6c/0x70
[ 2682.624461] Code: de 48 89 ef e8 35 e2 ff ff 5b 44 89 e0 5d 41 5c 41 5d e9 e7 6d 9d 00 e8 22 e2 ff ff 44 89 e0 5b 5d 41 5c 41 5d e9 d4 6d 9d 00 <0f> 0b 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f
[ 2682.624463] RSP: 0018:ffffb4685b5038f8 EFLAGS: 00010282
[ 2682.624465] RAX: 02ffff0000008025 RBX: ffffd9f684f02740 RCX: 0000000000000002
[ 2682.624466] RDX: ffffd9f684f02740 RSI: ffffd9f68d958dc0 RDI: ffff99d8d1cfe728
[ 2682.624467] RBP: ffff99d8d1cfe728 R08: 0000000000000000 R09: 0000000000000000
[ 2682.624469] R10: ffffd9f68d958dc8 R11: 0000000004020000 R12: ffffd9f68d958dc0
[ 2682.624470] R13: 0000000000000002 R14: ffffd9f684f02740 R15: ffffb4685b5039b8
[ 2682.624472] FS:  00007f78b8182740(0000) GS:ffff99de9ea80000(0000) knlGS:0000000000000000
[ 2682.624473] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2682.624475] CR2: 00007fe9a0001960 CR3: 000000011e406000 CR4: 00000000003506e0

I had my previous installation on top of BtrFS available from before switching to ZFS as root, so I could test the same thing under another filesystem. It does not happen there on the latest kernel.

I'm using a zvol as the backing store for the VM; the libvirt XML is also attached below, with minor fields redacted.

Describe how to reproduce the problem

1) Use a kernel >= 6.3.1
2) Load the VirtualBox kernel drivers (vboxdrv, vboxnetadp, vboxnetflt)
3) Have an encrypted dataset mounted on the system and a zvol created on that filesystem (not sure if both are needed as a precondition)
4) Execute this command as root: sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
5) The crash is triggered

It seems that the combination of drop_caches and compact_memory interacts somehow with ZFS + this commit, and with the VirtualBox driver loaded they clash with each other and cause a kernel oops.

I'm using the above to manage hugepage allocation for my VMs.

With anything older than 6.3 the code works perfectly fine, and I can boot up the VM each time with no issues. I'm currently using 6.1 LTS, and I have no problems with the VM itself.

The above is needed to be able to allocate 1 GB hugepages correctly; otherwise, after the first bootup, memory is too fragmented to allocate 1 GB chunks without compacting it first, and the VM fails to boot properly. This sometimes causes host system instability and issues with the host shutting down cleanly.
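
For reference, the relevant part of my prepare flow looks roughly like this (the hugepage count and the 1 GiB sysfs node are illustrative here, not the exact values from the attached script):

sync
echo 3 > /proc/sys/vm/drop_caches      # drop page cache, dentries and inodes
echo 1 > /proc/sys/vm/compact_memory   # defragment free memory
echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages   # then reserve 1 GiB hugepages for the guest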

qemu.tar.gz GamingWin11.tar.gz

rincebrain commented 1 year ago

The timing of when that breaks makes me wonder about something related to https://lwn.net/Articles/937943/, but who knows.

rincebrain commented 1 year ago

Also, as an aside, can't you tell the kernel at boot time to pre-reserve 1G hugepages for you?

Not that this isn't a bug, but just as a workaround for your use case atm.
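
Something along these lines on the kernel command line would reserve them at boot (the count is only an example):

default_hugepagesz=1G hugepagesz=1G hugepages=16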

RodoMa92 commented 1 year ago

Also, as an aside, can't you tell the kernel at boot time to pre-reserve 1G hugepages for you?

Not that this isn't a bug, but just as a workaround for your use case atm.

Yeah, sure, but then the memory is locked away. I would like to keep using it when the VM is not in use. I rarely use Windows anyway these days :P

For now remaining on 6.1 LTS is good enough, at least until someone can track down this issue.

I've noticed that now even just the initial call crashes the kernel; it might be related to the changes in the mm subsystem. But that's just speculation on my part, since I can't reproduce the issue with the stock kernel in the same way (not that this excludes a kernel bug either, to be clear).

Testing mainline shortly to check if they have already fixed this issue.

rincebrain commented 1 year ago

What do you mean, you can't reproduce it with stock? It works on vanilla 6.3.x/6.4.x but not Arch's patchset?

RodoMa92 commented 1 year ago

It works fine without zfs-dkms installed. It doesn't work with zfs-dkms installed.

rincebrain commented 1 year ago

Does it break with ZFS loaded and no pools imported?

RodoMa92 commented 1 year ago

Does it break with ZFS loaded and no pools imported?

I can try it shortly on my old BtrFS install; I'll finish testing mainline first :P

RodoMa92 commented 1 year ago

Well, heck, dkms doesn't build. Testing with just the module loaded now.

RodoMa92 commented 1 year ago

Interesting: without mounting my ZFS encrypted root drive, it doesn't seem to trigger. I'll do further testing tomorrow.

numinit commented 1 year ago

I can reproduce this too. Good find.

numinit commented 1 year ago

BTW I think the issue has to do with compacting memory. I can still reserve hugepages just fine.

RodoMa92 commented 1 year ago

BTW I think the issue has to do with compacting memory. I can still reserve hugepages just fine.

Yeah, the trigger is definitely memory compaction, not hugepage allocation by itself. Good to know that I'm not the only one with this issue :)

RodoMa92 commented 1 year ago

Starting a bisect now; I'll update shortly if I can find something from it. At least it should narrow down the possible changes.
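
For reference, the workflow is roughly the following (the endpoints here are placeholders, not the exact commits I marked):

git bisect start
git bisect good <last-known-good-commit>   # e.g. a merge from before the 6.3 mm changes
git bisect bad <first-known-bad-commit>    # e.g. the mm-stable merge that went into 6.3
# at each step: build and install the kernel, rebuild zfs-dkms against it,
# reboot, run the reproducer, then mark the result:
git bisect good   # or: git bisect bad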

RodoMa92 commented 1 year ago

I finally have the result of the bisection:

5dfab109d5193e6c224d96cabf90e9cc2c039884 is the first bad commit
commit 5dfab109d5193e6c224d96cabf90e9cc2c039884
Author: Huang Ying <ying.huang@intel.com>
Date:   Mon Feb 13 20:34:40 2023 +0800

    migrate_pages: batch _unmap and _move

    In this patch the _unmap and _move stage of the folio migration is
    batched.  That for, previously, it is,

      for each folio
        _unmap()
        _move()

    Now, it is,

      for each folio
        _unmap()
      for each folio
        _move()

    Based on this, we can batch the TLB flushing and use some hardware
    accelerator to copy folios between batched _unmap and batched _move
    stages.

    Link: https://lkml.kernel.org/r/20230213123444.155149-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
    Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
    Cc: Oscar Salvador <osalvador@suse.de>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Bharata B Rao <bharata@amd.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Xin Hao <xhao@linux.alibaba.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

 mm/migrate.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 189 insertions(+), 25 deletions(-)

I'll try to revert this on top of the latest kernel to see if it still works fine.

git bisect log here

git bisect start
# status: waiting for both good and bad commits
# good: [a5c95ca18a98d742d0a4a04063c32556b5b66378] Merge tag 'drm-next-2023-02-23' of git://anongit.freedesktop.org/drm/drm
git bisect good a5c95ca18a98d742d0a4a04063c32556b5b66378
# status: waiting for bad commit, 1 good commit known
# bad: [3822a7c40997dc86b1458766a3f146d62393f084] Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 3822a7c40997dc86b1458766a3f146d62393f084
# good: [620932cd285208ef3009ac338b1eeed13ccd1753] mm/damon/dbgfs: print DAMON debugfs interface deprecation message
git bisect good 620932cd285208ef3009ac338b1eeed13ccd1753
# good: [7482c19173b7eb044d476b3444d7ee55bc669d03] selftests: arm64: Fix incorrect kernel headers search path
git bisect good 7482c19173b7eb044d476b3444d7ee55bc669d03
# good: [81ce2ebd194cf32027854ce1c703b7fd129c86b8] mm/slab.c: cleanup is_debug_pagealloc_cache()
git bisect good 81ce2ebd194cf32027854ce1c703b7fd129c86b8
# good: [65c084d848cd717d5913032dfa9e9c62ed33babd] leds: blinkm: Convert to i2c's .probe_new()
git bisect good 65c084d848cd717d5913032dfa9e9c62ed33babd
# good: [6a60dd2e876913be55e17e53ee57e1fe09448238] perf vendor events arm64: Add TLB metrics for neoverse-n2-v2
git bisect good 6a60dd2e876913be55e17e53ee57e1fe09448238
# good: [869b9eddf0b38a22c27a400e2fa849d2ff2aa7e1] mfd: intel-m10-bmc: Add PMCI driver
git bisect good 869b9eddf0b38a22c27a400e2fa849d2ff2aa7e1
# good: [45204677d427b7d0ed11930bd5be4a42893d1c93] perf symbols: Allow for .plt entries with no symbol
git bisect good 45204677d427b7d0ed11930bd5be4a42893d1c93
# good: [3a396f9859755e822775319516cd71dabc2b4e69] backlight: sky81452: Fix sky81452_bl_platform_data kernel-doc
git bisect good 3a396f9859755e822775319516cd71dabc2b4e69
# good: [a912f5975ffc82d52bbb5937eafe367d44db711c] perf test: Replace legacy `...` with $(...)
git bisect good a912f5975ffc82d52bbb5937eafe367d44db711c
# skip: [2b79eb73e2c4b362a2a261b7b2f718385fb478e4] Merge tag 'probes-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
git bisect skip 2b79eb73e2c4b362a2a261b7b2f718385fb478e4
# good: [db95818e888a927456686518880ed0145b1f20ce] perf pmu-events: Add separate metric from pmu_event
git bisect good db95818e888a927456686518880ed0145b1f20ce
# skip: [cd43b5068647f47d6936ffef4d15d99518fcab94] Merge tag 'slab-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
git bisect skip cd43b5068647f47d6936ffef4d15d99518fcab94
# good: [cf1d2ffcc6f17b422239f6ab34b078945d07f9aa] efi: Discover BTI support in runtime services regions
git bisect good cf1d2ffcc6f17b422239f6ab34b078945d07f9aa
# skip: [0df82189bc42037678fa590a77ed0116f428c90d] Merge tag 'perf-tools-for-v6.3-1-2023-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
git bisect skip 0df82189bc42037678fa590a77ed0116f428c90d
# good: [1470a108a60e8c0c4d19da10117c9b98f0078654] perf c2c: Add report option to show false sharing in adjacent cachelines
git bisect good 1470a108a60e8c0c4d19da10117c9b98f0078654
# good: [c2d3cf3653a8ff6e4b402d55e7f84790ac08a8ad] selftests: filesystems: Fix incorrect kernel headers search path
git bisect good c2d3cf3653a8ff6e4b402d55e7f84790ac08a8ad
# skip: [d8763154455e92a2ffed256e48fa46bb35ef3bdf] Merge tag 'printk-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
git bisect skip d8763154455e92a2ffed256e48fa46bb35ef3bdf
# good: [1f428356c38dcbe49fd2f1c488b41e88720ead92] rtla: Add hwnoise tool
git bisect good 1f428356c38dcbe49fd2f1c488b41e88720ead92
# bad: [f9366f4c2a29d14f5992b195e268240c2deb116e] include/linux/migrate.h: remove unneeded externs
git bisect bad f9366f4c2a29d14f5992b195e268240c2deb116e
# good: [42012e0436d44aeb2e68f11a28ddd0ad3f38b61f] migrate_pages: restrict number of pages to migrate in batch
git bisect good 42012e0436d44aeb2e68f11a28ddd0ad3f38b61f
# bad: [9325ddf90ec3a801c09da374b74532d4589a7346] m68k/nommu: add missing definition of ARCH_PFN_OFFSET
git bisect bad 9325ddf90ec3a801c09da374b74532d4589a7346
# bad: [6f7d760e86fa84862d749e36ebd29abf31f4f883] migrate_pages: move THP/hugetlb migration support check to simplify code
git bisect bad 6f7d760e86fa84862d749e36ebd29abf31f4f883
# bad: [80562ba0d8378e89fe5836c28ea56c2aab3014e8] migrate_pages: move migrate_folio_unmap()
git bisect bad 80562ba0d8378e89fe5836c28ea56c2aab3014e8
# bad: [5dfab109d5193e6c224d96cabf90e9cc2c039884] migrate_pages: batch _unmap and _move
git bisect bad 5dfab109d5193e6c224d96cabf90e9cc2c039884
# good: [64c8902ed4418317cd416c566f896bd4a92b2efc] migrate_pages: split unmap_and_move() to _unmap() and _move()
git bisect good 64c8902ed4418317cd416c566f896bd4a92b2efc
# first bad commit: [5dfab109d5193e6c224d96cabf90e9cc2c039884] migrate_pages: batch _unmap and _move

RodoMa92 commented 1 year ago

LMK how this should proceed, and to whom this needs to be reported (whether it's purely a ZFS bug, or whether it can affect other parts of the kernel and needs to be reported directly to the Linux developers).

RodoMa92 commented 1 year ago

Yeah, unfortunately too many changes have been applied in the mm subsystem since, so I really can't easily revert that commit on top of the latest Linux kernel. Please let me know how this should proceed.

satmandu commented 1 year ago

Shouldn't this be reported as a kernel bug, especially as there is a simple reproducer? Does this occur with the 6.5-rc kernels too?

RodoMa92 commented 1 year ago

I can't reproduce it without ZFS on an encrypted root, so I can't prove it's a kernel regression. Highly likely if you ask me, but that's just my opinion at this point.

RodoMa92 commented 1 year ago

I just tested 10 bootup and shutdown cycles with ZFS loaded (not on root) and a non-encrypted dataset imported, and this didn't cause any issues. If someone from OpenZFS could take a look at why calling drop_caches causes a kernel oops with ZFS on an encrypted root, it would be appreciated.

igrekster commented 1 year ago

It also happens on a non-encrypted root, but with an encrypted dataset present. So it looks like it is related to encryption.

RodoMa92 commented 1 year ago

Thanks a lot for the report; this at least narrows down the area.

numinit commented 1 year ago

I have encrypted datasets too and can confirm this happens.

RodoMa92 commented 1 year ago

This is still an issue on the latest main 6.5.1.

Can anyone from the team try to debug what's going wrong with encryption on the latest kernels?

Steps to repro:
1) Use a kernel >= 6.3.1
2) Have an encrypted dataset mounted on the system
3) Execute this command as root:

echo 3 > /proc/sys/vm/drop_caches

Marco.

ipaqmaster commented 1 year ago

Also experiencing this one in my own vfio script lately. The call of 3 > drop_caches and then 1 > compact_memory consistently results in a kernel oops: kernel BUG at mm/migrate.c:656! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI.

RodoMa92 commented 1 year ago

Panic as in a whole crash and reboot? I have only experienced kernel oopses, not panics, but the error is identical to mine.

ipaqmaster commented 1 year ago

Apologies, yeah, an oops, not a full panic. Editing the original comment to clarify. Though I did find many things on the system became unstable enough to warrant a reboot: various processes and logging systems stopped working or hung on calls, and launching graphical applications took much longer or hung on some other subprocess that got stuck.

RodoMa92 commented 1 year ago

Apologies, yeah, an oops, not a full panic. Editing the original comment to clarify. Though I did find many things on the system became unstable enough to warrant a reboot: various processes and logging systems stopped working or hung on calls, and launching graphical applications took much longer or hung on some other subprocess that got stuck.

Yes, the exact same things happen on my end as well. It's probably some residual corrupted state (?) in the kernel itself that's causing things to stop working.

65a commented 1 year ago

Since my issue was linked to this one, I tried your echo 3 > /proc/vm/drop_caches on 6.5.3 Arch, running zfs-2.2.99-87_g8af8d2abb1 on an AMD CPU system with AVX2 but not AVX512. Nothing bad happened, which is hopefully a helpful datapoint.

RodoMa92 commented 1 year ago

Since my issue was linked to this one, I tried your echo 3 > /proc/vm/drop_caches on 6.5.3 Arch, running zfs-2.2.99-87_g8af8d2abb1 on an AMD CPU system with AVX2 but not AVX512. Nothing bad happened, which is hopefully a helpful datapoint.

Did you have an encrypted dataset mounted when you ran this? That seems to be the triggering cause here.

My system is basically identical as far as features go.

65a commented 1 year ago

Yes, this system boots from ZFS and has an encrypted dataset. It is running nearly identical software to the system in my other issue; the only real differences are that the other system has a different pool, a Sapphire Rapids Xeon, and ECC memory, whereas this one is a Ryzen 6900HS. The pools were created at different times, but both are several years old. I wonder if your issue affects pools created after the encrypted-metadata problem I reported a long time ago?

snehring commented 1 year ago

I've had an older physical system configured for testing other ZFS-related things recently, so I figured I'd try to reproduce this today, since we have a couple of important encrypted datasets in our environment.

The host is running Fedora 38 with kernel 6.4.15 and ZFS 2.2.0-rc4. Hardware-wise it's an older Dell workstation with two Xeon E5-2643v3 CPUs (AVX2 at most) and 128G of ECC RAM.

I created an encrypted dataset, ensured it was mounted, and ran echo 3 > /proc/vm/drop_caches; it didn't crash. I then wrote some random data to the dataset and repeated the test, with no crash. I tried rebooting, mounting, writing, and repeating the echo to drop_caches without any failure.
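
Roughly how I set up the test dataset (the pool/dataset names are placeholders, not the real ones):

zfs create -o encryption=on -o keyformat=passphrase testpool/enc   # prompts for a passphrase, auto-mounts at /testpool/enc
dd if=/dev/urandom of=/testpool/enc/junk bs=1M count=1024          # write some data into the encrypted dataset before dropping caches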

I do have this host configured with mitigations=off; let me see if that makes a difference. Edit: no change.

Please let me know if I'm missing something obvious or if there's anything else you'd like me to try.

rincebrain commented 1 year ago

If it's only breaking on SPR, maybe #14989?

RodoMa92 commented 1 year ago

This is extremely odd, since it's almost deterministic on my side. Maybe you need some hugepages allocated before being able to trigger it then?

Tomorrow I'll take a look and see if I can give you a way to allocate them manually, release them, and then flush the caches, to see if that triggers it; otherwise I'm seriously confused by this.

65a commented 1 year ago

@rincebrain I think @RodoMa92 is running an AMD system, so this is something else, but I think that was the cause of my issue (and thanks for finding and fixing that!). @RodoMa92 if it helps, I did not have any hugepages configured on my system when I tested dropping the caches. If someone wants to test with fixed hugepages, https://wiki.archlinux.org/title/KVM#Enabling_huge_pages is a straightforward way to allocate some. If I manage to recover my data tonight, I'm happy to test again with this enabled.

RodoMa92 commented 1 year ago

Thanks @65a for the pointer; updated test for repro below (BTW, I'm running on a B450 chipset board with an R5 2600):

This is still an issue on the latest main 6.5.1.

Can anyone from the team try to debug what's going wrong with encryption on the latest kernels?

Steps to repro:
1) Use a kernel >= 6.3.1
2) Have an encrypted dataset mounted on the system
3) Execute these commands as root to allocate some hugepages first, then defragment memory afterwards:

echo 550 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches

Marco.

65a commented 1 year ago

I tried this on the same pool/machine as before, and still no issue in the logs or dmesg. Does your machine have a lot of memory pressure? I wonder if that is part of it? Possibly a kernel-version-related issue that got fixed by 6.5.3?
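
(If it helps with comparing, /proc/buddyinfo gives a rough picture of how fragmented free memory is; the counts are free blocks per order, per zone. Just a suggestion for narrowing things down:)

cat /proc/buddyinfo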

snehring commented 1 year ago

Same here, allocating huge pages doesn't seem to change anything. I'll try with some actual VMs as well.

igrekster commented 1 year ago

This happens before a VM even gets started. I'm also using THP, but I'm not sure whether the ZFS modules can get huge pages through that.
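
(For reference, the active THP mode can be checked like this; the bracketed entry in the output is the active mode, and the line shown is only an example:)

cat /sys/kernel/mm/transparent_hugepage/enabled
# e.g. [always] madvise never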

RodoMa92 commented 1 year ago

Maybe capping the max ARC size to 8 GB (the machine has 32 GB available) also has an influence on this? At this point I'm throwing stuff at the wall to see what sticks. I might try to retest this on the latest git + 6.5.3, but I'll need to take a backup first because of the other corruption bug @65a reported.
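
(For reference, the ARC cap here is set through the usual module option; the value is just the 8 GiB mentioned above, expressed in bytes:)

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592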

RodoMa92 commented 1 year ago

Regarding memory pressure, it's unlikely to be the cause, since I can literally reboot the OS and immediately run the VM and still hit the issue.

65a commented 1 year ago

I also just tested the cache drop alone on 6.4.10 running zfs-2.2.99-64_gcae502c175, but it's a custom-config kernel with some hardening options enabled, and it has some encrypted datasets. It is not ZFS root (my first machine was). I can't reproduce there either.

snehring commented 1 year ago

I'll keep trying things on and off, but so far I've had no luck. I have about 8G worth of huge pages allocated and assigned to a VM, ARC limited to 8G, still on 6.4.15 with 2.2.0-rc4.

RodoMa92 commented 1 year ago

Very interesting: it's gone on my end as well, also on 6.5.3. I'll test with some VM bootup and shutdown cycles, but usually just calling it would cause a crash. zfs-git + 6.5.3 seems to have fixed it somewhere?

RodoMa92 commented 1 year ago

Calling it from the script instead immediately triggers it, hmm. Odd.

RodoMa92 commented 1 year ago

Finally I have a minimal repro that works here:
1) Use a kernel >= 6.3.1
2) Have an encrypted dataset mounted on the system
3) Execute this command as root: sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory

It seems that the combination of drop_caches and compact_memory interacts somehow with ZFS + this commit and causes a kernel oops.

Hopefully with this, all of you will be able to reproduce this issue deterministically, the way I can.

Marco.

RodoMa92 commented 1 year ago

Even odder: the oops now only gets triggered the first time the command is executed; it will not oops again afterwards, and everything works fine. I'm not testing it too much, since I'm using a zvol-backed Windows drive with my KVM VM and I don't want to get too close to the bug @65a reported.

snehring commented 1 year ago

I'm still not able to reproduce this on my system. I did the following in a script:

set -e
zfs load-key store/test/enc <<< crashmebaby
zfs mount store/test/enc
mount -t zfs | grep enc
echo 550 > /proc/sys/vm/nr_hugepages
for i in {1..20}; do sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory; done
zfs umount store/test/enc
zfs unload-key store/test/enc
echo 0 > /proc/sys/vm/nr_hugepages

RodoMa92 commented 1 year ago

This is probably my dumbest hack yet, but if it works it works:

    sync
    echo 3 > /proc/sys/vm/drop_caches
    if [[ ! -f "/tmp/stupid_wait" ]]; then
        sleep 30
        touch /tmp/stupid_wait
    fi
    echo 1 > /proc/sys/vm/compact_memory

Not really a proper fix, but it will work for now.

RodoMa92 commented 1 year ago

I'm still not able to reproduce this on my system. I did the following in a script:

set -e
zfs load-key store/test/enc <<< crashmebaby
zfs mount store/test/enc
mount -t zfs | grep enc
echo 550 > /proc/sys/vm/nr_hugepages
for i in {1..20}; do sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory; done
zfs umount store/test/enc
zfs unload-key store/test/enc
echo 0 > /proc/sys/vm/nr_hugepages

I have no hecking clue why it happens on my end, then. I'll try removing zenpower, the only other out-of-tree module in my kernel, and report back shortly.

snehring commented 1 year ago

It's possible the mm weirdness doesn't come into play on the older hardware I have. I don't have anything newer I can test on at the moment. B450 is Zen 2, correct?