The timing of when that breaks makes me wonder about something related to https://lwn.net/Articles/937943/, but who knows.
Also, as an aside, can't you tell the kernel at boot time to pre-reserve 1G hugepages for you?
Not that this isn't a bug, but just as a workaround for your use case atm.
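For reference, boot-time reservation would look roughly like this (a sketch, assuming GRUB; the page count of 16 is illustrative, and the hugepage parameters are the standard kernel boot options):
# /etc/default/grub -- reserve 16 x 1G hugepages at boot, before memory can fragment
GRUB_CMDLINE_LINUX_DEFAULT="<existing options> default_hugepagesz=1G hugepagesz=1G hugepages=16"
# regenerate the config and reboot
grub-mkconfig -o /boot/grub/grub.cfg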
Yeah, sure, but then the memory is permanently locked. I'd like to keep using it when the VM is not running; I rarely use Windows these days anyway :P
For now remaining on 6.1 LTS is good enough, at least until someone can track down this issue.
I've noticed that now even just the initial call crashes the kernel; it might be related to the changes in the mm subsystem. But that's just speculation on my part, since I can't reproduce the issue with the stock kernel in the same way (not that this excludes a kernel bug either, to be clear).
Testing mainline shortly to check if they have already fixed this issue.
What do you mean, you can't reproduce it with stock? It works on vanilla 6.3.x/6.4.x but not Arch's patchset?
It works fine without zfs-dkms installed. It doesn't work with zfs-dkms installed.
Does it break with ZFS loaded and no pools imported?
Can try it shortly on my old BtrFS install; I'll finish testing mainline first :P
Well, heck, dkms doesn't build. Testing with just the module loaded now.
Interesting: without mounting my ZFS encrypted root drive, it doesn't seem to trigger. I'll do further testing tomorrow.
I can reproduce this too. Good find.
BTW I think the issue has to do with compacting memory. I can still reserve hugepages just fine.
Yeah, the trigger is definitely memory compaction, not hugepage allocation by itself. Good to know that I'm not the only one with this issue :)
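For anyone following along, the two operations can be separated with the standard procfs interfaces (a sketch; /proc/buddyinfo lists free blocks per allocation order, with the large contiguous blocks in the rightmost columns):
cat /proc/buddyinfo                      # fragmentation state before
echo 1 > /proc/sys/vm/compact_memory     # manual compaction -- the step that oopses here
cat /proc/buddyinfo                      # normally shows more high-order blocks afterwards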
Starting a bisect now; I'll post an update shortly if I find something. At least it should narrow down the possible changes.
I finally have the result of the bisection:
5dfab109d5193e6c224d96cabf90e9cc2c039884 is the first bad commit
commit 5dfab109d5193e6c224d96cabf90e9cc2c039884
Author: Huang Ying <ying.huang@intel.com>
Date: Mon Feb 13 20:34:40 2023 +0800
migrate_pages: batch _unmap and _move
In this patch the _unmap and _move stage of the folio migration is
batched. That for, previously, it is,
for each folio
_unmap()
_move()
Now, it is,
for each folio
_unmap()
for each folio
_move()
Based on this, we can batch the TLB flushing and use some hardware
accelerator to copy folios between batched _unmap and batched _move
stages.
Link: https://lkml.kernel.org/r/20230213123444.155149-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/migrate.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 189 insertions(+), 25 deletions(-)
I'll try to revert this on top of the latest kernel to see if it still works fine.
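For reference, the revert attempt would be something like this (a sketch, assuming a mainline checkout; as noted further down, it no longer applies cleanly because of later mm changes):
git revert 5dfab109d5193e6c224d96cabf90e9cc2c039884
# on newer trees this stops with conflicts in mm/migrate.c that would have to be resolved by hand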
git bisect log here
git bisect start
# status: waiting for both good and bad commits
# good: [a5c95ca18a98d742d0a4a04063c32556b5b66378] Merge tag 'drm-next-2023-02-23' of git://anongit.freedesktop.org/drm/drm
git bisect good a5c95ca18a98d742d0a4a04063c32556b5b66378
# status: waiting for bad commit, 1 good commit known
# bad: [3822a7c40997dc86b1458766a3f146d62393f084] Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 3822a7c40997dc86b1458766a3f146d62393f084
# good: [620932cd285208ef3009ac338b1eeed13ccd1753] mm/damon/dbgfs: print DAMON debugfs interface deprecation message
git bisect good 620932cd285208ef3009ac338b1eeed13ccd1753
# good: [7482c19173b7eb044d476b3444d7ee55bc669d03] selftests: arm64: Fix incorrect kernel headers search path
git bisect good 7482c19173b7eb044d476b3444d7ee55bc669d03
# good: [81ce2ebd194cf32027854ce1c703b7fd129c86b8] mm/slab.c: cleanup is_debug_pagealloc_cache()
git bisect good 81ce2ebd194cf32027854ce1c703b7fd129c86b8
# good: [65c084d848cd717d5913032dfa9e9c62ed33babd] leds: blinkm: Convert to i2c's .probe_new()
git bisect good 65c084d848cd717d5913032dfa9e9c62ed33babd
# good: [6a60dd2e876913be55e17e53ee57e1fe09448238] perf vendor events arm64: Add TLB metrics for neoverse-n2-v2
git bisect good 6a60dd2e876913be55e17e53ee57e1fe09448238
# good: [869b9eddf0b38a22c27a400e2fa849d2ff2aa7e1] mfd: intel-m10-bmc: Add PMCI driver
git bisect good 869b9eddf0b38a22c27a400e2fa849d2ff2aa7e1
# good: [45204677d427b7d0ed11930bd5be4a42893d1c93] perf symbols: Allow for .plt entries with no symbol
git bisect good 45204677d427b7d0ed11930bd5be4a42893d1c93
# good: [3a396f9859755e822775319516cd71dabc2b4e69] backlight: sky81452: Fix sky81452_bl_platform_data kernel-doc
git bisect good 3a396f9859755e822775319516cd71dabc2b4e69
# good: [a912f5975ffc82d52bbb5937eafe367d44db711c] perf test: Replace legacy `...` with $(...)
git bisect good a912f5975ffc82d52bbb5937eafe367d44db711c
# skip: [2b79eb73e2c4b362a2a261b7b2f718385fb478e4] Merge tag 'probes-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
git bisect skip 2b79eb73e2c4b362a2a261b7b2f718385fb478e4
# good: [db95818e888a927456686518880ed0145b1f20ce] perf pmu-events: Add separate metric from pmu_event
git bisect good db95818e888a927456686518880ed0145b1f20ce
# skip: [cd43b5068647f47d6936ffef4d15d99518fcab94] Merge tag 'slab-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
git bisect skip cd43b5068647f47d6936ffef4d15d99518fcab94
# good: [cf1d2ffcc6f17b422239f6ab34b078945d07f9aa] efi: Discover BTI support in runtime services regions
git bisect good cf1d2ffcc6f17b422239f6ab34b078945d07f9aa
# skip: [0df82189bc42037678fa590a77ed0116f428c90d] Merge tag 'perf-tools-for-v6.3-1-2023-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
git bisect skip 0df82189bc42037678fa590a77ed0116f428c90d
# good: [1470a108a60e8c0c4d19da10117c9b98f0078654] perf c2c: Add report option to show false sharing in adjacent cachelines
git bisect good 1470a108a60e8c0c4d19da10117c9b98f0078654
# good: [c2d3cf3653a8ff6e4b402d55e7f84790ac08a8ad] selftests: filesystems: Fix incorrect kernel headers search path
git bisect good c2d3cf3653a8ff6e4b402d55e7f84790ac08a8ad
# skip: [d8763154455e92a2ffed256e48fa46bb35ef3bdf] Merge tag 'printk-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
git bisect skip d8763154455e92a2ffed256e48fa46bb35ef3bdf
# good: [1f428356c38dcbe49fd2f1c488b41e88720ead92] rtla: Add hwnoise tool
git bisect good 1f428356c38dcbe49fd2f1c488b41e88720ead92
# bad: [f9366f4c2a29d14f5992b195e268240c2deb116e] include/linux/migrate.h: remove unneeded externs
git bisect bad f9366f4c2a29d14f5992b195e268240c2deb116e
# good: [42012e0436d44aeb2e68f11a28ddd0ad3f38b61f] migrate_pages: restrict number of pages to migrate in batch
git bisect good 42012e0436d44aeb2e68f11a28ddd0ad3f38b61f
# bad: [9325ddf90ec3a801c09da374b74532d4589a7346] m68k/nommu: add missing definition of ARCH_PFN_OFFSET
git bisect bad 9325ddf90ec3a801c09da374b74532d4589a7346
# bad: [6f7d760e86fa84862d749e36ebd29abf31f4f883] migrate_pages: move THP/hugetlb migration support check to simplify code
git bisect bad 6f7d760e86fa84862d749e36ebd29abf31f4f883
# bad: [80562ba0d8378e89fe5836c28ea56c2aab3014e8] migrate_pages: move migrate_folio_unmap()
git bisect bad 80562ba0d8378e89fe5836c28ea56c2aab3014e8
# bad: [5dfab109d5193e6c224d96cabf90e9cc2c039884] migrate_pages: batch _unmap and _move
git bisect bad 5dfab109d5193e6c224d96cabf90e9cc2c039884
# good: [64c8902ed4418317cd416c566f896bd4a92b2efc] migrate_pages: split unmap_and_move() to _unmap() and _move()
git bisect good 64c8902ed4418317cd416c566f896bd4a92b2efc
# first bad commit: [5dfab109d5193e6c224d96cabf90e9cc2c039884] migrate_pages: batch _unmap and _move
LMK how this should proceed, and to whom this needs to be reported (whether it's purely a ZFS kernel bug, or whether it can affect other parts of the kernel and needs to be reported directly to the Linux developers).
Yeah, unfortunately too many changes to the mm subsystem have been applied since, so I really can't easily revert that commit on top of the latest Linux kernel. Please let me know how this should proceed.
This should be reported as a kernel bug, no? Especially as this provides a simple reproducer. Does this occur with the 6.5-rc kernels too?
Can't reproduce it without ZFS on an encrypted root, so I can't prove it's a kernel regression. Highly likely, if you ask me, but that's just my opinion at this point.
Just tested 10 bootup and shutdown cycles with ZFS loaded, not on root, and a (non-encrypted) dataset imported, and this didn't cause any issues. If someone from OpenZFS could take a look at why calling drop_caches causes a kernel oops with ZFS on encrypted root, it would be appreciated.
It also happens on a non-encrypted root, but with an encrypted dataset present. So it looks like it is related to encryption.
Thanks a lot for the report, this at least narrows down the area.
I have encrypted datasets too and can confirm this happens.
This is still an issue on the latest main 6.5.1.
Can anyone from the team try to debug what's going wrong with encryption on the latest kernels?
Steps to repro:
1) Use a kernel >= 6.3.1
2) Have an encrypted dataset mounted on the system
3) Execute this command as root:
echo 3 > /proc/sys/vm/drop_caches
Marco.
Also experiencing this one in my own vfio script lately. Calling
echo 3 > /proc/sys/vm/drop_caches
then
echo 1 > /proc/sys/vm/compact_memory
consistently results in a kernel oops:
kernel BUG at mm/migrate.c:656!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Panic, as in a whole crash and reboot? I've only experienced kernel oopses, not panics, but the error is identical to mine.
Apologies, yeah, an oops, not a full panic; editing the original comment to clarify. Though I did find many things on the system became unstable enough to warrant a reboot: various processes and logging systems stopped working or hung on calls, and launching graphical applications took much longer or hung on some other stuck subprocess.
Yes, the exact same things happen on my end as well. It's probably some residual corrupted state (?) in the kernel itself that's causing stuff to stop working.
Since my issue was linked to this one, I tried your
echo 3 > /proc/sys/vm/drop_caches
on 6.5.3 arch, running zfs-2.2.99-87_g8af8d2abb1, on an AMD CPU system with AVX2 but not AVX512. Nothing bad happened, which is hopefully a helpful datapoint.
Did you have an encrypted dataset mounted when you ran this? That seems to be the triggering cause here.
My system is basically identical as far as features go.
Yes, this system boots from ZFS and has an encrypted dataset. It is running nearly identical software to the system in my other issue; the only real differences are that the other system has a different pool, a Sapphire Rapids Xeon, and ECC memory, whereas this one is a Ryzen 6900HS. The pools were created at different times, but both are several years old. I wonder if your issue affects pools created after the encrypted-metadata problem I reported a long time ago?
I've had an older physical system configured for testing other ZFS-related things recently, so I figured I'd try to reproduce this today, since we have a couple of important encrypted datasets in our environment.
The host is running Fedora 38 with kernel 6.4.15 and zfs 2.2.0-rc4. Hardware-wise it's an older Dell workstation with two Xeon E5-2643 v3 CPUs (AVX2 at the latest) and 128G of ECC RAM.
I created an encrypted dataset, ensured it was mounted, and ran
echo 3 > /proc/sys/vm/drop_caches
and it didn't crash. I then wrote some random data to the dataset and repeated it, with no crash. Tried rebooting, mounting, writing, and repeating the echo to drop_caches without any failure.
I do have this host configured with mitigations=off; let me see if that makes a difference. Edit: no change.
Please let me know if I'm missing something obvious or if there's anything else you'd like me to try.
If it's only breaking on SPR, maybe #14989?
This is extremely odd, since it's almost deterministic on my side. Maybe you need some hugepages allocated before you're able to trigger it, then?
I'll take a look tomorrow at giving you a way to allocate them manually, release them, and then flush the caches to see if that triggers it; otherwise I'm seriously confused by this.
@rincebrain I think @RodoMa92 is running an AMD system, so this is something else, but that did cause my issue, I think (and thanks for finding and fixing that!). @RodoMa92 if it helps, I did not have any hugepages configured on my system when I tested dropping the caches. If someone wants to test with fixed hugepages, https://wiki.archlinux.org/title/KVM#Enabling_huge_pages is a straightforward way to allocate some. If I manage to recover my data tonight, I'll be glad to test again with this enabled.
Thanks @65a for the pointer; updated test for repro below (BTW, I'm running on a B450 chipset with an R5 2600):
Steps to repro:
1) Use a kernel >= 6.3.1
2) Have an encrypted dataset mounted on the system
3) Execute these commands as root, to allocate some hugepages first and then defragment memory afterwards:
echo 550 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
Marco.
I tried this on the same pool/machine as before, and still no issue in the logs or dmesg. Does your machine have a lot of memory pressure? I wonder if that is part of it. Possibly a kernel-version-related issue that got fixed by 6.5.3?
Same here; allocating hugepages doesn't seem to change anything. I'll try with some actual VMs as well.
This happens before a VM gets started. I'm also using THP, but I'm not sure whether the ZFS modules can get hugepages through that.
Maybe capping the max ARC size to 8 GB (the machine has 32 GB available) also has an influence on this? At this point I'm throwing stuff at the wall to see what sticks. I might retest this on the latest git + 6.5.3, but I'll need to take a backup first, given the other corruption bug reported by @65a.
As for memory pressure, it's quite unlikely to be the cause, since I can literally reboot the OS and run the VM to get the issue.
I also tested just the cache drop on 6.4.10 running zfs-2.2.99-64_gcae502c175, but it's a custom-config kernel with some hardening options enabled, and some encrypted datasets. It is not ZFS root (my first machine was). I can't reproduce it there either.
I'll keep trying things on and off, but so far I've had no luck. I have about 8G worth of hugepages allocated and assigned to a VM, ARC limited to 8G, still 6.4.15 on 2.2.0-rc4.
Veeery interesting: it's gone on my end as well on 6.5.3. I'll test with some VM bootup and shutdown cycles, but usually just calling it would cause a crash. zfs-git + 6.5.3 seems to have fixed it somewhere?
Calling it from the script instead immediately triggers it, hmm. Odd.
Finally I have a minimal repro that works here:
1) Use a kernel >= 6.3.1
2) Have an encrypted dataset mounted on the system
3) Execute this command as root:
sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
It seems that the combination of drop_caches and compact_memory interacts somehow with ZFS plus this commit and causes a kernel oops.
Hopefully with this, all of you will be able to reproduce the issue deterministically, like I can.
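For anyone checking whether the repro above fired, the oops lands in the kernel log (a sketch; the mm/migrate.c line number may differ between kernel versions):
sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
dmesg | grep -F 'kernel BUG at mm/migrate.c'    # non-empty output means the bug was hit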
Marco.
Even odder: the oops now only gets triggered the first time the command is executed; it will not oops again afterwards, and everything works fine. I'm not testing it too much, since I'm using a zvol-backed Windows drive with my KVM VM and don't want to get too close to the bug @65a reported.
I'm still not able to reproduce this on my system. I did the following in a script:
set -e
# load the key and mount the encrypted test dataset
zfs load-key store/test/enc <<< crashmebaby
zfs mount store/test/enc
mount -t zfs | grep enc    # confirm it is actually mounted
# reserve some hugepages, then hammer the drop_caches + compact_memory combination
echo 550 > /proc/sys/vm/nr_hugepages
for i in {1..20}; do sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory; done
# clean up: unmount, unload the key, release the hugepages
zfs umount store/test/enc
zfs unload-key store/test/enc
echo 0 > /proc/sys/vm/nr_hugepages
This is probably my dumbest hack yet, but if it works, it works:
sync
echo 3 > /proc/sys/vm/drop_caches
# only on the first run after boot, wait 30s between dropping caches and compacting;
# the marker file makes every later run skip the sleep
if [[ ! -f "/tmp/stupid_wait" ]]; then
    sleep 30
    touch /tmp/stupid_wait
fi
echo 1 > /proc/sys/vm/compact_memory
Not really a proper fix, but it will work for now.
I have no hecking clue why it happens on my end, then. I'll try removing zenpower, the only other out-of-tree module in my kernel, and report back shortly.
It's possible the mm weirdness doesn't come into play on the older hardware I have. I don't have anything newer I can test on at the moment. B450 is Zen 2, correct?
System information
Describe the problem you're observing
While executing the prepare script for a QEMU virtual machine on kernel versions from 6.3.1 up to the latest 6.4.7, the script crashes the kernel with the following stack trace (this log is from a crash on 6.3.9, but I have tested both extremes above and the error is almost the same as below):
I still had my previous installation on top of BtrFS available from before switching to ZFS as root, so I could test the same thing under another filesystem. This does not happen there on the latest kernel.
I'm using a zvol as the backing store for the VM; the libvirt XML is also attached below, with minor fields redacted.
Describe how to reproduce the problem
1) Use a kernel >= 6.3.1
2) Load the VirtualBox kernel drivers (vboxdrv vboxnetadp vboxnetflt)
3) Have an encrypted dataset mounted on the system and a zvol created on that filesystem (not sure if both are needed as a precondition)
4) Execute this command as root:
sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
5) Crash is triggered
It seems that the combination of drop_caches and compact_memory, interacting somehow with ZFS plus this commit and the loaded VirtualBox drivers, causes a kernel oops. I'm using the above to manage hugepage allocation for my VMs.
On anything older than 6.3 the code works perfectly fine and I can boot up the VM every time with no issues. I'm currently using 6.1 LTS, and I have no problems with the VM itself.
The above is needed to be able to allocate 1 GB hugepages correctly; otherwise, after the first bootup, the memory is too fragmented to allocate 1 GB chunks without compacting it first, and the VM fails to boot properly. This sometimes causes host system instability and problems with the host shutting down cleanly.
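For context, the prepare hook boils down to something like this (a sketch, assuming the standard sysfs hugepage interface; the page count of 16 is illustrative):
sync
echo 3 > /proc/sys/vm/drop_caches      # drop page cache so compaction has room to work
echo 1 > /proc/sys/vm/compact_memory   # defragment physical memory into contiguous blocks
# reserve 16 x 1G hugepages for the VM; the count comes up short if memory is still fragmented
echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages    # verify the reservation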
qemu.tar.gz GamingWin11.tar.gz