Open sempervictus opened 3 years ago
As ZSTD-19 is NOT very compatible with low to medium-end systems, please share your systems specs.
Also: Be aware that ZSTD update might make your dedupe less efficient and ZSTD will be kept up-to-date, hence ZSTD is not adviced to be used with deduplication. But that should not lead to crashes like this.
System's a test VM, 4GB of RAM, 4 KVM cores. Regardless how memory-constrained a system is, this shouldn't happen. Kinda dampens the whole "portability" premise :). If ZFS permits this sort of resource starvation, there are any number of potential DoS-style attacks which could be carried out against the compression implementation with potentially adverse outcomes. The fact that it happens after many interations though does lead me to wonder - is there a memory leak somewhere?
ZSTD-19 is NOT supposed to be used for general use and only reasonable for VERY specific usecases.
Your system is not capable of running ZSTD-19 on any reasonably heavy workload. Just like it isn't capable of running dedupe on a reasonably big workload. You run a compression level that REQUIRES an extreme amount of ram. Running into issues is to be expected...
TLDR: ZSTD-19 is a very-VERY hungry beast, not to be used unless you know what you are doing.
However: There might be a bug besides it being an extremely edgecase setup: What I think is happening is it's spawning new memory pools before the old ones are removed, which enhances the inherent memory consumption issues native to ZSTD-19. Any thoughts @BrainSlayer ?
That is dangerous - an attacker can create or modify a zpool in a different context of privilege (VM/container/separate host) which you the user have no choice but to import with the setting they applied. So an attacker from a 16G domU can make a storage pool which will merc the 4G dom0. This is no different from a file-format attack, just an fs format one.
As I said: It's not "just" your system being incapable of running ZSTD-19, there might also be a bug making the issue worse.
If you, as a system administrator, have no other choice than load an unknown volume on a production machine without precautions., you need to call the Police not OpenZFS.
But like I said: It's not just ZSTD-19, there seems to be an additional factor at play.
zstd-19 is very slow and takes about 20 - 30 megs ram per thread. a typical 4 core system (with hyperthreading enabled) uses 32 threads. so typical task hanging warnings by the kernel are normal. if memory allocation fails zstd compression will be disabled for the requested block. so i'm uncertain why it ends in a page fault in your case. but i'm sure its a problem with your memory contrained test scenario and it can be also a problem by memory virtualisation in QEMU. so qemu returns success for the requested allocation but its not present in reality. qemu does virtualize the memory allocations. so the test case is not valid for my oppinion. its a emulator and no real system. but its not dangerous if a attacker send you a bad export. you cannot import or change properties without root privileges. and the trustlevel for such a file is on user side. its like you receive a trojan horse as email attachement. its up to you if you run it or not
@BrainSlayer I think using this would retry the memory allocation (which gets disabled) so often, it doesn't have the chance to get rid of space used by any previous memory pool threads:
"while true do; (rsync -a /usr/*
So as far as you the system administrator preventing import of pools, you can't delegate ability to import a pool based on the metadata of its datasets, only the IDs of the subject. Secondly, you can't stop systemd from trying to import a zpool on-boot in AWS for instance - no bootloader access to preempt it and rescue target the mess. Lastly, you cannot know what was done to the pool from another context until you import it. Basically this is a time bomb, and DoS of a storage plane is a really bad kind of DoS, doubly so if it results in the data being inaccessible. If the function is known to be dangerous, experimental, and not for use today, why in the world is it an option to create/configure a pool that could be incompatible in certain contexts when the whole idea of OpenZFS is portability? Saying virtual machines arent real systems is also... inaccurate? If it has a kernel, its an OS. Cisco ASA, Juniper SRX, etc are all VMs by the way - purchasing departments seem to qualify them as real systems globally. Given the potential instability of this and that with every day more and more people might be trying to create archival datasets with the maximum possible compression, it might be prudent to pull the breaking levels out of the tagged release so consumers dont learn this the hard way after putting their real data in there.
Maybe stop ranting and start listening?
I told you that ZSTD-19 is too heavy for your hardware, but that your problems are NOT only caused by it.
I verified one of my working theories with @BrainSlayer what could be the cause and he confirmed the code should pagefault and OOM issues and he also is not sure what is causing the issue.
if memory allocation fails zstd compression will be disabled for the requested block. so i'm uncertain why it ends in a page fault in your case
Both me and @BrainSlayer where trying to find the problem you are having and by getting all worked up about our theorycrafting, you are making the job insanely hard for us.
the breaking levels
There is no "the breaking levels", your bug is not specific to a certain level. Nor do we (at the moment) even know what is exactly causing your issue. Can we stop throwing a fuss and work together on solving the issue instead, please?
Simply put your issues seems to be a combination of 4 (or more) factors:
I was consulting with Brainslayer wheither the first 3 alone would be enough to cause this and he doesn't think it does, because it should just not-compress the data (which is actually tested seperately at the end of developement, by manually triggering all allocation failure failover options). Which brings us back to the "Currently Unknown Cause", which might even be triggerable on lower zstd-levels.
@BrainSlayer Can we put a check into ZSTD that one need to have atleast 2x the amount of predicted memory consumtion of a ZSTD-level before it can be enabled? (in which case it would fall back to zstd-3)
I don't think thats the root cause of this issue, but it does prevent the endless flow of support calls about people trying ZSTD-19 on low-end hardware or complain about there being even an option to do so...
@Ornias1993 thats hard to check. i mean free memory is not free memory. there is cached and buffers which are available too for allocation. so definition of free memory is, if allocation fails nothing is free. linux does reclaim memory on demand if required. if we use atomic allocations with kmalloc again. but this will fail in 80% at all allocation attempts for big memory blocks since atomic is defined to return memory in a fast time without reclaiming. from my point of view the qemu environment is abstracting memory allocations to local virtual memory. this may include swapping. this is not the case for real kernel environments which uses hard constrained memory allocations. so vmalloc will fail if there is nothing available. but qemu wont fail and just pretends that memory is available, even if not. normally the memory pool will only contain guaranteed available memory which is direct accessible
@BrainSlayer I was actually refering to total-system memory, not just free... Because, indeed, free memory would be dynamic. The question is where in the compression chain would we be able to do so?
Lets assume for the moment the issue isn't qemu related (I would be able to get some hardware going, with specifications similair to this issue, if that ends up to be our only remaining option in a few days)...
@Ornias1993: There's no fuss, there's a stated concern about having release code which can do this out to the public right now, compounded by the fact that level 19 is not supposed to be used while being presented to users. A DoS is a security problem, and security workflows cut off the potential access vectors to the bug first and then fix it while attackers cant get to it (in this case our most likely adversary is a user accidentally moving their entire life's backup into a dataset configured this way and having it go corrupt, but the reds are probably reading this - issue trackers are free targeting guides :) )
Agreed that it needs diagnostic and redress of the root cause, simply saying that while the bug is known to exist, a quick patch to prevent users setting any level above 14 or whatever's considered "safe for all consumers" should go into the release branch as a safety measure for users installing from DKMS RPMs or whatever their process is for this. Time == faulty installs which may == lost data/time/money. We work in enterprise space where something like this being left in the release code does not help build confidence among those people for FOSS.
@BrainSlayer: there is no swap in play anywhere, host system is all ZFS with 64G on a gen8 i9HK, point was to run constrained.
I had tested the ZSTD with the kvmalloc code you'd originally pushed to the 0.8.x series and dont recall what the maximum compression was in it, but i dont remember it hitting anything like this. We tend to run these "exhaust the space and try to break it" tests before we upgrade production pools to see if there are edge cases ZTS doesn't address (like leaky space maps). Since its kernel code, testing in containers is rough, and test metal needs a full offload of all other test workloads to run this on it. I can stand up a physical test host over the break if it gets that far, but for now at least the most easily available environment is KVM.
I re-ran the test at zstd=9
for 30m with no crash, so thats a win. Running at 14 currently, will bump it to 16 next if this doesnt die miserably.
There's no fuss, there's a stated concern about having release code which can do this out to the public right now.
There is a Bug and you already have 2 people working on the cause. So APPARANTLY your issue is taken serieusly. Stop the fearmongering, we don't even know for sure it is an issue spread beyond just your system.
compounded by the fact that level 19 is not supposed to be used while being presented to users.
Don't start selectively quoting me. Thats not what I said.
I Said:
ZSTD-19 is NOT supposed to be used for general use and only reasonable for VERY specific usecases.
What I mean was: The same conditions apply with ZSTD-19 (and other High zstd levels) that are implace for Deduplication. Nothing more, nothing less. Deduplication is also not supposed to be used for general use and only reasonable in VERY specific usecases.
A DoS is a security problem
Which you have not proven as of yet. It might or might not be anything serieus and it might or might not be possible to turn this into a DDoS for... A RaspberryPi. Chill out.
and security workflows cut off the potential access vectors to the bug first
Security workflows also don't do that when the first dude shouts (and nothing more) DDoS on a forum. We don;t even know what the vector IS. Potential access vector might not even BE ZSTD.
in this case our most likely adversary is a user accidentally moving their entire life's backup into a dataset configured this way and having it go corrupt, but the reds are probably reading this - issue trackers are free targeting guides
Reds, please be informed: if people have inherently idiotic compression levels that are not compatible with their system AND are running a full on the equivalent of a raspberrypi AND are executing your writes generating script... You might be in luck. ohh and it might not even be these requirements that caused this issue in the first place. Have a nice day hacking.
Agreed that it needs diagnostic and redress of the root cause, simply saying that while the bug is known to exist, a quick patch to prevent users setting any level above 14 or whatever's considered "safe for all consumers" should go into the release branch as a safety measure for users installing from DKMS RPMs or whatever their process is for this.
Okey, Please disable:
And cherry on top: WE DO NOT KNOW WHAT IS CAUSING THIS. (yet) It might not even be ZSTD related and we don't even know what levels are and are not affected.
I'm not going to continue discussing your ongoing rampage about a possible DDOS. I rather bugtrace your issue to actually figure out what is the cause of this, which might (HINT) not even be ZSTD or even ZFS. We currently don't even know if the bug is an issue in code or related to your specific system.
So please supply the following information:
I had tested the ZSTD with the kvmalloc code you'd originally pushed to the 0.8.x series and dont recall what the maximum compression was in it, but i dont remember it hitting anything like this.
Yeah, thats because actually a lot of content didn't get compressed at all I think, or the failover allocation code that was added afterwards... Or another (thought to be) unrelated change...
edit If 16 hapens to be okey, i'm open to discussing 3 options:
In all cases: We should document two things: 1.A minimum recommended system requirement (i'm amazed there is not one ram advice in the docs)
@sempervictus I've come across some changes upstream to the relevant codepath, you might wanna try my 1.4.8 PR-build just to be sure (#11367)
@Ornias1993: There's no fuss, there's a stated concern about having release code which can do this out to the public right now, compounded by the fact that level 19 is not supposed to be used while being presented to users. A DoS is a security problem, and security workflows cut off the potential access vectors to the bug first and then fix it while attackers cant get to it (in this case our most likely adversary is a user accidentally moving their entire life's backup into a dataset configured this way and having it go corrupt, but the reds are probably reading this - issue trackers are free targeting guides :) ) Agreed that it needs diagnostic and redress of the root cause, simply saying that while the bug is known to exist, a quick patch to prevent users setting any level above 14 or whatever's considered "safe for all consumers" should go into the release branch as a safety measure for users installing from DKMS RPMs or whatever their process is for this. Time == faulty installs which may == lost data/time/money. We work in enterprise space where something like this being left in the release code does not help build confidence among those people for FOSS. @BrainSlayer: there is no swap in play anywhere, host system is all ZFS with 64G on a gen8 i9HK, point was to run constrained. I had tested the ZSTD with the kvmalloc code you'd originally pushed to the 0.8.x series and dont recall what the maximum compression was in it, but i dont remember it hitting anything like this. We tend to run these "exhaust the space and try to break it" tests before we upgrade production pools to see if there are edge cases ZTS doesn't address (like leaky space maps). Since its kernel code, testing in containers is rough, and test metal needs a full offload of all other test workloads to run this on it. I can stand up a physical test host over the break if it gets that far, but for now at least the most easily available environment is KVM. I re-ran the test at
zstd=9
for 30m with no crash, so thats a win. Running at 14 currently, will bump it to 16 next if this doesnt die miserably.
you missed one point. there is host involved virtual memory. its running under QEMU and not with a native system
@BrainSlayer Agreed, that would be next test step: So please supply the following information:
i can just say that i tested zstd-19 with a 512mb memory constrained device with no issues. but it was real hardware. not a emulator
i can just say that i tested zstd-19 with a 512mb memory constrained device with no issues. but it was real hardware. not a emulator
Yeah... We both did a lot of contrained testing, also tested the failback allocators one by one by force...
Did you test the latest allocator change constrained though?
So far i've cycled through 9,13,16, and currently running 18. The next tests will be conducted at 8 cores available to the VM with 1G of memory to try and induce the behavior. GZIP has never caused these issues in the past - the test was run for gzip-9 as part of the 2.0 testing The VM storage is a virtio-scsi block device mapped to a qcow2 on the host's ZFS pool on a 2T nvme. Stats are showing its clearly cpu-bound - no iowait to speak of, kernel threads eating up 85%+ of CPU. I'll track the VM's memory usage, from the host. I will test on hardware when i have time to do so, which probably will not be until the weekend.
@sempervictus Awesome, thats pretty similair as to how we did our tests a few months ago 👍
The actual error at the top of the message indicates an access boundary violation trying to hit that address, which manifests as a page fault. The address is on CR2 in the stack trace. You guys are much more familiar with the code - any chance something is being derefenced from the wrong place and used as a pointer?
Did you test the latest allocator change constrained though?
the test results where the start for developing the fallback handlers since i was running in allocation deadlocks and i retested it all the time while developing. i'm also running zstd on embedded systems (wifi routers) with even lower memory constraints still today
Someone much more expert than me in these matters (Brad Spengler) noted that this is being caused by a null in ZSTD_matchState_t
being treated as a pointer.
I'm building a debug kernel now to try and track down states right before this occurs. He also mentioned that the functions involved have changed in 1.4.7 so this might already be fixed upstream. Whats the plan for ZSTD version bumps? i gather thats why dedup+zstd isn't a great idea, but if backports are happening from upstream will the significant changes be logged in this repo or is it planned as more of "we bumped to vX.Y, please read their changelog, and here's how we deviated from them?"
@sempervictus You can check my 1.4.8 PR to make sure it's fixed, but indeed... the upstream code has been evolved quite a bit.
If there are any relevant highlights, I tried to add them to the changelog on our PR as well. I only noticed the speed improvement on smaller blocksizes though. But if this bug is indeed solved by upstream, I'll make note in the Update PR... In theory we did leave room for handling breaking changes in the ZSTD library, but indeed... small changes mean that new blocks might be different than the current blocks in your dedupe table.
@Ornias1993: 1.4.8 appears to fix the problem - thank you. 3h of running the original test scenario with nothing adverse in dmesg. Tests still running, zloop running in parallel for cross coverage on the same VM. Do the buildbots have memory and other sanitizer builds to provide more generic coverage for stuff like this? Stress testling like what i'm doing currently can show some issues, but given the complexity of ZSTD coupled with the fact that its running at the same tier as the core memory allocator, it might be handy to have sanitizers watch it get fuzzed for coverage of operating conditions and data inputs - are there ways for instance to internally "zip-bomb" a compressor/decompressor implementation given the caveats of running in ring0 under additional pressures?
So close. Adding zloop to the mix seems to have raised enough pressure to trip it again:
Dec 22 23:10:15 svl-arch00 kernel: BUG: unable to handle page fault for address: ffffffffffffff24
Dec 22 23:10:15 svl-arch00 kernel: #PF: supervisor write access in kernel mode
Dec 22 23:10:15 svl-arch00 kernel: #PF: error_code(0x0002) - not-present page
Dec 22 23:10:15 svl-arch00 kernel: PGD 4040c067 P4D 4040c067 PUD 4040e067 PMD 0
Dec 22 23:10:15 svl-arch00 kernel: Oops: 0002 [#1] SMP
Dec 22 23:10:15 svl-arch00 kernel: CPU: 2 PID: 420 Comm: z_wr_iss Tainted: G T 5.4.85 #1
Dec 22 23:10:15 svl-arch00 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
Dec 22 23:10:15 svl-arch00 kernel: RIP: 0010:zfs_ZSTD_updateStats+0x67/0xc0
Dec 22 23:10:15 svl-arch00 kernel: Code: 48 39 d6 75 ec 43 8d 04 36 01 45 30 41 83 fe 3f 76 60 45 0f bd f6 41 83 c6 13 48 8b 45 08 41 83 c5 01 41 83 ec 03 45 0f bd ed <42> 83 04 b0 01 48 8b 45 18 4d 63 ed 83 45 34 01 42 83 04 a8 01 83
Dec 22 23:10:15 svl-arch00 kernel: RSP: 0000:ffffaaa683de3880 EFLAGS: 00010202
Dec 22 23:10:15 svl-arch00 kernel: RAX: ffffffffffffff20 RBX: ffffaaa69597d86c RCX: 0000000000000542
Dec 22 23:10:15 svl-arch00 kernel: RDX: ffffaaa69597d86d RSI: ffffaaa69597d86d RDI: ffff9f1c2062d3b0
Dec 22 23:10:15 svl-arch00 kernel: RBP: ffff9f1c2062d3b0 R08: 0000000000000004 R09: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: R10: ffff9f1c2062d288 R11: 0000000000000004 R12: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: R13: 000000000000000b R14: 0000000000000001 R15: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: FS: 0000000000000000(0000) GS:ffff9f1c3bb00000(0000) knlGS:0000000000000000
Dec 22 23:10:15 svl-arch00 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 22 23:10:15 svl-arch00 kernel: CR2: ffffffffffffff24 CR3: 00000001192a4006 CR4: 0000000000360ee0
Dec 22 23:10:15 svl-arch00 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 22 23:10:15 svl-arch00 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 22 23:10:15 svl-arch00 kernel: Call Trace:
Dec 22 23:10:15 svl-arch00 kernel: zfs_ZSTD_compressBlock_btultra2+0x15ba/0x4bf0
Dec 22 23:10:15 svl-arch00 kernel: zfs_ZSTD_compressBlock_internal+0x1d1/0x4d0
Dec 22 23:10:15 svl-arch00 kernel: zfs_ZSTD_compressContinue_internal+0x667/0x9b0
Dec 22 23:10:15 svl-arch00 kernel: zfs_ZSTD_compressEnd+0x35/0x380
Dec 22 23:10:15 svl-arch00 kernel: ZSTD_compressStream_generic+0x42e/0x4f0
Dec 22 23:10:15 svl-arch00 kernel: zfs_ZSTD_compress2+0xe3/0x280
Dec 22 23:10:15 svl-arch00 kernel: zfs_zstd_compress+0xdd/0x170
Dec 22 23:10:15 svl-arch00 kernel: zio_compress_data+0xc1/0xf0
Dec 22 23:10:15 svl-arch00 kernel: zio_write_compress+0x44b/0x8f0
Dec 22 23:10:15 svl-arch00 kernel: ? __wake_up_common_lock+0x86/0xb0
Dec 22 23:10:15 svl-arch00 kernel: zio_execute+0x8a/0x130
Dec 22 23:10:15 svl-arch00 kernel: taskq_thread+0x2ce/0x510
Dec 22 23:10:15 svl-arch00 kernel: ? try_to_wake_up.isra.0+0x390/0x390
Dec 22 23:10:15 svl-arch00 kernel: ? zio_destroy+0xe0/0xe0
Dec 22 23:10:15 svl-arch00 kernel: ? taskq_thread_spawn+0x60/0x60
Dec 22 23:10:15 svl-arch00 kernel: kthread+0x10f/0x130
Dec 22 23:10:15 svl-arch00 kernel: ? kthread_associate_blkcg+0x90/0x90
Dec 22 23:10:15 svl-arch00 kernel: ret_from_fork+0x1f/0x40
Dec 22 23:10:15 svl-arch00 kernel: Modules linked in: intel_rapl_msr intel_rapl_common intel_pmc_core_pltdrv intel_pmc_core kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel joydev rapl mousedev qxl ttm drm_kms_helper evdev fb_sys_fops input_leds syscopyarea uio_pdrv_genirq led_class sysfillrect uio psmouse pcspkr mac_hid i2c_piix4 sysimgblt qemu_fw_cfg button sch_fq_codel drm fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 ata_generic pata_acpi virtio_net net_failover virtio_balloon virtio_blk failover serio_raw atkbd libps2 crc32_pclmul crc32c_intel uhci_hcd aesni_intel ehci_pci ehci_hcd glue_helper crypto_simd cryptd usbcore virtio_pci usb_common ata_piix floppy i8042 serio
Dec 22 23:10:15 svl-arch00 kernel: CR2: ffffffffffffff24
Dec 22 23:10:15 svl-arch00 kernel: ---[ end trace 0f2e9221cf73c222 ]---
Dec 22 23:10:15 svl-arch00 kernel: RIP: 0010:zfs_ZSTD_updateStats+0x67/0xc0
Dec 22 23:10:15 svl-arch00 kernel: Code: 48 39 d6 75 ec 43 8d 04 36 01 45 30 41 83 fe 3f 76 60 45 0f bd f6 41 83 c6 13 48 8b 45 08 41 83 c5 01 41 83 ec 03 45 0f bd ed <42> 83 04 b0 01 48 8b 45 18 4d 63 ed 83 45 34 01 42 83 04 a8 01 83
Dec 22 23:10:15 svl-arch00 kernel: RSP: 0000:ffffaaa683de3880 EFLAGS: 00010202
Dec 22 23:10:15 svl-arch00 kernel: RAX: ffffffffffffff20 RBX: ffffaaa69597d86c RCX: 0000000000000542
Dec 22 23:10:15 svl-arch00 kernel: RDX: ffffaaa69597d86d RSI: ffffaaa69597d86d RDI: ffff9f1c2062d3b0
Dec 22 23:10:15 svl-arch00 kernel: RBP: ffff9f1c2062d3b0 R08: 0000000000000004 R09: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: R10: ffff9f1c2062d288 R11: 0000000000000004 R12: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: R13: 000000000000000b R14: 0000000000000001 R15: 0000000000000001
Dec 22 23:10:15 svl-arch00 kernel: FS: 0000000000000000(0000) GS:ffff9f1c3bb00000(0000) knlGS:0000000000000000
Dec 22 23:10:15 svl-arch00 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 22 23:10:15 svl-arch00 kernel: CR2: ffffffffffffff24 CR3: 00000001192a4006 CR4: 0000000000360ee0
Dec 22 23:10:15 svl-arch00 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 22 23:10:15 svl-arch00 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 000000000000040
Do the buildbots have memory and other sanitizer builds to provide more generic coverage for stuff like this? Stress testling like what i'm doing currently can show some issues, but given the complexity of ZSTD coupled with the fact that its running at the same tier as the core memory allocator, it might be handy to have sanitizers watch it get fuzzed for coverage of operating conditions and data inputs - are there ways for instance to internally "zip-bomb" a compressor/decompressor implementation given the caveats of running in ring0 under additional pressures?
I don't think it does, but we might actually want to add your scenario as a test... Could you also supply an example of your "Adding zloop to the mix" scenario? @behlendorf This is your department I think
So close. Adding zloop to the mix seems to have raised enough pressure to trip it again: Well, it at least seems to be a different code path this time and takes more effort... Which is a good thing!
@Cyan4973 and @c0d3z3r0 Can this be upstream related, or more like: Our codepath related?
@Ornias1993, @behlendorf: the test scenario is
hile true; do date; (rsync -a /usr/* /testpool/) || (rm -rf /testpool/*) ; done
)@sempervictus Maybe a weird question, mind testing without sha512? Maybe there is some weird correlation there... Because thats one of the things we didn't explicitly test with low-memory testing...
@Ornias1993: sure, building with debug symbols to make tracking a bit easier. Just to double-check - are we sure kmemleak and friends can even instrument a non-GPL component?
are we sure kmemleak and friends can even instrument a non-GPL component?
It depends on the checker/sanitizer. When a non-GPL kmod is loaded the lock profiler is one of things which is automatically disabled for the entire kernel. However, kernel build options like kasan and kmemleak work fine.
Speaking of kmemleak we do use this in the CI and haven't seen any zstd related memory leaks. And from what I saw above that jibes with what you're seeing. The kernel BUG here is indicates there was no page at the address which needed to be faulted in. I'd suggest building the debug kernel with the kernel address sanitizer enabled and see if that detects anything. It wouldn't be a bad idea to enable CONFIG_DEBUG_SLAB
and CONFIG_DEBUG_PAGEALLOC
either. That'll enable redzones and poisoning plus unmapping of freed pages.
These changes do make another change that might actually be obfuscating the issue.
@@ -12,6 +12,7 @@ CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
+CONFIG_CONSTRUCTORS=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y
@@ -283,6 +284,7 @@ CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
+CONFIG_KASAN_SHADOW_OFFSET=0xdffffc0000000000
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
@@ -832,7 +834,6 @@ CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_64BIT_TIME=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
-CONFIG_VMAP_STACK=y
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
@@ -7620,7 +7621,6 @@ CONFIG_USB_IDMOUSE=m
CONFIG_USB_FTDI_ELAN=m
CONFIG_USB_APPLEDISPLAY=m
CONFIG_USB_SISUSBVGA=m
-CONFIG_USB_SISUSBVGA_CON=y
CONFIG_USB_LD=m
CONFIG_USB_TRANCEVIBRATOR=m
CONFIG_USB_IOWARRIOR=m
@@ -10178,16 +10178,12 @@ CONFIG_LSM="lockdown,yama,loadpin,safesetid,integrity"
#
# Kernel hardening options
#
-CONFIG_GCC_PLUGIN_STRUCTLEAK=y
#
# Memory initialization
#
-# CONFIG_INIT_STACK_NONE is not set
+CONFIG_INIT_STACK_NONE=y
# CONFIG_GCC_PLUGIN_STRUCTLEAK_USER is not set
-CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF=y
-# CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL is not set
-# CONFIG_GCC_PLUGIN_STRUCTLEAK_VERBOSE is not set
CONFIG_GCC_PLUGIN_STACKLEAK=y
CONFIG_STACKLEAK_TRACK_MIN_SIZE=100
# CONFIG_STACKLEAK_METRICS is not set
@@ -10625,7 +10621,11 @@ CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
#
# Compile-time checks and compiler options
#
-# CONFIG_DEBUG_INFO is not set
+CONFIG_DEBUG_INFO=y
+# CONFIG_DEBUG_INFO_REDUCED is not set
+# CONFIG_DEBUG_INFO_SPLIT is not set
+# CONFIG_DEBUG_INFO_DWARF4 is not set
+CONFIG_GDB_SCRIPTS=y
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=2048
CONFIG_STRIP_ASM_SYMS=y
@@ -10650,15 +10650,20 @@ CONFIG_DEBUG_KERNEL=y
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
-# CONFIG_DEBUG_PAGEALLOC is not set
+CONFIG_DEBUG_PAGEALLOC=y
+CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
# CONFIG_PAGE_OWNER is not set
# CONFIG_PAGE_POISONING is not set
CONFIG_DEBUG_RODATA_TEST=y
# CONFIG_DEBUG_OBJECTS is not set
-# CONFIG_SLUB_DEBUG_ON is not set
+CONFIG_SLUB_DEBUG_ON=y
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
-# CONFIG_DEBUG_KMEMLEAK is not set
+CONFIG_DEBUG_KMEMLEAK=y
+CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE=16000
+# CONFIG_DEBUG_KMEMLEAK_TEST is not set
+# CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set
+CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_VM is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
@@ -10667,8 +10672,12 @@ CONFIG_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_CC_HAS_KASAN_GENERIC=y
-# CONFIG_KASAN is not set
+CONFIG_KASAN=y
+CONFIG_KASAN_GENERIC=y
+# CONFIG_KASAN_OUTLINE is not set
+CONFIG_KASAN_INLINE=y
CONFIG_KASAN_STACK=1
+CONFIG_TEST_KASAN=m
# end of Memory Debugging
CONFIG_ARCH_HAS_KCOV=y
@@ -10771,8 +10780,10 @@ CONFIG_BUG_ON_DATA_CORRUPTION=y
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
-# CONFIG_UBSAN is not set
-CONFIG_UBSAN_ALIGNMENT=y
+CONFIG_UBSAN=y
+CONFIG_UBSAN_SANITIZE_ALL=y
+CONFIG_UBSAN_NO_ALIGNMENT=y
+# CONFIG_TEST_UBSAN is not set
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
CONFIG_STRICT_DEVMEM=y
CONFIG_IO_STRICT_DEVMEM=y
The debug options apparently kill structleak, which may or may not impact the test conditions. @behlendorf: is this a case of ZSTD requesting userspace allocation for kernel memory, or is this a case of memory exhaustion resulting it requesting memory at an illegal address? I'll run the test loop overnight and try to catch dmesg through a virtual ttys0 in case it hard-crashes the system. Any specific procedures for tracking down issues once dmesg is showing sanitizer or allocator issues?
The overhead from all of the debug options is massive, a single iteration fo fill/flush the zpool is taking from 10-40m (!!). If we have a race to initialize some struct element, we might never "lose" the race like this. The zpool iostat
write bandwidth counters aren't even hitting 10MB/s anymore (compression ratio ~3.15x, dedup is minimal - 1.02)
I'll leave it to run overnight, but don't have high hopes for this. Might need to back off some of the debug options and try to induce the crash, and try to manually unwind the calls and states when it trips since those illegal pointer values are coming from somewhere.
somewhere.
Yeah, thats the big issue, if we knew what was causing this (ZFS, ZSTD, ZSTD-on-ZFS, Hypervisor, A combination of those) we could've it fixed in a day or two (or delegate to upstream).
That being said: I'm glad the improvements in 1.4.8 make it harder to actually abuse this, if it happens to be abusable (as both me and Brainslayer are not able to reproduce it as of yet)
just as a hint. for testing on a physical system (not a emulator) just add mem=1G to the kernel commandline, for running just with 1 Gigs of ram.
just as a hint. for testing on a physical system (not a emulator) just add mem=1G to the kernel commandline, for running just with 1 Gigs of ram.
Thats actually a pretty good hint to force this issue sooner on physical, assuming it's present on bare metal..
Charming:
Dec 24 03:39:41 svl-arch00 kernel: kmemleak: 3 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
^C
[root@svl-arch00 vagrant]# cat /sys/kernel/debug/kmemleak
[root@svl-arch00 vagrant]#
Given that we're in "prime time to do bad things" (nobody's at work anywhere so we get to do massive infra overhauls over the next week), i might have to plan out the build and test cycles a bit. Have a couple of enterprise NIDS stacks and a Ceph cluster awaiting some love today - including onsite time at (human-free) client facilities, so my apologies in advance for the patchy effort today.
If this is in fact a data race, and since we have a grsec subscription, and because Open Source Security did the work for backporting KTSAN; we might be able to rig up a testbed with a data race detector on 5.4. I need to talk to spender (whom i assume to be lurking this thread) about that to see how it would work, but may be an option. Alternatively we might be able to stand up a newer kernel tree on upstream with KTSAN in play. Thing is, they clean up the C code (including some in ZFS) a bunch to help the compiler plugins to their work, so they may have inadvertently fixed whatever's happening here (i've not tested this against the grsec branch at all yet). If this gets so bad as to ask them to track it down, they're the best (by a long shot) kernel unf...rs i've ever seen, but its time consuming and not cheap.
Since the current "debug all the things" build isn't able to repro the error, i'm a teensy bit concerned that we might have some actual struct-format issue somewhere with a bad field (like an int getting a long written into it) being "buffered out" by the memory debugging space, or worse that with structleak disabled this might be some sort of nasty interaction with that. There's always the option of it being a data race (some struct field not initialized appropriately before being used as a pointer). Over none of the testing last night did memory usage exceed 60% so - either its too slow to flood, too padded to fail, or something else is afoot.
Figure the next step is to build with debug symbols, with kasan and kmemleak, but without slub and page allocator debug enabled since those appear to have REALLY slowed things down. @behlendorf: any thoughts on a "fast debug config with coverage that can still lose races?" Same question to all discussion members and spender if in fact lurking.
I'm heading off to our first client site in a few minutes, should be there in ~1h and can kick off a debug build while i revamp their NIDS stack (which also run on ZFS with a lot of CPU contention on PF_RING and NIDS threads). Actual test runs might be more tricky since they take a while to fail and the next stop may only be an hour or two. All of this is compounded by my service dog having eaten a bunch of things dogs shouldn't last night and spending his evening in the ER - he's stable and recuperating at this point, but any change in his state will obviously override engagements and testing (apparently being a fully trained and heavily experienced service animal only goes to further the advanced nature of crimes committed - the crime rate is lower than with a pet, the audacity of crimes committed is way higher however)
Wow, "that was easy..." first iteration on the thinned out debug kernel gives us:
Dec 25 01:00:21 svl-arch00 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000004
Dec 25 01:00:21 svl-arch00 kernel: #PF: supervisor read access in kernel mode
Dec 25 01:00:21 svl-arch00 kernel: #PF: error_code(0x0000) - not-present page
Dec 25 01:00:21 svl-arch00 kernel: PGD 0 P4D 0
Dec 25 01:00:21 svl-arch00 kernel: Oops: 0000 [#1] SMP
Dec 25 01:00:21 svl-arch00 kernel: CPU: 1 PID: 335 Comm: z_wr_iss Tainted: G T 5.4.85 #1
Dec 25 01:00:21 svl-arch00 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
Dec 25 01:00:21 svl-arch00 kernel: RIP: 0010:zfs_ZSTD_compressBlock_btultra2+0x2912/0x4bf0
Dec 25 01:00:21 svl-arch00 kernel: Code: 00 85 db 0f 84 fc f8 ff ff 8b b4 24 dc 00 00 00 8b 48 0c 48 c7 84 24 7c 01 00 00 00 00 00 00 c7 84 24 84 01 00 00 00 00 00 00 <8b> 50 04 29 de 48 8b 9c 24 b0 00 00 00 48 8d 04 f5 00 00 00 00 48
Dec 25 01:00:21 svl-arch00 kernel: RSP: 0018:ffffbb22c33478b0 EFLAGS: 00010293
Dec 25 01:00:21 svl-arch00 kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 000000000000000c
Dec 25 01:00:21 svl-arch00 kernel: RDX: 0000000000002581 RSI: ffffbb22cbd4fdd0 RDI: ffffbb22cbd73f48
Dec 25 01:00:21 svl-arch00 kernel: RBP: ffffbb22c3347a70 R08: 0000000000002e25 R09: 0000000000002ca0
Dec 25 01:00:21 svl-arch00 kernel: R10: ffffbb22cbd4fe78 R11: 0000000000000007 R12: 00000000000276e0
Dec 25 01:00:21 svl-arch00 kernel: R13: ffffbb22d3e706e0 R14: ffffbb22d3e89000 R15: ffff91bde0755360
Dec 25 01:00:21 svl-arch00 kernel: FS: 0000000000000000(0000) GS:ffff91bdfba80000(0000) knlGS:0000000000000000
Dec 25 01:00:21 svl-arch00 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 25 01:00:21 svl-arch00 kernel: CR2: 0000000000000004 CR3: 000000013a4ff003 CR4: 0000000000360ee0
Dec 25 01:00:21 svl-arch00 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 25 01:00:21 svl-arch00 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 25 01:00:21 svl-arch00 kernel: Call Trace:
Dec 25 01:00:21 svl-arch00 kernel: zfs_ZSTD_compressBlock_internal+0x1d1/0x4d0
Dec 25 01:00:21 svl-arch00 kernel: zfs_ZSTD_compressContinue_internal+0x667/0x9b0
Dec 25 01:00:21 svl-arch00 kernel: zfs_ZSTD_compressEnd+0x35/0x380
Dec 25 01:00:21 svl-arch00 kernel: ZSTD_compressStream_generic+0x42e/0x4f0
Dec 25 01:00:21 svl-arch00 kernel: zfs_ZSTD_compress2+0xe3/0x280
Dec 25 01:00:21 svl-arch00 kernel: zfs_zstd_compress+0xdd/0x170
Dec 25 01:00:21 svl-arch00 kernel: zio_compress_data+0xc1/0xf0
Dec 25 01:00:21 svl-arch00 kernel: zio_write_compress+0x44b/0x8f0
Dec 25 01:00:21 svl-arch00 kernel: ? __wake_up_common_lock+0x86/0xb0
Dec 25 01:00:21 svl-arch00 kernel: zio_execute+0x8a/0x130
Dec 25 01:00:21 svl-arch00 kernel: taskq_thread+0x2ce/0x510
Dec 25 01:00:21 svl-arch00 kernel: ? try_to_wake_up.isra.0+0x3a0/0x3a0
Dec 25 01:00:21 svl-arch00 kernel: ? zio_destroy+0xe0/0xe0
Dec 25 01:00:21 svl-arch00 kernel: ? taskq_thread_spawn+0x60/0x60
Dec 25 01:00:21 svl-arch00 kernel: kthread+0x10f/0x130
Dec 25 01:00:21 svl-arch00 kernel: ? kthread_associate_blkcg+0x90/0x90
Dec 25 01:00:21 svl-arch00 kernel: ret_from_fork+0x1f/0x40
Dec 25 01:00:21 svl-arch00 kernel: Modules linked in: intel_rapl_msr intel_rapl_common intel_pmc_core_pltdrv intel_pmc_core kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel joydev mousedev rapl evdev qxl input_leds led_class uio_pdrv_genirq uio psmouse ttm pcspkr mac_hid drm_kms_helper fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_piix4 qemu_fw_cfg button sch_fq_codel drm fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 ata_generic pata_acpi virtio_net net_failover virtio_balloon virtio_blk failover serio_raw atkbd libps2 crc32_pclmul crc32c_intel aesni_intel glue_helper crypto_simd cryptd uhci_hcd ehci_pci ehci_hcd virtio_pci usbcore floppy i8042 ata_piix serio usb_common
Dec 25 01:00:21 svl-arch00 kernel: CR2: 0000000000000004
Dec 25 01:00:21 svl-arch00 kernel: ---[ end trace 80233bb0b1cd7e41 ]---
Dec 25 01:00:21 svl-arch00 kernel: RIP: 0010:zfs_ZSTD_compressBlock_btultra2+0x2912/0x4bf0
Dec 25 01:00:21 svl-arch00 kernel: Code: 00 85 db 0f 84 fc f8 ff ff 8b b4 24 dc 00 00 00 8b 48 0c 48 c7 84 24 7c 01 00 00 00 00 00 00 c7 84 24 84 01 00 00 00 00 00 00 <8b> 50 04 29 de 48 8b 9c 24 b0 00 00 00 48 8d 04 f5 00 00 00 00 48
Dec 25 01:00:21 svl-arch00 kernel: RSP: 0018:ffffbb22c33478b0 EFLAGS: 00010293
Dec 25 01:00:21 svl-arch00 kernel: RAX: 0000000000000000 RBX: 0000000000000003 RCX: 000000000000000c
Dec 25 01:00:21 svl-arch00 kernel: RDX: 0000000000002581 RSI: ffffbb22cbd4fdd0 RDI: ffffbb22cbd73f48
Dec 25 01:00:21 svl-arch00 kernel: RBP: ffffbb22c3347a70 R08: 0000000000002e25 R09: 0000000000002ca0
Dec 25 01:00:21 svl-arch00 kernel: R10: ffffbb22cbd4fe78 R11: 0000000000000007 R12: 00000000000276e0
Dec 25 01:00:21 svl-arch00 kernel: R13: ffffbb22d3e706e0 R14: ffffbb22d3e89000 R15: ffff91bde0755360
Dec 25 01:00:21 svl-arch00 kernel: FS: 0000000000000000(0000) GS:ffff91bdfba80000(0000) knlGS:0000000000000000
Dec 25 01:00:21 svl-arch00 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 25 01:00:21 svl-arch00 kernel: CR2: 0000000000000004 CR3: 000000013a4ff003 CR4: 0000000000360ee0
Dec 25 01:00:21 svl-arch00 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 25 01:00:21 svl-arch00 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
The extra memory debugging was causing enough of a delay to keep the bug from tripping - write speed went from a few MB/s to ~70.
Dec 25 01:00:21 svl-arch00 kernel: zfs_ZSTD_compressEnd+0x35/0x380
Dec 25 01:00:21 svl-arch00 kernel: ZSTD_compressStream_generic+0x42e/0x4f0
Dec 25 01:00:21 svl-arch00 kernel: zfs_ZSTD_compress2+0xe3/0x280
I see it's dipping into ZSTD_compressStream_generic() which doesn't have the namespace kludge so I guess it might be using the in-kernel version of the symbol. To rule it out, could you add this line to the end of module/zstd/include/zstd_compat_wrapper.h
?
#define ZSTD_compressStream_generic zfs_ZSTD_compressStream_generic
This is almost certainly benign in this case and would probably only affect in-tree kernel builds, but better safe than sorry.
While we're at it, would you mind adding this line at the start of module/zstd/lib/zstd.c
:
#include "os/linux/spl/sys/debug.h"
... then these two lines at the start of ZSTD_compressBlock_internal
:
ASSERT(zc);
ASSERT(dst);
... and these two lines at the start of ZSTD_compress2
:
ASSERT(cctx);
ASSERT(dst);
... and what the heck, this line beneath hdr = (zfs_zstdhdr_t *)d_start;
in zfs_zstd_compress() in module/zstd/zfs_zstd.c
:
ASSERT(hdr->data);
I expect one of those ASSERT()
s to fire for the repro case... probably that last one.
(Sorry for the edits, had to check that the suggestions actually compiled... :grimacing: )
Dec 25 01:00:21 svl-arch00 kernel: zfs_ZSTD_compressEnd+0x35/0x380 Dec 25 01:00:21 svl-arch00 kernel: ZSTD_compressStream_generic+0x42e/0x4f0 Dec 25 01:00:21 svl-arch00 kernel: zfs_ZSTD_compress2+0xe3/0x280
I see it's dipping into ZSTD_compressStream_generic() which doesn't have the namespace kludge so I guess it might be using the in-kernel version of the symbol. To rule it out, could you add this line to the end of
module/zstd/include/zstd_compat_wrapper.h
?
#define ZSTD_compressStream_generic zfs_ZSTD_compressStream_generic
This is almost certainly benign in this case and would probably only affect in-tree kernel builds, but better safe than sorry.
Good catch, should be in there though. But it indeed isn't. Checked in both the current master build as my 1.4.8 PR.
Added, building now @ 1.4.8, shall leave to run overnight with kasan and kmemleak
@Ornias1993 you may also check my git. which includes a 1.4.8 version. since i may did it different than you. unintentionally i did already cover this issue you mentioned here in my tree
@BrainSlayer I checked yours, I might take your compat wrapper instead of my custom one... I'm a lot more certain with your work on the compatwrapper than mine.
@Ornias1993 i just used the small nm / awk script to generate a symbol output. so i update the table. only with ZSTD_isError i still have small problem since its also used by the main code. so adding it to the wrapper will cause a compile error. but i can fix this too in some minutes. but it should not be present in kernel code.
I've always kept that one out of the compat wrapper...
I do have some issues with zloop though, before adding:
#define ZSTD_compressStream_generic zfs_ZSTD_compressStream_generic
It ran just fine, after adding it it started crashing after a few minutes. Could be unrelated though...
System information
Describe the problem you're observing
Describe how to reproduce the problem
dedup=on checksum=sha512 compression=zstd-19
while true do; (rsync -a /usr/* <path to dataset mount>) || (rm -rf <path to mount>/); done
Include any warning/errors/backtraces from the system logs