
Possible memory leak "ExAllocatePoolWithTag failed" in `spl-seg_kmem.c, line 134` #283

Closed EchterAgo closed 1 year ago

EchterAgo commented 1 year ago

When testing #281 I noticed that when copying a 5TB dataset using rclone it always ends in an allocation failure:

*** Assertion failed: 0
***   Source File: H:\dev\openzfs\module\os\windows\spl\spl-seg_kmem.c, line 134

so ExAllocatePoolWithTag failed

memory.txt

image

This seems like a new issue because I was still able to copy the full dataset not that long ago.

I'll try to get some kstat information. Is it possible to get kstat info from the debugger when the failure has already happened? I could also try logging it periodically to a file.

lundman commented 1 year ago

created branch new-kmem

EchterAgo commented 1 year ago

Crashed a very short time after starting the copy operation:

21: kd> k
 # Child-SP          RetAddr               Call Site
00 fffff686`320ebee8 fffff802`26117f82     nt!DbgBreakPointWithStatus
01 fffff686`320ebef0 fffff802`261179f2     nt!KiBugCheckDebugBreak+0x12
02 fffff686`320ebf50 fffff802`25ffd747     nt!KeBugCheck2+0xdd2
03 fffff686`320ec660 fffff802`2601ad7d     nt!KeBugCheckEx+0x107
04 fffff686`320ec6a0 fffff802`25fd01c2     nt!PspSystemThreadStartup$filt$0+0x44
05 fffff686`320ec6e0 fffff802`26007d32     nt!_C_specific_handler+0xa2
06 fffff686`320ec750 fffff802`25eca3c7     nt!RtlpExecuteHandlerForException+0x12
07 fffff686`320ec780 fffff802`25ec94e6     nt!RtlDispatchException+0x297
08 fffff686`320ecea0 fffff802`2601186c     nt!KiDispatchException+0x186
09 fffff686`320ed560 fffff802`2600d2bd     nt!KiExceptionDispatch+0x12c
0a fffff686`320ed740 fffff802`2b972ada     nt!KiPageFault+0x43d
0b fffff686`320ed8d0 fffff802`2b96b8b4     OpenZFS!kmem_findslab+0x5a [H:\dev\openzfs\module\os\windows\spl\spl-kmem.c @ 840] 
0c fffff686`320ed920 fffff802`2b96a3f9     OpenZFS!kmem_error+0x84 [H:\dev\openzfs\module\os\windows\spl\spl-kmem.c @ 871] 
0d fffff686`320ed9e0 fffff802`2b974be7     OpenZFS!kmem_slab_free+0x1e9 [H:\dev\openzfs\module\os\windows\spl\spl-kmem.c @ 1472] 
0e fffff686`320eda60 fffff802`2b96ef76     OpenZFS!kmem_magazine_destroy+0x1c7 [H:\dev\openzfs\module\os\windows\spl\spl-kmem.c @ 1738] 
0f fffff686`320edac0 fffff802`2b9782d4     OpenZFS!kmem_cache_magazine_purge+0x1a6 [H:\dev\openzfs\module\os\windows\spl\spl-kmem.c @ 2990] 
10 fffff686`320edb20 fffff802`2b965bba     OpenZFS!kmem_cache_magazine_resize+0x84 [H:\dev\openzfs\module\os\windows\spl\spl-kmem.c @ 3095] 
11 fffff686`320edb60 fffff802`25f0e6f5     OpenZFS!taskq_thread+0x51a [H:\dev\openzfs\module\os\windows\spl\spl-taskq.c @ 2083] 
12 fffff686`320edc10 fffff802`26006278     nt!PspSystemThreadStartup+0x55
13 fffff686`320edc60 00000000`00000000     nt!KiStartSystemThread+0x28

EchterAgo commented 1 year ago

sp seems to be zeroed.

EchterAgo commented 1 year ago
21: kd> dt sp
Local var @ 0xfffff686320ed8f8 Type kmem_slab*
0xffffbe8d`71bbffb8 
   +0x000 slab_cache       : (null) 
   +0x008 slab_base        : (null) 
   +0x010 slab_link        : avl_node
   +0x028 slab_head        : (null) 
   +0x030 slab_refcnt      : 0n0
   +0x034 slab_chunks      : 0n0
   +0x038 slab_stuck_offset : 0
   +0x03c slab_later_count : 0
   +0x03e slab_flags       : 0
   +0x040 slab_create_time : 0n0

EchterAgo commented 1 year ago

cbuf.txt

EchterAgo commented 1 year ago

sp->slab_cache != cp in kmem_slab_free
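
For context, a heavily simplified illustration of the kind of consistency check that fires here and leads into kmem_error()/kmem_findslab() in the stack above (hypothetical types and helper, not the spl-kmem.c source):

#include <stdio.h>

/* Hypothetical, simplified types - not the spl-kmem.c definitions. */
typedef struct kmem_cache { const char *name; } kmem_cache_t;
typedef struct kmem_slab  { kmem_cache_t *slab_cache; } kmem_slab_t;

static void
slab_free_check_sketch(kmem_cache_t *cp, kmem_slab_t *sp)
{
	if (sp->slab_cache != cp) {
		/* The "sp->slab_cache != cp" case observed above. The real
		 * allocator reports this via kmem_error(), which walks every
		 * cache with kmem_findslab(); if the slab header itself is
		 * zeroed or garbage (as in the dt dump above), that walk
		 * faults. */
		printf("bad free: slab cache %p != cache %p (%s)\n",
		    (void *)sp->slab_cache, (void *)cp, cp->name);
	}
}

int main(void)
{
	kmem_cache_t a = { "cache_a" }, b = { "cache_b" };
	kmem_slab_t  sp = { &a };

	slab_free_check_sketch(&b, &sp);	/* mismatch, as in the crash */
	return 0;
}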

EchterAgo commented 1 year ago

Retrying produced another crash very fast:

11: kd> k
 # Child-SP          RetAddr               Call Site
00 fffff488`330ebb58 fffff800`6db17f82     nt!DbgBreakPointWithStatus
01 fffff488`330ebb60 fffff800`6db179f2     nt!KiBugCheckDebugBreak+0x12
02 fffff488`330ebbc0 fffff800`6d9fd747     nt!KeBugCheck2+0xdd2
03 fffff488`330ec2d0 fffff800`6da1ad7d     nt!KeBugCheckEx+0x107
04 fffff488`330ec310 fffff800`6d9d01c2     nt!PspSystemThreadStartup$filt$0+0x44
05 fffff488`330ec350 fffff800`6da07d32     nt!_C_specific_handler+0xa2
06 fffff488`330ec3c0 fffff800`6d8ca3c7     nt!RtlpExecuteHandlerForException+0x12
07 fffff488`330ec3f0 fffff800`6d8c94e6     nt!RtlDispatchException+0x297
08 fffff488`330ecb10 fffff800`6da1186c     nt!KiDispatchException+0x186
09 fffff488`330ed1d0 fffff800`6da0ce5a     nt!KiExceptionDispatch+0x12c
0a fffff488`330ed3b0 fffff800`75bfb876     nt!KiGeneralProtectionFault+0x31a
0b fffff488`330ed540 fffff800`75bb9f95     OpenZFS!avl_first+0x46 [H:\dev\openzfs\module\avl\avl.c @ 182] 
0c fffff488`330ed570 fffff800`75bb95f7     OpenZFS!kmem_slab_alloc+0x45 [H:\dev\openzfs\module\os\windows\spl\spl-kmem.c @ 1398] 
0d fffff488`330ed5e0 fffff800`75ef5dbf     OpenZFS!kmem_cache_alloc+0x437 [H:\dev\openzfs\module\os\windows\spl\spl-kmem.c @ 2276] 
0e fffff488`330ed660 fffff800`75e0ccb7     OpenZFS!abd_alloc_chunks+0x34f [H:\dev\openzfs\module\os\windows\zfs\abd_os.c @ 260] 
0f fffff488`330ed730 fffff800`75da1447     OpenZFS!abd_alloc+0xe7 [H:\dev\openzfs\module\zfs\abd.c @ 195] 
10 fffff488`330ed7a0 fffff800`75d91463     OpenZFS!arc_get_data_abd+0x87 [H:\dev\openzfs\module\zfs\arc.c @ 4968] 
11 fffff488`330ed800 fffff800`75d99b1c     OpenZFS!arc_hdr_alloc_abd+0x373 [H:\dev\openzfs\module\zfs\arc.c @ 3224] 
12 fffff488`330ed8b0 fffff800`75cf61ea     OpenZFS!arc_write_ready+0x16bc [H:\dev\openzfs\module\zfs\arc.c @ 6649] 
13 fffff488`330eda10 fffff800`75ce9330     OpenZFS!zio_ready+0x22a [H:\dev\openzfs\module\zfs\zio.c @ 4484] 
14 (Inline Function) --------`--------     OpenZFS!__zio_execute+0x2f5 [H:\dev\openzfs\module\zfs\zio.c @ 2298] 
15 fffff488`330edae0 fffff800`75bb5bba     OpenZFS!zio_execute+0x310 [H:\dev\openzfs\module\zfs\zio.c @ 2209] 
16 fffff488`330edb60 fffff800`6d90e6f5     OpenZFS!taskq_thread+0x51a [H:\dev\openzfs\module\os\windows\spl\spl-taskq.c @ 2083] 
17 fffff488`330edc10 fffff800`6da06278     nt!PspSystemThreadStartup+0x55
18 fffff488`330edc60 00000000`00000000     nt!KiStartSystemThread+0x28

node seems to be bad.

EchterAgo commented 1 year ago

image

Crashes on the second iteration.

EchterAgo commented 1 year ago

When I reduce the machine's memory to 4GB it keeps running, but quite slowly.

EchterAgo commented 1 year ago

It seems to run fast for a bit (until it needs to reduce memory usage?), then it stalls while memory usage slowly decreases, and then it starts copying again.

Edit: The slowness seems to be from me having tested with just one CPU core in this run. With 4GB and 12 cores it actually seems to run well.

EchterAgo commented 1 year ago

With 8GB it still runs fine, but increasing the memory to 12GB makes the machine crash almost immediately when starting a copy.

EchterAgo commented 1 year ago

My copy started out fast with relatively stable memory usage. Now that it has reached the point where it would previously crash, it has slowed down to ~25% of the earlier speed and the memory usage shows a bit of a sawtooth pattern:

image

EchterAgo commented 1 year ago

It changed back to stable allocation, but the speed is still quite slow. I think it has copied more now than it could before, though.

EchterAgo commented 1 year ago

Something just happened at ~2.2TB copied, same as before: memory usage jumped to 100%, with only 4.6MB available. I had started logging kstat output just before this.

image

I'll let it run for a bit to see what happens. I've attached the cbuf from right before this happened; let's see if it crashes.

cbuf.txt

EchterAgo commented 1 year ago

Yeah, the system is not really responsive anymore. I'm dumping stacks now; after that I'll post the kstat logs.

EchterAgo commented 1 year ago

stacks.txt

EchterAgo commented 1 year ago

Another slightly later cbuf: cbuf.txt

EchterAgo commented 1 year ago

These are the kstat logs. I started logging them very shortly before this; the memory usage was still doing sawtooth shapes but then abruptly rose: kstat_logs.zip

The logs cover less than a minute; the system went unresponsive very shortly after I started logging, but I don't think the logging itself caused it. I expected the hang based on previous testing, which is why I started logging when I did.

EchterAgo commented 1 year ago

@lundman if you want me to try anything: I can trivially reproduce the first 2 crashes when I run with 12GB RAM (I haven't checked the exact limit). As for the last crash, it seems to take a fair amount of copying with rclone (~2.2TB from my dataset in this case) to hit it.

EchterAgo commented 1 year ago

Does anyone have an idea why a breakpoint like bp /w "segkmem_total_mem_allocated > 13000000000" "spl-seg_kmem.c:141" doesn't work for me? I'd expect it to get hit when memory usage goes over 13GB in this example.

Edit: I know symbols have to be loaded before setting the breakpoint.

lundman commented 1 year ago

OK thanks for taking a look. Busy morning here.

The sawtooth pattern is very encouraging, it means we detect the pressure and we pull back from it. A little aggressively mind you, so some tweaking is needed there.

https://github.com/openzfsonwindows/openzfs/blob/new-kmem/module/os/windows/spl/spl-kmem.c#L5221-L5223

I believe Windows has a tendency to send more than one pressure event. Currently, we increase the amount to free (spl_vm_pages_wanted) each time, and raise the level (spl_vm_pressure_level) step by step from 0 up to 3, with 0 being normal and 3 critical.

Possibly, spl_vm_pressure_level should just go to 1. Or, perhaps, just set spl_vm_pages_wanted once, instead of adding to it.
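
Roughly, the behaviour under discussion boils down to something like this (a sketch only, assuming the handler shape; the real code is at the link above):

#include <stdint.h>

/* Sketch only - not the actual spl-kmem.c handler; the variable names
 * follow the ones discussed above. */
static int64_t spl_vm_pages_wanted;	/* pages the OS asked us to free */
static int     spl_vm_pressure_level;	/* 0 = normal .. 3 = critical */

static void
low_memory_event_sketch(int64_t pages_this_event)
{
	/* Current behaviour: every event accumulates and escalates, so a
	 * burst of Windows notifications quickly drives the level to 3
	 * (critical) and frees very aggressively. */
	spl_vm_pages_wanted += pages_this_event;
	if (spl_vm_pressure_level < 3)
		spl_vm_pressure_level++;

	/* Alternatives mentioned above:
	 *   spl_vm_pages_wanted = pages_this_event;  // set once, don't add
	 *   spl_vm_pressure_level = 1;               // don't escalate to 3
	 */
}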

The 12G crash is most peculiar, but there are some "small machine" vs "large machine" checks; I will go over them again.

As for the breakpoint, when I try to set a conditional breakpoint in VS, it just tells me it isn't supported for kernel debugging.

EchterAgo commented 1 year ago

Conditional breakpoints definitely work for kernel debugging, maybe just not every kind of condition? I have used them multiple times before, but IIRC only for checking locals; I never tried a global. Maybe it was comparing the address of the symbol? I also haven't tried setting them in VS, only in WinDbg.

You can also execute commands when a breakpoint is hit, for example dumping cbuf.

lundman commented 1 year ago

Admittedly, when I tried it was VS2017 or something, and I never tried again. I do have issues with global vars: I can't see their values by hovering, which is a real pain.

lundman commented 1 year ago
Screenshot 2023-11-01 at 15 29 36

So I just assumed it can't do it, due to that message.

EchterAgo commented 1 year ago

Yeah, it seems to just be the VS extension; it does work when setting it in the console.

I also just encountered what seems to be a deadlock in zfs_AcquireForLazyWrite: #313

EchterAgo commented 1 year ago

I just noticed that if I resume my copy after running out of memory at ~2.2TB I can produce the issue much faster. Just happened after only 99GB.

EchterAgo commented 1 year ago

I bisected this now and the first bad commit is 1bc536da04c47cce6bd5bed0cbd53366ca1043eb

I'll still do some more thorough testing to ensure this is the right commit, but so far it looks like it.

Edit: My git bisect log:

git bisect start
# status: waiting for both good and bad commits
# bad: [538d466910f55badbad24c6ce0c7189133eb7695] Update kmem to latest macOS
git bisect bad 538d466910f55badbad24c6ce0c7189133eb7695
# status: waiting for good commit(s), bad commit known
# good: [51ee5642ed49a69a6bbd434a30c85d66391c568b] META: zfswin-2.2.0rc5
git bisect good 51ee5642ed49a69a6bbd434a30c85d66391c568b
# bad: [b508b291a4d18afa4a8bada199e98d561fb8906c] Unmounting snapshots need to open correct zfsvfs
git bisect bad b508b291a4d18afa4a8bada199e98d561fb8906c
# bad: [30f59ca2f5e81a93c09a7f0ecbc9b0b60fab0630] Acquire unmount rwlock in callbacks
git bisect bad 30f59ca2f5e81a93c09a7f0ecbc9b0b60fab0630
# good: [21fff86ab62897995a33c20ba0a3f06726ebe71f] DeleteOnClose should ignore errors
git bisect good 21fff86ab62897995a33c20ba0a3f06726ebe71f
# good: [bf5e2c58d2b1ec3fdf4c6a37fc9dc50f8d2003c9] Detect NULL zp sooner in lazy and fastio
git bisect good bf5e2c58d2b1ec3fdf4c6a37fc9dc50f8d2003c9
# bad: [1bc536da04c47cce6bd5bed0cbd53366ca1043eb] Handle SetFileAllocation more correctly
git bisect bad 1bc536da04c47cce6bd5bed0cbd53366ca1043eb
# good: [fd8bf0d2b92d18b818505413bb7dd8e75fc8decd] Correct CcFileSizes and cache use
git bisect good fd8bf0d2b92d18b818505413bb7dd8e75fc8decd
# first bad commit: [1bc536da04c47cce6bd5bed0cbd53366ca1043eb] Handle SetFileAllocation more correctly

EchterAgo commented 1 year ago

I'm still checking fd8bf0d2b92d18b818505413bb7dd8e75fc8decd more, but it has been running fine for quite a while now, while with 1bc536da04c47cce6bd5bed0cbd53366ca1043eb it crashed within minutes.

EchterAgo commented 1 year ago

I'm also checking new-kmem with 1bc536da04c47cce6bd5bed0cbd53366ca1043eb reverted. We might want to do that as a temporary fix until we figure out what is actually happening.

EchterAgo commented 1 year ago

Of course, reverting 1bc536da04c47cce6bd5bed0cbd53366ca1043eb makes rclone complain again. I'll try with --local-no-preallocate --local-no-sparse, both with and without the revert.

EchterAgo commented 1 year ago

Looks like with --local-no-preallocate --local-no-sparse I can get past the ~2.2TB mark without reverting 1bc536da04c47cce6bd5bed0cbd53366ca1043eb. I think I also have an idea why this starts happening at ~2.2TB: this is where a clean copy of this dataset starts copying many small files. I'll see tomorrow if I can reproduce this more easily.

So the issue seems to be caused by copying many small files with preallocation. Reverting 1bc536da04c47cce6bd5bed0cbd53366ca1043eb or adding --local-no-preallocate --local-no-sparse to rclone is a workaround.

I can probably reproduce this quickly by starting copies from a certain set of files.

Edit: Note that this time my reproducer is in a completely new and separate VM that runs on QEMU with a 144TB zpool consisting of passed-through hard drives. It still behaves exactly the same as my previous VMware VM and also consistently crashes at ~2.2TB of data copied.

EchterAgo commented 1 year ago

Both running the new-kmem branch, one with --local-no-preallocate --local-no-sparse, the other without:

image

lundman commented 1 year ago

Most peculiar; it's the least exciting commit ever and doesn't do much but set values for Windows. I suppose the Windows caches will pre-inflate to that size, but presumably they will be released by us at some point. If we weren't releasing them, the problem should still happen, although I suppose it would take much longer to show, and the memory would sit on the Windows side rather than ballooning in kstat.

EchterAgo commented 1 year ago

I agree, it doesn't really make sense to me yet. I'd like to give it more testing to really be sure it is not fd8bf0d2b92d18b818505413bb7dd8e75fc8decd.

What is curious is that it happens from one moment to the next, which is why I wanted that conditional breakpoint to work. Memory usage seems very stable for hours, but then over the course of a minute it suddenly rises steeply and the machine crashes.

EchterAgo commented 1 year ago

Also, the copy with preallocation disabled is still going, now at 3.5TB. I haven't had a successful copy of my dataset since opening this issue. I have some ideas on how to reproduce this and will try to write a test for it soon.

lundman commented 1 year ago

Looks like we call CcUninitializeCacheMap() every time a file is closed, so that should be OK.
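
For reference, the documented call looks roughly like this on a cleanup/close path (a generic sketch, not the driver's actual code; the helper name and parameters are hypothetical):

#include <ntifs.h>

/* Hypothetical helper illustrating the Cache Manager teardown call. */
static VOID
teardown_cache_map_sketch(PFILE_OBJECT FileObject, LONGLONG FileSize)
{
	LARGE_INTEGER truncate_size;

	truncate_size.QuadPart = FileSize;

	/* Drops the Cache Manager's references for this file so its cached
	 * pages can be reclaimed. If this were missing, memory would pile up
	 * on the Windows side rather than in the kstat-tracked SPL totals. */
	(void)CcUninitializeCacheMap(FileObject, &truncate_size, NULL);
}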

EchterAgo commented 1 year ago

> Most peculiar; it's the least exciting commit ever and doesn't do much but set values for Windows. I suppose the Windows caches will pre-inflate to that size, but presumably they will be released by us at some point. If we weren't releasing them, the problem should still happen, although I suppose it would take much longer to show, and the memory would sit on the Windows side rather than ballooning in kstat.

Also, rclone pre-allocates at most 4095 bytes more than the actual file size; it basically rounds up to the next allocation unit. Even if this happened on every file in the dataset, the overhead would still only be a couple of megabytes.

EchterAgo commented 1 year ago

It might be that my test did not fail prior to 1bc536da04c47cce6bd5bed0cbd53366ca1043eb because rclone would delete the copies with incorrect size.

EchterAgo commented 1 year ago

FYI I tried bac2eda43e322e8081770ddb95464c65606c3969 and it still crashes in the same way when I give the VM 12GB of memory.

lundman commented 1 year ago

OK, thanks. I will see if 12G triggers it here and whether I can spot why.

lundman commented 1 year ago

OK, 12G does crash - if I replace abd_os.c with the old file, it runs. I am unsure what the actual issue is so far.

lundman commented 1 year ago

OK moved all work back into release-2.2.0 - so we can focus on this issue :)

EchterAgo commented 1 year ago

What I'm wondering here is how segkmem_total_mem_allocated can go up to 7815495680 (7.27 GiB) at the point of failure when total_memory is 4294414336 (3.99 GiB), so almost double. Is that expected behaviour?

I also figured out why my conditional breakpoint was not triggering in osif_malloc: I used the wrong kind of quotes 🤦🏻

bp /w "segkmem_total_mem_allocated > total_memory" `spl-seg_kmem.c:141`
bp /w "segkmem_total_mem_allocated > 0x3c700000" `spl-seg_kmem.c:141`

These actually work.

EchterAgo commented 1 year ago

I'll restart my tests with the latest changes.

lundman commented 1 year ago

quotes? ugh :)

hmm yeah, maybe something is using real_total_memory when it should be total_memory.

EchterAgo commented 1 year ago

Yea, real_total_memory is also something I considered.

I want to check some values when it goes over the limit. I figured out that I can access kstat values from the debugger and also conditionally break on them or on access.

EchterAgo commented 1 year ago

Setting the conditional breakpoint in osif_malloc slows things down too much, so I just added the condition in the code and call DbgBreakPoint instead now.
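
Roughly like this (a sketch of the idea, not the exact change):

/* Sketch of the in-code check (my placement, not upstream code): added
 * inside osif_malloc() in spl-seg_kmem.c, after the allocation is
 * accounted for. DbgBreakPoint() traps into the attached kernel
 * debugger. */
if (segkmem_total_mem_allocated > total_memory)
	DbgBreakPoint();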

EchterAgo commented 1 year ago

Something I noticed in spl_free_thread: when segkmem_total_mem_allocated > total_memory, new_spl_free becomes negative. The later check if (new_spl_free > total_memory) new_spl_free = total_memory; then triggers anyway, because the negative signed integer is implicitly converted to a uint64_t for the comparison and therefore compares as larger, so new_spl_free gets set to total_memory even though we are actually already over the limit. I added a new_spl_free > 0 condition in there and am testing right now whether this makes a difference.
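
A minimal standalone illustration of that implicit-conversion trap, using the numbers from earlier in this thread (hypothetical example code, not spl_free_thread itself):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t total_memory = 4294414336ULL;	/* ~4 GiB */
	uint64_t allocated    = 7815495680ULL;	/* ~7.3 GiB, over the limit */

	int64_t new_spl_free = (int64_t)total_memory - (int64_t)allocated;	/* negative */

	/* Mixed signed/unsigned comparison: new_spl_free is implicitly
	 * converted to uint64_t, the negative value becomes huge, the clamp
	 * fires, and being over the limit is masked. */
	if (new_spl_free > total_memory)
		new_spl_free = (int64_t)total_memory;
	printf("clamped: %lld\n", (long long)new_spl_free);	/* 4294414336 */

	/* The guarded version described above keeps the negative value: */
	int64_t guarded = (int64_t)total_memory - (int64_t)allocated;
	if (guarded > 0 && (uint64_t)guarded > total_memory)
		guarded = (int64_t)total_memory;
	printf("guarded: %lld\n", (long long)guarded);	/* -3521081344 */

	return 0;
}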

EchterAgo commented 1 year ago

new_spl_free also gets overwritten later by the spl_enforce_memory_caps / spl_dynamic_memory_cap check (spl-kmem.c:4814), but I'm not sure this is always the case.

EchterAgo commented 1 year ago

No, didn't help :\