microsoft / snmalloc

Message passing based allocator
MIT License

Question: when snmalloc release memory back to os #257

Closed: photoszzt closed this issue 2 years ago

photoszzt commented 3 years ago

[screenshot: RSS over time, snmalloc on the left, jemalloc on the right]

I'm trying different memory allocators for a key-value store project. On the left is the RSS usage with snmalloc (using the Rust crate) and on the right is jemalloc (using the tikv-jemalloc crate, which provides bindings for 5.2.1). I wonder when snmalloc will release memory back to the OS?

mjp41 commented 3 years ago

From the screen captures, I am assuming you are using a Mac. It should call

https://github.com/microsoft/snmalloc/blob/8990c349119bc848fe3c776ce02888d82abdef7f/src/pal/pal_bsd.h#L34-L38

whenever it has an unused chunk (1MiB by default) of contiguous memory. It could be that your fragmentation pattern means we never hit that threshold (fairly unlikely), or something else is going on.

I have not benchmarked on a Mac, but it also depends on fragmentation. Mac just uses the standard BSD Platform Abstraction Layer. If there is something different we should call, we can add it to:

https://github.com/microsoft/snmalloc/blob/8990c349119bc848fe3c776ce02888d82abdef7f/src/pal/pal_apple.h#L15-L38

Other platforms do different things. For instance, Windows only releases memory back to the OS if there is a "LowMemoryNotification".

davidchisnall commented 3 years ago

macOS / iOS can also provide a low-memory notification, but only via libdispatch. At some point, I'd like to wire that up as an optional feature (not used by people who don't want to take a libdispatch dependency from their memory allocator) but I haven't got around to it (if someone else gets to it first, I'd be very happy to review a patch!).

Note that, because we use MADV_FREE on macOS, we don't actually return memory to the OS; we just mark the memory so that the OS can reclaim it if memory is constrained. The madvise call will just clear the modified flag in the page tables and set a flag in the metadata related to that page. If the system still has plenty of memory, nothing happens. If you do encounter memory pressure, then the OS will mark the pages as read-only, check the modified flag, and if it's still clear then reclaim the page.

This means that, if we don't encounter memory pressure, reusing that memory for subsequent allocations is free. It's only when the OS starts to run low on physical memory that it needs to reclaim the pages. You won't see a drop in RSS unless the OS needs the physical pages for something else.
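For illustration, here is a minimal standalone sketch (not snmalloc code) of this behaviour at the madvise level; the size and fill patterns are arbitrary:

#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstring>

int main()
{
  constexpr size_t size = 1 << 20; // 1 MiB; mmap returns page-aligned memory
  void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANON, -1, 0);
  assert(p != MAP_FAILED);

  memset(p, 0xAB, size); // dirty the pages so they count towards RSS

  // Tell the OS the contents are no longer needed. RSS typically does not
  // drop immediately; the pages are only reclaimed under memory pressure.
  madvise(p, size, MADV_FREE);

  // Reuse is cheap: if the pages were not reclaimed, the stale contents may
  // still be there; if they were, we transparently get fresh zero pages.
  memset(p, 0xCD, size);

  munmap(p, size);
}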

photoszzt commented 3 years ago

Oh, I forgot to mention the OS. I'm running the experiment on a Linux server and viewing the traces on my laptop.

davidchisnall commented 3 years ago

Ah, on Linux the notify_not_using PAL function is the default POSIX one, which is a no-op. We should consider adding an MADV_FREE call to the Linux PAL. I believe @sylvanc found it too slow when he tested it a couple of years back, but that code path is now a lot colder and so it might be fine.

mjp41 commented 3 years ago

I think we were using madvise(MADV_DONTNEED). I am not sure madvise(MADV_FREE) really existed on Linux back then, at least not in the lazy form it has now, like FreeBSD's.

@photoszzt if you want to experiment you could add

 static void notify_not_using(void* p, size_t size) noexcept 
 { 
   SNMALLOC_ASSERT(is_aligned_block<OS::page_size>(p, size)); 
   madvise(p, size, MADV_FREE); 
 } 

to the pal/pal_linux.h file.

I might try this next week to see what the results look like on our benchmarks.

photoszzt commented 3 years ago

[man page excerpts for MADV_FREE and MADV_DONTNEED]

From the man page, MADV_FREE was added in kernel 4.5. The other relevant flag from the man page is MADV_DONTNEED. It needs to fall back to MADV_DONTNEED if MADV_FREE is not available. I will try MADV_FREE.
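A hedged sketch of what that fallback could look like, following the notify_not_using snippet above (illustrative only, not the actual snmalloc Linux PAL):

static void notify_not_using(void* p, size_t size) noexcept
{
  SNMALLOC_ASSERT(is_aligned_block<OS::page_size>(p, size));
#ifdef MADV_FREE
  // The #ifdef only checks the headers; a running kernel older than 4.5
  // rejects MADV_FREE with EINVAL, so also fall back at runtime.
  if (madvise(p, size, MADV_FREE) != 0)
    madvise(p, size, MADV_DONTNEED);
#else
  madvise(p, size, MADV_DONTNEED);
#endif
}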

mjp41 commented 3 years ago

So I ran a mini-experiment today with mimalloc-bench. I did this a few times, with similar results, so I think showing a single run is representative enough.

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims

cfrac mi           06.14 3420 6.14 0.00 0 376
cfrac sn_MADV_FREE 06.19 8576 6.18 0.00 0 428
cfrac sn_0.5.1     06.20 8364 6.20 0.00 0 423

espresso mi           05.18 5924 5.15 0.02 0 609
espresso sn_MADV_FREE 05.10 10212 5.06 0.03 0 388
espresso sn_0.5.1     05.09 10068 5.07 0.02 0 387

barnes mi           02.87 66440 2.85 0.02 0 230
barnes sn_MADV_FREE 02.89 71420 2.87 0.01 0 274
barnes sn_0.5.1     02.88 71348 2.86 0.02 0 264

leanN mi           29.13 580468 119.44 1.35 0 210063
leanN sn_MADV_FREE 29.69 544132 118.32 1.90 0 208663
leanN sn_0.5.1     28.72 533828 117.52 0.73 0 3953

redis mi           4.251 39072 1.80 0.33 0 1235
redis sn_MADV_FREE 4.015 40628 1.76 0.25 0 650
redis sn_0.5.1     4.064 40568 1.67 0.37 0 535

alloc-test1 mi           03.77 17928 3.76 0.01 0 2813
alloc-test1 sn_MADV_FREE 04.00 17296 3.99 0.00 0 286
alloc-test1 sn_0.5.1     03.96 17244 3.95 0.01 0 270

alloc-testN mi           03.93 71144 61.76 0.03 0 590
alloc-testN sn_MADV_FREE 04.02 40348 63.40 0.02 0 549
alloc-testN sn_0.5.1     04.00 40404 63.27 0.02 0 495

larsonN mi           1.311 704592 322.14 2.43 0 121479
larsonN sn_MADV_FREE 1.375 768220 365.71 3.82 0 15799
larsonN sn_0.5.1     1.222 740096 359.38 0.86 0 17731

sh6benchN mi           00.21 232004 1.88 0.65 0 57769
sh6benchN sn_MADV_FREE 00.10 272016 2.45 0.09 0 888
sh6benchN sn_0.5.1     00.08 301520 2.62 0.03 0 654

sh8benchN mi           00.24 455888 4.74 0.14 0 3347
sh8benchN sn_MADV_FREE 00.28 463572 8.44 1.85 0 1643
sh8benchN sn_0.5.1     00.19 488028 5.39 0.12 0 1043

xmalloc-testN mi           0.215 552676 311.91 16.89 0 31030
xmalloc-testN sn_MADV_FREE 0.265 705740 194.16 64.61 0 2212
xmalloc-testN sn_0.5.1     0.256 708948 200.46 59.69 0 1791

cache-scratch1 mi           01.66 5752 1.66 0.00 0 260
cache-scratch1 sn_MADV_FREE 01.65 7612 1.65 0.00 0 224
cache-scratch1 sn_0.5.1     01.65 7708 1.65 0.00 0 228

cache-scratchN mi           00.08 10208 3.01 0.01 0 652
cache-scratchN sn_MADV_FREE 00.10 83800 3.24 0.02 0 479
cache-scratchN sn_0.5.1     00.10 83860 3.32 0.02 0 510

mstressN mi           04.41 3795292 51.85 1.92 0 27666
mstressN sn_MADV_FREE 06.37 2199452 67.82 3.95 0 7329
mstressN sn_0.5.1     04.21 2137956 56.08 1.66 0 5351

rptestN mi           28.188 3128708 232.30 37.47 0 1111094
rptestN sn_MADV_FREE 28.920 2396228 331.20 45.02 0 821576
rptestN sn_0.5.1     27.978 2283588 330.20 9.03 0 7919

A few of the high-churn benchmarks show a fairly large increase in system time, which then leads to a drop in overall performance. None of these benchmarks measure RSS over time, so we can't see how quickly we return memory.

The benchmarking suggests we want to hand pages back more slowly than we currently do, as we see quite an increase in page faults (the page-reclaims column) in leanN and rptestN. Overall, switching this on by default does not look too painful, but I would sooner investigate batching the returns, so we don't increase the page faults as badly.
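For illustration, a rough sketch of the batching idea (hypothetical helper, not snmalloc code): queue the freed ranges and only madvise them once enough bytes have accumulated. A real version would also need to drop ranges that are reused before the flush.

#include <sys/mman.h>
#include <cstddef>
#include <utility>
#include <vector>

class DeferredDecommit
{
  // Flush once this many bytes are pending; the value is arbitrary.
  static constexpr size_t threshold = 16 << 20;

  std::vector<std::pair<void*, size_t>> pending;
  size_t pending_bytes = 0;

public:
  void notify_not_using(void* p, size_t size)
  {
    pending.emplace_back(p, size);
    pending_bytes += size;
    if (pending_bytes >= threshold)
      flush();
  }

  void flush()
  {
    // One batch of madvise calls instead of one per deallocation.
    for (auto& [p, size] : pending)
      madvise(p, size, MADV_FREE);
    pending.clear();
    pending_bytes = 0;
  }
};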

mjp41 commented 3 years ago

@photoszzt did adding the MADV_FREE call give you the correct kind of memory release graph?

photoszzt commented 3 years ago

@mjp41 On the left is using MADV_FREE and on the right is using MADV_DONTNEED. [screenshot: RSS traces for the two flags]

photoszzt commented 3 years ago

Actually, I might have hit some bug on NFS. After rebuilding the binary, the MADV_FREE figure looks like the following: [updated screenshot]

mjp41 commented 3 years ago

Thanks for investigating. I will try to put some time into minimising the regressions for turning this on.

mjp41 commented 3 years ago

@nwf pointed me at a talk by Jason Evans (of jemalloc fame), on when to free to the OS the pages associated with a deallocation:

Tick Tock, malloc Needs a Clock https://dl.acm.org/doi/10.1145/2742580.2742807

It talks about using a sliding maximum to decide when to return memory: that is, keep track of the maximum usage over the last 5 seconds, and return to the OS any memory beyond that.
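As a tiny sketch of that sliding-maximum idea (illustrative, not jemalloc's or snmalloc's code): track the peak demand in each of the last few epochs, and treat anything held beyond the recent peak as a candidate for returning to the OS.

#include <algorithm>
#include <array>
#include <cstddef>

class SlidingMax
{
  static constexpr size_t epochs = 5; // e.g. five 1-second epochs
  std::array<size_t, epochs> peaks{};
  size_t current = 0;

public:
  // Called on allocation activity with the current bytes in use.
  void record(size_t in_use)
  {
    peaks[current] = std::max(peaks[current], in_use);
  }

  // Called once per epoch (e.g. every second) to slide the window.
  void tick()
  {
    current = (current + 1) % epochs;
    peaks[current] = 0;
  }

  // Memory held beyond this amount is a candidate for returning to the OS.
  size_t recent_peak() const
  {
    return *std::max_element(peaks.begin(), peaks.end());
  }
};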

I think there are many differences from jemalloc that make this quite a different problem. In particular, jemalloc uses runs internally, which can lead to strange fragmentation problems that I don't see in snmalloc. However, by tracking the maximum, we could do something simpler, as we are only really interested in returning chunks to the OS.

Here is a sketch of a design. Let's ignore large allocations, i.e. allocations above the "Chunk" threshold; perhaps we always free those back to the OS immediately.

Objective: have some tick that represents a reasonable amount of time, and once that has passed, return memory to the OS. Based on the jemalloc talk this could be, say, 500ms. We then have three stacks for the unused superslabs (i.e. large class = 0):

superslab_stack[0]: items that have been freed since the current tick and haven't been reused yet
superslab_stack[1]: chunks that were freed in the previous tick and were not reused
superslab_stack[2]: chunks that were freed before the previous tick, and whose backing pages have been returned to the OS (i.e. MADV_FREE, DECOMMIT, ...)

Whenever we look for chunks, we look for them in order from 0 to 2. We return freshly freed chunks to superslab_stack[0].

We could implement the tick efficiently by treating the first two stacks (0, 1) like a cyclic ring buffer, so advancing the tick just rotates which stack is considered the current one.

The tick could be introduced by checking the time whenever a superslab_stack is accessed; if sufficient time has passed, then this is considered a tick. Updating the tick should only be done by a single thread, so a try_lock on the flag_lock could be used.

We can increase the number of stacks to spread the cost of deallocation out over time.

I think the worst bit is how we get the time in a cross-architecture and cross-platform way. I am not sure what we should do on OE, although there is no way to decommit the memory on SGX 1, so maybe it doesn't matter.
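For illustration, a rough sketch of the three-stack design above, with hypothetical names (Chunk, DecayingChunkStacks) standing in for snmalloc's real types; synchronization of the stacks themselves and the actual decommit call are elided:

#include <chrono>
#include <mutex>
#include <vector>

struct Chunk; // placeholder for an unused superslab-sized chunk

class DecayingChunkStacks
{
  using clock = std::chrono::steady_clock;
  static constexpr auto tick_length = std::chrono::milliseconds(500);

  std::vector<Chunk*> stacks[3]; // [0] current, [1] previous tick, [2] decommitted
  clock::time_point last_tick = clock::now();
  std::mutex tick_lock; // stand-in for the flag_lock mentioned above

  static void decommit(Chunk*) {} // placeholder for the PAL's notify_not_using

  void maybe_tick()
  {
    // Only one thread advances the tick; others just carry on.
    std::unique_lock lock(tick_lock, std::try_to_lock);
    if (!lock.owns_lock() || clock::now() - last_tick < tick_length)
      return;
    last_tick = clock::now();

    // Anything still unused after a full tick has its backing pages
    // returned (MADV_FREE / DECOMMIT) and moves to the oldest stack.
    for (Chunk* c : stacks[1])
    {
      decommit(c);
      stacks[2].push_back(c);
    }
    stacks[1] = std::move(stacks[0]);
    stacks[0].clear();
  }

public:
  void push(Chunk* c)
  {
    maybe_tick();
    stacks[0].push_back(c); // freshly freed chunks go to the newest stack
  }

  Chunk* pop()
  {
    maybe_tick();
    for (auto& s : stacks) // search newest to oldest, as described above
    {
      if (!s.empty())
      {
        Chunk* c = s.back();
        s.pop_back();
        return c;
      }
    }
    return nullptr; // caller falls back to requesting more from the OS
  }
};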

Sorry this turned into a brain dump. @nwf, @sylvanc, @plietar, @davidchisnall, @achamayou

davidchisnall commented 3 years ago

@nwf pointed me at a talk by Jonathan Edwards (of jemalloc fame), on when to free to the OS the pages associated with a deallocation:

Jason Evans?

mjp41 commented 3 years ago

Whoops, fixed.

mjp41 commented 3 years ago

#272: this draft PR sketches the idea I outlined above.

mjp41 commented 2 years ago

With 0.6.0 we hand memory back to the OS nicely, and I think this issue can be considered closed.