microsoft / mimalloc

mimalloc is a compact general purpose allocator with excellent performance.

Question about the external fragmentation in a long running process #632

Open romange opened 1 year ago

romange commented 1 year ago

Hello Daan,

The project I am working on uses mimalloc underneath.

A user reported the following problem: Dragonfly reports 80 GiB of used memory (tracked by aggregating mi_usable_size), while mimalloc stats show 90 GiB committed. He also says that committed memory grows by roughly 0.5% per day while used memory does not change. Assuming my "used" accounting is correct, this means the backend does not leak (otherwise "used" would grow as well). Dragonfly runs 16 threads.
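A minimal sketch of this kind of "used" accounting, with illustrative wrapper names (not Dragonfly's actual code):

```c
#include <mimalloc.h>
#include <stdatomic.h>
#include <stddef.h>

// Illustrative sketch: "used" is tracked by adding mi_usable_size() on every
// allocation and subtracting it again on free.
static _Atomic size_t g_used_bytes = 0;

static void* tracked_malloc(size_t n) {
  void* p = mi_malloc(n);
  if (p != NULL) atomic_fetch_add(&g_used_bytes, mi_usable_size(p));
  return p;
}

static void tracked_free(void* p) {
  if (p != NULL) atomic_fetch_sub(&g_used_bytes, mi_usable_size(p));
  mi_free(p);
}
```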

mimalloc stats output:

heap stats:    peak      total      freed    current       unit      count   
  reserved:   90.1 GiB   90.1 GiB      0       90.1 GiB                        not all freed!
 committed:   90.0 GiB   90.4 GiB  412.0 MiB   90.0 GiB                        not all freed!
   touched:  180.1 MiB  180.1 MiB   53.1 MiB  126.9 MiB                        not all freed!
  segments:    1.4 Ki     1.4 Ki       0        1.4 Ki                         not all freed!
     pages:  834.0 Ki   834.2 Ki     288      833.9 Ki                         not all freed!
   commits:    508    
   threads:     16         16          0         16                            not all freed!
  searches:     0.0 avg
numa nodes:       1
   elapsed:  111589.444 s
   process: user: 227549.812 s, system: 255655.955 s, faults: 0, rss: 89.2 GiB, commit: 90.0 GiB
  1. Having a 12.5% increase of "committed" over "used" seems normal to me. What ratios do you observe for long-running processes?
  2. I obtained a sample of area sizes (via mi_heap_visit_blocks(.., false, ...)) from a single heap, i.e. from a single thread out of 16. That heap has 5.9 GiB of committed memory and 5 GiB used. When looking at the unused memory I see lots of areas that have unused blocks; see the chart below. Why does this happen? Specifically, can mi_malloc create a new area of block size 80 if there is another one with unused memory? "Unused" in this context means committed minus used.

I am also attaching the detailed table of areas from which the chart was created. The first column in the table is the number of occurrences of the triple <reserved, committed, used>: Unused memory vs block size (sorted-malloc.txt).
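A sketch of how such a per-area sample can be collected with mi_heap_visit_blocks (the printing is illustrative; note that depending on the mimalloc version, area->used is a count of allocated blocks or a byte total — the sketch assumes a block count):

```c
#include <mimalloc.h>
#include <stdio.h>

// Visitor: with visit_all_blocks == false it is called once per area (block == NULL).
static bool visit_area(const mi_heap_t* heap, const mi_heap_area_t* area,
                       void* block, size_t block_size, void* arg) {
  (void)heap; (void)block; (void)block_size; (void)arg;
  size_t used_bytes   = area->used * area->block_size;   // assumes used is a block count
  size_t unused_bytes = (area->committed > used_bytes) ? area->committed - used_bytes : 0;
  printf("block_size=%zu reserved=%zu committed=%zu used=%zu unused=%zu\n",
         area->block_size, area->reserved, area->committed, used_bytes, unused_bytes);
  return true;  // keep visiting
}

static void dump_heap_areas(void) {
  mi_heap_visit_blocks(mi_heap_get_default(), /*visit_all_blocks=*/false, &visit_area, NULL);
}
```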

daanx commented 1 year ago

Thanks @romange for the detailed feedback (and apologies for the late reply).

  1. A 12.5% increase of committed over used is excellent (!) :-) For a long-running process there will always be fragmentation and thus more committed than actually used.

  2. It should be the case that, for each thread, we never allocate a fresh (mimalloc) page when there is still a page with free blocks of the right size (see page.c:mi_page_queue_find_free_ex). However, we slowly extend the "top" of a page (page->capacity vs page->reserved items) and thus touch its memory in roughly 4 KiB increments, so a partially used page only increases the RSS by the touched part.

  3. We do, however, commit the memory for a mimalloc page all at once, which may affect the measurements. We could also commit within a page on demand; I tried this once and it helps, but it is also relatively expensive ...

So, for your measurements it is best to distinguish between the unused but untouched part (the difference between the capacity and the total reserved space) and the unused but touched part (the blocks in the free lists).

I suspect the graph with the "unused but untouched" should have much less unused memory? (let me know :-) )
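In terms of the mi_heap_area_t fields from the visitor sketch above, that split can be approximated as follows (a sketch: area->committed roughly corresponds to the touched capacity of a page and area->reserved to its full reservation; exact field semantics can differ between mimalloc versions):

```c
#include <mimalloc.h>

// Split an area's unused bytes into "untouched" (never extended, does not add
// to RSS) and "touched but free" (blocks sitting on the page's free lists).
// Assumes area->used is a block count.
static void classify_unused(const mi_heap_area_t* area,
                            size_t* untouched, size_t* touched_free) {
  size_t used_bytes = area->used * area->block_size;
  *untouched    = (area->reserved  > area->committed) ? area->reserved  - area->committed : 0;
  *touched_free = (area->committed > used_bytes)      ? area->committed - used_bytes      : 0;
}
```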

romange commented 1 year ago

Hi @daanx, thanks for following up on my questions. Meanwhile, I've been studying the domain myself. Unfortunately, as time passes, the gap in the process between committed and used memory grows well beyond 40-50%, so I need to solve the issue. Below is another graph that shows the distribution of (committed-used) sizes for different block sizes.

[chart: distribution of (committed - used) bytes by block size]

As you said yourself in #645, any improvements on the allocator side will be incremental (they reduce the probability or the severity of fragmentation), because nothing can be done about an under-utilized but non-empty page: the allocator cannot move its contents somewhere else. Therefore, it seems that the only solution (if someone really needs to solve the external fragmentation issue) is app-aware defragmentation. This is what I described in https://github.com/dragonflydb/dragonfly/issues/448

Redis, btw, implemented something similar with jemalloc. They call it "active-defrag". They introduced a function that tells how much a page is utilized: https://github.com/redis/redis/commit/e8099cabd19c4e3a46c94c39e69e13191d43f5eb and use it to move memory around.

We are taking a similar approach with mimalloc. This is the function that I believe will help us to determine whether we need to realloc the ptr that belongs to the underutilized page: https://github.com/romange/helio/pull/27/files#diff-a9d9d3912eb3a1eeced1a0726da5a12f80f764b999f683c40d96a45c8c31884b
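For illustration, the application-level defragmentation step then boils down to something like the sketch below, where page_is_underutilized() is a hypothetical stand-in for the predicate added in that PR (name and signature are assumptions):

```c
#include <mimalloc.h>
#include <stdbool.h>
#include <string.h>

// Hypothetical stand-in for the predicate added in the linked PR: returns true
// if the page that `p` lives on is committed but poorly utilized.
extern bool page_is_underutilized(void* p);

// App-aware defragmentation step: if the object's page is under-utilized, move
// the object to a fresh allocation so the old page can eventually drain and be
// reclaimed. Note that mi_realloc() with an unchanged size would typically
// return the same pointer, so the move is done explicitly here.
static void* maybe_defrag(void* p) {
  if (p == NULL || !page_is_underutilized(p)) return p;
  size_t sz = mi_usable_size(p);
  void* fresh = mi_malloc(sz);
  if (fresh == NULL) return p;   // keep the old block on allocation failure
  memcpy(fresh, p, sz);
  mi_free(p);
  return fresh;                  // caller must update its pointer
}
```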

What do you think about it?

daanx commented 1 year ago

Ah, I see -- I agree that an allocator cannot generally "solve" the fragmentation problem, since we cannot move objects. I understand now that you would like to give the application itself the opportunity to reallocate objects that sit on a fragmented page.

Very interesting to work on this!

romange commented 1 year ago

@daanx We ended up patching mimalloc with the following patch: https://github.com/romange/helio/blob/master/patches/mimalloc-v2.0.5.patch

It seems to be doing its job: we reduced the committed/used ratio from 1.25 to 1.08.

anthonyalayo commented 4 months ago

> Best is to avoid the fragmentation you see in the first place: do you know the root cause of the application allocating many objects of some size but only keeping few live after a while? If this is somewhat isolated, you could consider using an extra mi_heap and allocate most temporary objects in that heap so they won't fragment the long lived ones.

@daanx I am hitting the same issue as the original poster, and I tried to use an extra mi_heap like you noted. However, the next issue I hit was https://github.com/microsoft/mimalloc/issues/720, where I can't actually use it because I need it in a multi-threaded environment.
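For reference, the extra-heap approach from the quoted advice looks roughly like the sketch below; the caveat raised in #720 is that a mi_heap_t is tied to the thread that created it, so in a multi-threaded setting each thread would need its own temporary heap:

```c
#include <mimalloc.h>

// Segregating short-lived objects into their own heap so they do not fragment
// long-lived allocations. Caveat (#720): a mi_heap_t belongs to the thread
// that created it, so each thread needs its own temporary heap.
static void process_request(void) {
  mi_heap_t* tmp = mi_heap_new();

  void* scratch = mi_heap_malloc(tmp, 64 * 1024);
  // ... use scratch for request-local work ...
  (void)scratch;

  // Releases everything allocated from `tmp` in one call.
  mi_heap_destroy(tmp);
}
```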