microsoft / mimalloc

mimalloc is a compact general purpose allocator with excellent performance.

memory reclamation #333

Open timblechmann opened 3 years ago

timblechmann commented 3 years ago

hi! while doing some investigations into mimalloc performance, i ran into some cases where mimalloc rather aggressively retains memory instead of releasing it back to the OS.

consider these two producer/consumer test cases (1M objects of 1KiB each allocated, then deallocated) from (a) separate threads and (b) the same thread.

the memory behaviour is rather different from glibc's: mimalloc retains a lot of memory, while glibc returns almost all of it to the OS.

in the multi-threaded test, i'm seeing mimalloc hold on to nearly all of the allocated memory even after everything has been freed.

similar behaviour seems to happen in the single-threaded test, though in this case glibc releases all memory immediately.

i've seen this in a real-world program where memory tends to pile up. while that memory can be reused within the process, it's not available to other processes. this makes me wonder whether the heuristics for 'collection' passes could be tweaked to reclaim memory more aggressively?

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>

#include <boost/thread/concurrent_queues/sync_queue.hpp>
#include <boost/thread/barrier.hpp>

#include <mimalloc.h>

// (a) multi-threaded: 1M blocks malloc'ed on a producer thread and freed on a
// separate consumer thread; barriers and sleeps sequence the phases.
int multi_threaded_test()
{
    boost::concurrent::sync_queue<void*> sq;
    boost::barrier producer_done(2), consumer_done(2), consumer_joined(2);

    std::thread producer([&] {
        for(int i = 0; i != 1'000'000; ++i) {
            void *mem = malloc(1024);
            sq.push(mem);
        }

        fprintf(stderr, "producer done\n");
        producer_done.wait();

        fprintf(stderr, "waiting for consumer\n");
        consumer_done.wait();

        std::this_thread::sleep_for(std::chrono::seconds(8));
        fprintf(stderr, "producer destroyed\n");
    });

    producer_done.wait();
    std::this_thread::sleep_for(std::chrono::seconds(5));

    fprintf(stderr, "spawning consumer\n");
    std::thread consumer([&] {
        for(int i = 0; i != 1'000'000; ++i) {
            void *mem = sq.pull();
            free(mem);
        }
        fprintf(stderr, "consumer done\n");
        consumer_done.wait();
        std::this_thread::sleep_for(std::chrono::seconds(16));
        fprintf(stderr, "consumer destroyed\n");
        consumer_joined.wait();
    });

    consumer_joined.wait();
    fprintf(stderr, "joining consumer\n");
    consumer.join();
    std::this_thread::sleep_for(std::chrono::seconds(5));
    fprintf(stderr, "joining producer\n");
    producer.join();
    std::this_thread::sleep_for(std::chrono::seconds(5));
    fprintf(stderr, "collecting\n");
    mi_collect(true);
    std::this_thread::sleep_for(std::chrono::seconds(5));
    return 0;
}

// (b) single-threaded: the same 1M blocks allocated and freed on one thread.
int single_threaded_test()
{
    boost::concurrent::sync_queue<void*> sq;

    for(int i = 0; i != 1'000'000; ++i) {
        void *mem = malloc(1024);
        sq.push(mem);
    }

    fprintf(stderr, "freeing\n");
    for(int i = 0; i != 1'000'000; ++i) {
        void *mem = sq.pull();
        free(mem);
    }
    fprintf(stderr, "done\n");
    std::this_thread::sleep_for(std::chrono::seconds(5));

    fprintf(stderr, "collecting\n");
    mi_collect(true);
    std::this_thread::sleep_for(std::chrono::seconds(5));
    return 0;
}

int main()
{
    return multi_threaded_test();
//    return single_threaded_test();
}
koka-build commented 3 years ago

Thanks for taking the time to post a test program as well. There is currently some work going on to address your specific issue (in the dev-slice branch, so best to test with that). First though, note that all that memory can still be reused by other programs -- it just might be swapped to disk, which is bad for performance but otherwise ok. Second, how much memory is returned to the OS is often quite dependent on the workload -- on many benchmarks mimalloc tends to use less memory than other allocators (there may be no single strategy that performs well everywhere).

Anyway, your test brings out the worst in mimalloc :-) Mimalloc essentially uses segments (large chunks of 4 or 8 MiB) that are "owned" by a specific thread. Only the owner can allocate in such a segment. Other threads can free in it (as in your test, and as in the xmalloc-test in the benchmark suite), but that memory is not released right away -- it is only marked as freed. Only when the owner is actively allocating does it actually reclaim the remotely freed blocks.
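Conceptually, the remote-free path is something like the following sketch (simplified, with hypothetical names -- not mimalloc's actual internals): non-owner threads push freed blocks onto an atomic per-segment list, and only the owner drains it, on its allocation path.

#include <atomic>

struct Block { Block* next; };

struct Segment {
    std::atomic<Block*> remote_frees{nullptr};  // lock-free list of marked blocks

    // Any non-owner thread: only *mark* the block as freed.
    void free_remote(Block* b) {
        Block* head = remote_frees.load(std::memory_order_relaxed);
        do {
            b->next = head;
        } while (!remote_frees.compare_exchange_weak(
                     head, b,
                     std::memory_order_release, std::memory_order_relaxed));
    }

    // Owner thread, on its allocation path: drain the list and reuse or
    // release the blocks. If the owner never allocates again, this never runs.
    Block* collect_remote_frees() {
        return remote_frees.exchange(nullptr, std::memory_order_acquire);
    }
};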

Now, in your test the owner sleeps or waits, so it never gets to release any of the remote frees, and thus no OS memory is returned. This is also a problem in general for mimalloc if a thread allocates a lot but then blocks indefinitely or never frees/allocates again -- this is why mi_collect exists: it is meant to be called intermittently (as mi_collect(false)) by a thread that may be blocked for a long time. If a thread terminates, its OS memory is returned though, as its segments become abandoned and can be claimed by other threads.
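For example, a long-running worker can do something like this (a minimal sketch; the loop and the blocking call are placeholders):

#include <mimalloc.h>

void worker_loop() {
    for (;;) {
        // ... allocate, do work, free ...
        mi_collect(false);       // cheap, non-forced collection of this thread's heap
        // wait_for_next_job();  // placeholder: may block for a long time
    }
}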

I am still thinking of ways to do this better. I think the best option is for a thread to sometimes abandon segments even if it has not terminated -- for example when a segment is full and other threads free in it. That would naturally migrate segments to the owners that are most active. In your example it still would not help though, as the main producer is blocked (and thus cannot even abandon its segments) -- it would need to be done incrementally while it is allocating, but that would impact performance again... tricky. It would be great if we can find a good solution that at least takes the edge off the worst case... tbc

aKabra commented 3 years ago

I am facing a similar issue with mi_malloc. I am using the Microsoft Parallel Patterns Library (PPL). I submit millions of tasks to PPL; it picks a thread from the Windows thread pool and runs the task. In each task I allocate memory using mi_malloc and then free it. Since the memory is cached on the thread, whenever Windows reuses a thread for an incoming task the thread already has the memory allocated, and that improves performance for me. But at the end, when all of my tasks are done, I would like a way to release, from my parent thread, all the memory that is cached on those pool threads. Would there be any solution to this anytime soon? I am trying to convince my team to use mi_malloc, but unless we have a solution to this problem it is hard to adopt this library.
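Roughly, the pattern looks like this (a sketch of my workload; the task count and allocation size are just illustrative):

#include <ppl.h>
#include <mimalloc.h>

int main() {
    // Each task allocates and frees through mimalloc on whichever
    // thread-pool thread PPL runs it on.
    concurrency::parallel_for(0, 1'000'000, [](int) {
        void* p = mi_malloc(1024);
        // ... use p ...
        mi_free(p);
    });
    // Problem: the pool threads keep their cached segments, and there is no
    // call the parent thread can make here to release all of that memory.
    return 0;
}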

albi-k commented 3 years ago

I'm also noticing this issue, and I had even noticed it with the default allocator prior to mimalloc. Before, I would deal with it by calling malloc_trim at certain times; however, after dynamically overriding to mimalloc, malloc_trim no longer has an effect. It seems like the mi_collect call serves a similar purpose. For users of the dynamic override method, who can't use the "mi_" calls, could you override malloc_trim as well? Please let me know if you'd rather I file a separate ticket.
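I'm imagining something along these lines (a hypothetical sketch, not existing mimalloc code; the return convention follows glibc's malloc_trim):

#include <cstddef>
#include <mimalloc.h>

// Hypothetical override: forward the glibc entry point to a forced collect.
extern "C" int malloc_trim(size_t pad) {
    (void)pad;         // mimalloc has no equivalent of the glibc pad hint
    mi_collect(true);  // force collection, returning freed memory to the OS
    return 1;          // glibc returns 1 if any memory was released
}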

come-raczy commented 2 years ago

I think our applications are hitting this type of situation as well, and v2.0.3 is noticeably worse than v1.6.7 in terms of overall memory usage. The overhead compared to the default allocator is approximately +65% for v1.6.7 and +80% for v2.0.3. Sorry, I can't reduce the application to a test case that would pinpoint the exact allocation patterns causing the issue. Overall I am talking about heavily multithreaded applications with a broad diversity of allocation sizes and thread lifetimes.

I believe a lot of the issue has to do with long-lived threads in thread pools that end up hogging the memory that was allocated for their biggest job and never releasing it to the system (but I might be wrong). Even though that ends up being unused memory that could easily be swapped out, it is still an issue on systems that have no swap space, or where a job monitor kills jobs that use excessive memory.

Altogether, total memory usage is on the order of 70 GB per job with the default allocator (115 GB with mimalloc v1.6.7 and 125 GB with mimalloc v2.0.3). Each job can have on the order of 50 threads at any given time.

daanx commented 2 years ago

@come-raczy: yikes, that is not good. I will re-investigate this issue. Some background: usually (and perhaps especially) on large services we see the opposite, where mimalloc does really well and often uses less memory. So there is some particular situation where this starts happening -- the initial post is a bit old and we have made improvements since, but as you say, it may still have to do with threads holding on to memory that should be decommitted. I am especially surprised to see v2.0.3 doing worse here, as it should be the opposite :-(. 50 threads is not out of place; many services here use 1200+ threads and do fine. Both v1 and v2 use a delay before decommitting memory pages (as decommit is expensive) -- I wonder if we get into trouble when a thread releases a lot of memory and then immediately blocks; there is then no opportunity to decommit all that memory. This can perhaps happen in a thread-pool implementation. I will need to study this a bit more and see what we can do.

ps. In the meantime, if it is not too difficult or too much work for you to run tests, could you try MIMALLOC_EAGER_COMMIT=0, MIMALLOC_RESET_DECOMMITS=1, and a small MIMALLOC_RESET_DELAY?
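For reference, the same knobs can also be set programmatically via the option API in the v1.x/v2.0.x headers (a sketch; options should generally be set before the first allocation):

#include <mimalloc.h>

int main() {
    // Programmatic equivalents of the environment variables above.
    mi_option_set(mi_option_reset_delay, 1);      // MIMALLOC_RESET_DELAY=1
    mi_option_set(mi_option_eager_commit, 0);     // MIMALLOC_EAGER_COMMIT=0
    mi_option_set(mi_option_reset_decommits, 1);  // MIMALLOC_RESET_DECOMMITS=1
    // ... run the application workload ...
    return 0;
}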

come-raczy commented 2 years ago

@daanx I have collected the data at two reliable checkpoints within the application: the first about 5 minutes after starting, the second 10 minutes later.

configuration                      ~5 min   ~15 min
default allocator                   50 GB     70 GB
v1.6.7 defaults                     97 GB    113 GB
v1.7.3 defaults                     90 GB    106 GB
v2.0.3 defaults                    107 GB    122 GB
v2.0.3 MIMALLOC_RESET_DELAY=1      104 GB    119 GB
v2.0.3 MIMALLOC_EAGER_COMMIT=0      89 GB    106 GB
v2.0.3 MIMALLOC_RESET_DECOMMITS=1   89 GB    106 GB

MIMALLOC_RESET_DECOMMITS=1 didn't seem to make much difference on v1.7.3.

I believe that before the first checkpoint there are large chunks of memory that don't get returned to the OS. I am not a developer on the application, so I can't provide much detail on the memory allocation patterns, but there is probably a mixture of large mallocs, memory mapping, and many small allocations. I wouldn't be surprised if the allocations and deallocations happen on different threads.

This runs on CentOS 7. Everything is compiled with GCC 9.2.

Let me know if there is any additional info that you would need.