microsoft / mimalloc

mimalloc is a compact general purpose allocator with excellent performance.
MIT License

Get fresh global statistics #265

Open hp48gx opened 4 years ago

hp48gx commented 4 years ago

When you access the statistics, it seems you have two options: either print the "global" statistics with mi_stats_print_out, or have each thread print its own statistics. Neither option is ideal for us. We have a large pool of worker threads that never return (global statistics seem to be updated only when a thread exits), and each of them may allocate/free memory at any time. You can send a query to the process, but it will be served by a random thread in a different pool, so if you ask for the stats you get back a list of very small numbers while the process itself is using gigabytes.

One option could be to enumerate all the stats objects of all heaps/threads and merge them on request into a temporary (without resetting them), but it's a bit unclear how to achieve this.

For the moment, we added a very crude grand total with a very simple patch: just to give you an idea, in alloc.c:

volatile _Atomic(uintptr_t) grandtotal = 0;

extern inline mi_decl_restrict void* mi_malloc(size_t size) mi_attr_noexcept {
  mi_atomic_add(&grandtotal, (uintptr_t)size);
  return mi_heap_malloc(mi_get_default_heap(), size);
}

extern inline size_t mi_total_malloc_size() mi_attr_noexcept {
  // not ideal, but it should work everywhere
  return (size_t)mi_atomic_add(&grandtotal, 0);
}

and later in mi_free we subtract mi_page_usable_block_size(page).

however, I'd like to get an opinion on this, and I'm not totally sure of which entry points I should touch to update the grand total (e.g. mi_heap_malloc_small...).

(p.s.: there's a typo in the documentation: [https://microsoft.github.io/mimalloc/group__extended.html#ga256cc6f13a142deabbadd954a217e228], where mi_stats_print_out is incorrectly called mi_stats_print)

asl commented 4 years ago

@hp48gx we faced the same issue in https://github.com/ablab/spades/. For us the situation is even worse, as we are using OpenMP and it is the runtime that manages the pool of worker threads. So I had to add a few workarounds here and there just to combine the statistics from all the threads.

Maybe I should clean up my approach and submit it as a PR here so others can chime in and reuse it, if necessary.

hp48gx commented 4 years ago

At the moment we have basically followed a different approach: as every worker thread has a task queue, we inject into all queues a special task that executes mi_stats_merge(), and then we print the global stats. In practice this means that if you query the stats twice, it's likely that the threads picked up the merge task in between, so the second result is going to be roughly correct.
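For illustration, a minimal sketch of that idea, assuming a hypothetical application-side task queue API (worker_queue_t and worker_queue_push are stand-ins for whatever queue the worker pool uses; only mi_stats_merge and mi_stats_print_out are mimalloc calls):

#include <mimalloc.h>
#include <stddef.h>

// hypothetical application task queue, not part of mimalloc
typedef struct worker_queue_s worker_queue_t;
void worker_queue_push(worker_queue_t* q, void (*task)(void*), void* arg);

// task enqueued on every worker: merge this thread's local stats into the global stats
static void merge_stats_task(void* arg) {
  (void)arg;
  mi_stats_merge();
}

// ask every worker to merge; once the tasks have run, a subsequent
// mi_stats_print_out() reflects roughly up-to-date global statistics
void request_fresh_stats(worker_queue_t* const queues[], size_t queue_count) {
  for (size_t i = 0; i < queue_count; i++) {
    worker_queue_push(queues[i], &merge_stats_task, NULL);
  }
}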

However, as this merge-task approach is both overkill and convoluted, we would still push for a very simple approach like the "grand total" we suggested previously. It's very simple to implement, but the list of "entry points" to touch is not trivial.

asl commented 4 years ago

at the moment, we basically followed a different approach: as every worker thread has a task queue, we inject in all queues a special task that executes mi_stats_merge() and then, we print the global stats.

Right, this is similar to what we are doing. The difference is that we know when threads are idle, so we can merge the statistics then.

daanx commented 4 years ago

Hi all, really appreciate the feedback! And I agree, a "grand total" statistic would be good to have. I didn't appreciate before how useful this could be and focused the initial design on final statistics at program termination -- let's fix this.

However, in initial benchmarks it turns out that maintaining statistics can be expensive if you contend with other threads (through atomic operations on shared memory) so it is not entirely trivial to do with the current design where there is generally no communication between threads or even a global structure that makes the thread local heaps visible to each other.

Currently, I think there are two ways to tackle this:

  1. In the allocator "slow" path, each thread does a statistics merge (perhaps once for every 10 slow paths, or once at least 1 MiB has been allocated, or something similar). This would give "delayed" statistics; one would see the statistics as they were a short while ago. I think this is actually perhaps the best and easiest solution. The good part is that, due to the slow-path mechanics, the statistics for a thread that is busy allocating are more up to date by construction, and it would give a conservative lower bound on the "real" usage versus the reported usage (as in "at most 1 MiB error per thread"). I think I will try this first and benchmark the impact.

  2. Create an API to iterate through the statistics metadata of each thread so one can sum them up; however, this would create a data race with each thread concurrently updating its statistics :-( Also, it would require that all thread-local heaps are reachable through a global data structure, which would require a lock (and currently mimalloc is lock-free).

daanx commented 4 years ago

Ah, and related to this: it seems the full statistics are nice for debugging, but here the request is more about monitoring the current memory usage. What would be a good minimal set that does not expose implementation internals? I'm thinking about exposing those programmatically, like:

struct mi_usage_s {
  size_t  reserved;               // virtual reserved address space (but not necessarily used)
  size_t  committed;              // committed memory     (e.g. may swap; this is what is reported in the task manager/top)
  size_t  allocated_current;      // current live allocated   (approximated in release mode as we cannot maintain it per allocation)
  size_t  allocated_peak;         // peak live allocated       (approximate)
  size_t  allocated_total;        // total of all allocations  (approximate)
};

Hmm, thinking about this a bit more, one can get the reserved and committed info also directly from the OS (see mi_process_info in stats.c). I wonder if that is already enough then?

daanx commented 4 years ago

I just tried approach 1 but it does affect the benchmarks (just a tiny bit, under 1%). But I realized we can already use this approach with the current API if we need to track ongoing statistics: we can use mi_register_deferred_free to do this. In main, use:

mi_register_deferred_free(&deferred_fun, NULL /* arg */);

where

static void mi_cdecl deferred_fun(bool force, unsigned long long heartbeat, void* arg) {  
  if (force || (heartbeat%8) == 0) {  // about every 512KiB or 8 larger allocations
    mi_stats_merge();
  }
}

This will cause all threads to merge ongoing statistics and the mi_stats_print will reflect fairly recent statistics. Would this work for your scenario?

daanx commented 4 years ago

Secondly, I guess we can expose the now internal mi_process_info:

void mi_process_info(mi_msecs_t* utime, mi_msecs_t* stime, size_t* peak_rss, size_t* page_faults, size_t* page_reclaim, size_t* peak_commit) {

This returns OS-traced information about user time, system time, peak working set, page faults, soft page faults, and peak commit size. I guess we should also add the current working set (rss) besides the peak one?

hp48gx commented 4 years ago

Speaking for myself, I'd say "allocated_*" would be enough. If I understand correctly, allocated_total is the amount of memory that mimalloc requested from the OS, and allocated_current is the amount of memory that the application is using (so, current <= total).

As a rule, you are totally right about the contention: if you have many threads, you should not let them all touch the same atomic variable all the time. However, I guess it should not be too hard to create a global linked list of structures, each with a few atomic integers, an atomic pointer and some padding. Every thread has its own node, which is attached at the head of the list at thread creation time (which you may assume is an infrequent operation), and the main "get statistics" function just iterates the list and sums all the atomics. All lock-free.

If you prefer a simpler implementation, just have a global array of atomics of fixed length (say 256), and have every thread update counter[hash(threadId) % 256]; the main stats function would just blindly sum counter[0]... counter[255].
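A rough sketch of that simpler variant, using standard C11 atomics rather than mimalloc's internal helpers (all names here are invented for illustration; the linked-list variant above differs only in how the per-thread slots are found):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_COUNT 256

// one padded slot per bucket so hot counters don't share a cache line
typedef struct {
  _Atomic(int64_t) value;
  char pad[56];
} padded_counter_t;

static padded_counter_t counter[SLOT_COUNT];

static inline size_t slot_of(uintptr_t thread_id) {
  return (size_t)((thread_id * 2654435761u) % SLOT_COUNT);   // cheap hash of the thread id
}

// called on allocation / free in the hot path
void count_alloc(uintptr_t thread_id, size_t size) {
  atomic_fetch_add_explicit(&counter[slot_of(thread_id)].value, (int64_t)size, memory_order_relaxed);
}
void count_free(uintptr_t thread_id, size_t size) {
  atomic_fetch_add_explicit(&counter[slot_of(thread_id)].value, -(int64_t)size, memory_order_relaxed);
}

// the stats function blindly sums counter[0] ... counter[SLOT_COUNT-1]
int64_t current_allocated(void) {
  int64_t total = 0;
  for (size_t i = 0; i < SLOT_COUNT; i++) {
    total += atomic_load_explicit(&counter[i].value, memory_order_relaxed);
  }
  return total;
}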

Feel free to contact me if you want to discuss, but I'm pretty sure you got the idea.

hp48gx commented 4 years ago
mi_register_deferred_free

That's a good point. I'll need to check precisely when deferred_free triggers, but it looks promising. I think the next thing needed to make it perfect would be access to the struct containing the statistics, instead of the text report.

daanx commented 4 years ago

I hope mi_register_deferred_free works for you. I have given it more thought and I think it is actually a very nice solution; maybe it should be built in (with an option to control how often it is called). The only drawback is that it incurs a tiny overhead, and I wish we could limit that to only the point where stats_print is called.

hp48gx commented 4 years ago

it should be built in (with a option to control how often it is called).

I guess that's the "% 8", isn't it?

The only drawback is that it incurs a tiny overhead and I wish we could limit that to only the point where stats_print is called.

I didn't check the code, but I suspect mi_stats_merge uses atomics when merging into the global stats, hence the overhead; is that right? I'm still convinced an approach like the one I described (e.g. 256 counters) would prove superior in performance, but you could also combine the two (e.g. let the counters be updated from deferred_free).

daanx commented 4 years ago

I just pushed commit 46ee895 which exposes:

void mi_process_info(size_t* user_msecs, size_t* system_msecs, size_t* current_rss, size_t* peak_rss, size_t* current_commit, size_t* peak_commit, size_t* page_faults);

This may help in getting global statistics while the program is running. Also, the main OS-related statistics are now tracked directly in the main stats (_mi_stats_main) and thus don't need per-thread statistics to be merged. Hope this helps.
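For example, a monitoring thread can poll it roughly like this (based on the signature above; the exact parameter list may differ in other mimalloc versions):

#include <mimalloc.h>
#include <stdio.h>

void log_process_info(void) {
  size_t user_msecs = 0, system_msecs = 0;
  size_t current_rss = 0, peak_rss = 0;
  size_t current_commit = 0, peak_commit = 0;
  size_t page_faults = 0;
  mi_process_info(&user_msecs, &system_msecs, &current_rss, &peak_rss,
                  &current_commit, &peak_commit, &page_faults);
  printf("rss: %zu (peak %zu), commit: %zu (peak %zu), faults: %zu\n",
         current_rss, peak_rss, current_commit, peak_commit, page_faults);
}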

matgrioni commented 1 year ago

I have a similar use case to the OP of this issue. I have a long-running Windows service and want to periodically get information about how much memory mimalloc is using (not all memory in the application comes from mimalloc; some comes from other calls to VirtualAlloc, so it is valuable to know how much memory is coming from each source). Thread pools are used in the service, and therefore thread-local stats do not regularly get merged back into the global stats.

I tried something similar to what @daanx suggested by adding to my entry point in main.cpp:

static void mi_cdecl deferred_fun(bool force, unsigned long long heartbeat, void* arg) {  
  if (force || (heartbeat%8) == 0) {  // about every 512KiB or 8 larger allocations
    mi_stats_merge();
  }
}

mi_register_deferred_free(&deferred_fun, nullptr);

Then I have another thread which regularly calls mi_stats_print_out and whose callback just logs the stats message to stdout.
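For reference, a minimal sketch of that monitoring side, assuming the standard mi_output_fun callback signature (the logging here is simply fputs to stdout):

#include <mimalloc.h>
#include <stdio.h>

// output callback for mi_stats_print_out: forward mimalloc's stats text to stdout
static void mi_cdecl stats_to_stdout(const char* msg, void* arg) {
  (void)arg;
  fputs(msg, stdout);
}

// called periodically from the monitoring thread
static void dump_stats(void) {
  mi_stats_print_out(&stats_to_stdout, NULL);
}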

At first the stats look good and within the first couple of minutes, the reserved/commit "current" values peak. However, over time, the "freed" value outpaces "total". Eventually "freed" exceeds "total" and current usage is reported as negative. An example output is below.

heap stats:     peak      total      freed    current       unit      count
  reserved:    45.1 gb   289.8 gb   281.6 gb     8.2 gb       1 b              not all freed!
 committed:    29.2 gb   286.1 gb   291.1 gb    -4.9 gb       1 b              ok
     reset:       0 b        0 b        0 b        0 b        1 b              ok
   touched:   628.4 mb   819.5 mb   182.3 gb  -181.5 gb       1 b              ok
  segments:    11.4 k     12.8 k      6.1 k      6.6 k                         not all freed!
-abandoned:       4          4          4          0                           ok
   -cached:       0          0          0          0                           ok
     pages:     2.2 m      2.5 m      2.3 m    186.2 k                         not all freed!
-abandoned:      27         27         27          0                           ok
 -extended:       0
 -noretire:       0
     mmaps:       0
   commits:    79.9 k
   threads:     2.5 k      2.8 k      292        2.5 k                         not all freed!
  searches:     0.0 avg
numa nodes:       2
   elapsed:    2340.504 s
   process: user: 25821.656 s, system: 1235.953 s, faults: 39850487, rss: 93.5 gb, commit: 159.0 gb

I confirmed that the deferred free callback is in fact being called across many different threads, and this is the case whether MI_STAT is 0 or 1.