asl opened this issue 5 years ago
Hi @asl!
I think there are ~3 things going on here. First, there appears to be a good amount of memory Mesh can reclaim on SPAdes (over a GB), neat!
Second, you're right: we're hitting limits around vm.max_map_count. This is a little confusing and a bug -- we try to explicitly avoid hitting this limit, but I think our existing code to do so is too naive. On startup, Runtime::initMaxMapCount() looks at /proc/sys/vm/max_map_count and sets a limit on the number of meshes based on max_map_count. The comment for kMeshesPerMap says:
// if we have, e.g. a kernel-imposed max_map_count of 2^16 (65k) we
// can only safely have about 30k meshes before we are at risk of
// hitting the max_map_count limit.
static constexpr double kMeshesPerMap = .457;
BUT, we only check that at the start of GlobalHeap::meshAllSizeClasses() -- if we find too many spans to mesh we could end up in the Danger Zone. I've opened up #38 to specifically track this.
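For illustration, here is a minimal sketch of that idea (not the actual Runtime::initMaxMapCount() code): read the kernel limit from /proc/sys/vm/max_map_count and scale it by kMeshesPerMap to get a mesh budget.

```cpp
// Sketch only: derive a mesh budget from the kernel's per-process mapping
// limit. Assumes /proc/sys/vm/max_map_count holds a single integer (Linux).
#include <cstddef>
#include <cstdio>

static size_t readMaxMapCount() {
  unsigned long long v = 65530;  // common Linux default, used as a fallback
  if (FILE *f = std::fopen("/proc/sys/vm/max_map_count", "r")) {
    if (std::fscanf(f, "%llu", &v) != 1)
      v = 65530;
    std::fclose(f);
  }
  return static_cast<size_t>(v);
}

int main() {
  constexpr double kMeshesPerMap = .457;  // the factor quoted above
  const size_t maxMapCount = readMaxMapCount();
  const size_t meshBudget = static_cast<size_t>(maxMapCount * kMeshesPerMap);
  std::printf("max_map_count = %zu -> allow at most ~%zu meshes\n",
              maxMapCount, meshBudget);
  return 0;
}
```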
Third, we allocate our (sparse) arena at program startup, which gives us a lot of simplicity. Earlier in development, the Ubuntu system I was on had trouble core-dumping the arena - it seemed to insist on filling the (mostly empty) virtual mapping of the arena with zeros on a crash.
Before Friday, the arena size was 8 GB, which is too small for your ~30 GB working set. I've increased the arena to 64 GB in the latest commit to master - let me know if this enables SPAdes to run correctly. I've opened #39 to track this specific issue.
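To make the sparse-arena point concrete, here is a hedged sketch (not Mesh's actual arena code) showing that reserving a large range of virtual address space up front costs no physical memory until pages are made accessible and touched:

```cpp
// Sketch only: reserve a large, sparse arena up front. With PROT_NONE and
// MAP_NORESERVE, the mapping consumes virtual address space, not RAM.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t kArenaSize = 64ULL * 1024 * 1024 * 1024;  // 64 GB of address space
  void *arena = mmap(nullptr, kArenaSize, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (arena == MAP_FAILED) {
    std::perror("mmap");
    return 1;
  }
  std::printf("reserved %zu bytes of virtual address space at %p\n",
              kArenaSize, arena);
  return 0;
}
```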
Mesh can reclaim on SPAdes (over a GB), neat!
How can I see it? Is this the "Meshed MB HWM" value?
/proc/sys/vm/max_map_count and sets a limit on the number of meshes based on max_map_count
Oh, well... This does not smell good :) SPAdes uses (file) memory maps here and there, though typically it's something around 10 * # of threads, so it should be below 1000 on almost any sane system.
Before Friday, the arena size was 8 GB, which is too small for your ~ 30 GB working set. I've increased the arena to 64 GB in the latest commit to master - let me know if this enables SPades to run correctly. I've opened #39 to track this specific issue.
I believe it's quite important not to have a hard-coded arena size. We could easily utilize, say, 1 TB of RAM in hard cases ;) The actual working set should be around ~60 GB for this particular dataset, iirc. Sadly, Mesh now just fails to allocate anything and therefore throws std::bad_alloc.
More information about that std::bad_alloc: it seems Mesh failed to fulfill a request to allocate 28 GB as one piece.
And indeed, in
void *GlobalHeap::malloc(size_t sz) {
we're seeing that Mesh is unable to allocate more than 2 GB of memory in a single chunk:
if (unlikely(pageCount * kPageSize > INT_MAX)) {
  return nullptr;
}
Really? :)
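For concreteness, here is a quick worked check (kPageSize = 4096 is assumed here, not taken from Mesh) of why a 28 GB request trips that guard, which effectively caps any single allocation at about 2 GiB:

```cpp
// Sketch only: why a ~28 GB request fails the INT_MAX guard quoted above.
#include <climits>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t kPageSize = 4096;                 // assumed page size
  const size_t sz = 28ULL * 1024 * 1024 * 1024;  // the ~28 GB request
  const size_t pageCount = (sz + kPageSize - 1) / kPageSize;
  const size_t bytes = pageCount * kPageSize;
  std::printf("pageCount * kPageSize = %zu, INT_MAX = %d\n", bytes, INT_MAX);
  std::printf("guard trips (malloc returns nullptr): %s\n",
              bytes > INT_MAX ? "yes" : "no");
  return 0;
}
```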
I opened #40 to track this issue
hah, yeah... Thanks for the separate tracking issue :)
And agreed on not requiring a fixed max; it is just that having a single range of virtual address space greatly simplifies the implementation. I know Go does (or used to do) a similar thing. This LWN article seems to describe this exact problem: https://lwn.net/Articles/428100/
Hey there. Are there any future plans for fixing this issue? This project has great potential, and I can see that a lot of effort has been put into it to solve this "fragmentation" nightmare we all battle against in long-running server jobs. Unfortunately, this issue seems to me like a showstopper that keeps one from using this library.
Could you also explain why the arena size can't be set to "INFINITY" (2^64)? I mean, why is a constraint on the size even needed? I'm also not sure I get why Ubuntu tries to dump memory that was never malloc'd or was already freed.
bump @bpowers
@brano543 can you describe what actual issues you are running into? The max heap size is now 64 GB, "which ought to be enough for anybody". Please let us know if you are running into issues with this in practice and we can prioritize working on it, but please don't avoid trying Mesh because of perceived limitations.
There are two main reasons we didn't set it larger from the get-go -- the first is that some tools (like the crash reporting software on Ubuntu) choked on very large virtual memory mappings. The behavior we were seeing was: Mesh would allocate 64 GB of virtual address space. A program would allocate a few hundred MB and then crash. The core dump parser wasn't smart enough to understand that 63.5 GB of that virtual address space was never allocated or backed by real RAM, and would try to create a core dump file for sending to Ubuntu filled with 63.5 GB worth of 0s. There is a madvise flag, MADV_DONTDUMP, that should help with this, but at the time, I had trouble integrating it in a way that didn't hurt performance.
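For reference, a minimal sketch of that approach (assumes Linux 3.4+ where MADV_DONTDUMP is available), marking a large, mostly-untouched mapping so it is excluded from core dumps:

```cpp
// Sketch only: exclude a big, sparse mapping from core dumps.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t kArenaSize = 64ULL * 1024 * 1024 * 1024;
  void *arena = mmap(nullptr, kArenaSize, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (arena == MAP_FAILED) {
    std::perror("mmap");
    return 1;
  }
  // Ask the kernel not to include this range in core dumps.
  if (madvise(arena, kArenaSize, MADV_DONTDUMP) != 0)
    std::perror("madvise(MADV_DONTDUMP)");
  return 0;
}
```

Ranges that are actually in use could later be re-enabled for dumping with MADV_DODUMP.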
The second reason is that we have some ancillary data structures we allocate (like lookup tables) that depend on the size of the arena. I think this is a smaller issue, as they will "just" use up some extra virtual address space.
Thanks for the clarification. This effectively rules Mesh out for SPAdes, as we're routinely allocating more than 64 GB of RAM. Apparently no other memory allocator we are aware of has such a limitation.
How much memory do you allocate? I feel like this is something that could be made a build-time parameter.
As much as necessary. We could allocate 0.5 TB, we could allocate 1 TB. It depends on the input.
And to be clear, you mean that the actual physical footprint of the app in RAM is ~1 TB, correct?
It might be 100 MB, it might be 4 GB, it might consume 1 TB. Everything depends on the input.
@asl if you increase this constant here: https://github.com/plasma-umass/Mesh/blob/master/src/common.h#L104 from 64 to 2000, that should bump the max heap up to 2 TB. I would be eager to hear how this works for you! If things seem to work fine, I can do some testing on some much smaller systems, and see about making that the default.
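Assuming the constant has the same shape as the declaration quoted later in this thread, the suggested bump would look roughly like this (a sketch; the exact line in src/common.h may differ):

```cpp
// Sketch only: bump the arena from 64 GB to ~2 TB.
#include <cstddef>
#include <cstdio>

// before: static constexpr size_t kArenaSize = 64ULL * 1024ULL * 1024ULL * 1024ULL;  // 64 GB
static constexpr size_t kArenaSize = 2000ULL * 1024ULL * 1024ULL * 1024ULL;  // ~2 TB

int main() {
  std::printf("kArenaSize = %llu bytes (~%llu GB)\n",
              static_cast<unsigned long long>(kArenaSize),
              static_cast<unsigned long long>(kArenaSize >> 30));
  return 0;
}
```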
I'll also talk to @emeryberger - my intuition is that having a single, non-growable heap makes parts of the implementation significantly easier, but maybe I'm overthinking it.
Well, for us we'd then need something like a run-time setting, e.g. the user could specify the maximum amount of memory he/she could use.
@bpowers So, I tried again on a small dataset (with expected memory consumption of less than 10 GB). Unfortunately, I had to lower kMeshesPerMap down to 0.1, otherwise it did not work. So I guess #38 is really a blocker.
I also ran into this:
Mesh: arena exhausted: current arena size is 64 GB; recompile with larger arena size.
The system has 128 GB of RAM, and the application uses more or less 128 GB of RAM for the given input. I would also like to test it on another machine with up to 1 TB of RAM (and use all of it).
I have three questions. I changed the arena size constant to
static constexpr size_t kArenaSize = 128ULL * 1024ULL * 1024ULL * 1024ULL;
Is this correct? Do I need to change anything else?

For a few projects (actually implementations of programming languages for HPC, etc.) I wanted to propose using Mesh. But any such compile-time limitation puts me off. It's simply impossible to use: many ordinary desktop systems have more than 64 GB of RAM nowadays, ordinary servers typically have several terabytes, and special systems tens or even small hundreds of terabytes.
Any plans on removing this limitation or making it at least a run-time setting?
@dumblob The project seems to be abandoned. We (SPAdes) are using mimalloc now.
@asl aha, ok. That's a pity anyway.
Btw, I actually wanted to test Mesh against mimalloc, and I'm pleased to hear mimalloc serves your purpose well (my experience with mimalloc is also basically only very positive).
The lead PhD student on this project, @bpowers, has moved to industry and recently had a child, so he has been otherwise quite occupied :).
In any event, this particular issue got lost; sorry about that.
I see - then I wish all the best to @bpowers et al.
Should this project get resurrected at some point, I'll try to keep an eye on it.
Hello
I'm trying to benchmark SPAdes (https://github.com/ablab/spades) with Mesh. Currently SPAdes runs fine with both tcmalloc and jemalloc (and has jemalloc embedded for the sake of completeness). On a reasonably small dataset (with memory consumption of around ~30 GB) I'm seeing:
Some quick debugging revealed that Mesh tried to mprotect lots of 4k pages. As a result, mprotect() at some point returns ENOMEM. On my system I have:
If I increase vm.max_map_count to 655300 (I'm lucky to have sudo access; the majority of users don't), then the assertion goes away and just std::bad_alloc is thrown. Here is the MALLOCSTATS=1 report, just in case:
But to me it looks like there is some huge design flaw somewhere, as the number of memory mappings is a limited resource and one simply cannot mmap / mprotect each page individually.
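To illustrate the limit being hit (a standalone repro sketch, not SPAdes or Mesh code): every mprotect() that gives a page a different protection than its neighbours splits the mapping into additional VMAs, and once the process exceeds vm.max_map_count, mprotect() starts returning ENOMEM.

```cpp
// Sketch only: per-page mprotect() calls multiply VMAs until the kernel's
// vm.max_map_count limit is reached and ENOMEM is returned.
#include <sys/mman.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

int main() {
  const size_t kPage = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  const size_t kPages = 200000;  // enough to exceed the default limit of 65530
  char *base = static_cast<char *>(
      mmap(nullptr, kPages * kPage, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0));
  if (base == MAP_FAILED) {
    std::perror("mmap");
    return 1;
  }
  // Flipping every other page to a different protection splits the single
  // mapping into many small VMAs.
  for (size_t i = 0; i < kPages; i += 2) {
    if (mprotect(base + i * kPage, kPage, PROT_READ) != 0) {
      std::printf("mprotect failed at page %zu: %s\n", i, std::strerror(errno));
      return 0;
    }
  }
  std::printf("all mprotect calls succeeded\n");
  return 0;
}
```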