Add alloc shims - Githubissues

kormang commented 2 years ago

The problem

Some programs that require more optimized dynamic memory allocation, especially in multithread environments, use alternative implementations of malloc, such as tcmalloc or jemalloc. They replace system malloc with alternative implementation.
Calls to dlsym (from resolve_some_symbol functions in real.cpp, in particular) sometimes require dynamic memory allocation (specifically, by calling calloc).
Alternative malloc implementations call some of the functions wrapped in real.cpp, which results in calls to dlsym, which ends up with either stack overflow (infinite recursion), or deadlock.

The solution

Provide shim, yet another implementation of malloc, that is preloaded as as separate library from libcoz. That makes it optional. It is perfectly reasonable to assume that there are certain conditions, which are difficult to predict in advance, under which such shim would actually break things instead of improving them. Programs that don't need it, and don't want risk of breaking something because of the shim itself, can simply not load it. Programs that use jemalloc, or tcmalloc, and can not be profiled with coz without this shim. Such programs can preload this library, and benefit from it.

Shim should call real malloc most of the time, and with as little overhead as possible. Shim should call its own malloc implementation, only while resolving real symbols using libdl, only from the thread that actually resolves a symbol.

Implementation details

Shims are located in alloc_shims.cpp, together with two instrumentation function (coz_lock_and_set_dummy_alloc_shims and coz_restore_real_alloc_shims_and_unlock), they end up being compiled into separate library - libcozallocshims.so.

The two instrumentation functions are weakly referenced from libcoz, and when libcozallocshims is loaded, they are called before and after calls to dlsym in resolve_some_symbol functions in real.cpp. When libcozallocshims is not loaded, they are nullptrs in libcoz, and they are not called, so there is no effect on running program except for comparing these functions to nullptrs, which happens very limited number of times during runtime of the program.

Also, when libcozallocshims is not loaded, malloc and calloc it provides are not loaded either, so there is no effect on the program being profiled, they use real malloc directly.

By default shims are not loaded, to preload this library user should pass --with-alloc-shims command line switch to coz python script, which will add it to LD_PRELOAD.

Library tries to find real malloc when either program calls malloc for the first time (which results in first_malloc function being called), or when first time symbol is resolved in real.cpp (which ever happens first).

Library insures that its own malloc is called only while resolving symbols with dlsym in real.cpp, and only by that thread. Other threads during that time should end up calling real malloc. To simplify concurrent code, the two instrumentation functions use spinlock, to ensure that only one thread resolves symbols in real.cpp at a time. But this doesn't happen very often and usually there should never be congestion there, we just want to make sure.

Because this alternative malloc is used only for dlsym calls during symbol resolution in real.cpp, it can be really simple, and it doesn't have to use large memory pools. That is why it is called dummy_malloc.

kormang commented 2 years ago

Hi, again!

Another PR from me. This seems quite common problem for people using coz. In fact, I've encountered it the first time I tried to use it.

It solves problems #98 and #176. And, if I may say, in a pretty safe way.

kormang commented 2 years ago

There is also another PR #189, that tries to solve the problem. It also uses the standard approach to overriding malloc with LD_PRELOAD (providing its own dummy implementation), and that is OK, that is standard approach. However, it is based on false assumptions that it is enough to provide it during coz initialization (which might not always be true), and that memory allocated during initialization will never be deallocted later. On the other hand, not all allocations during initialization need to be allocated using dummy implementation. It is also falsely assumed that there is no bug in the implementation of malloc shim, and that no program that used to get profiled with coz before, could now crash because of the provided shim. Also, it has larger performance impact due to if condition in malloc.

kormang commented 2 years ago

How to review this PR?

The change is not that big as it seems. The implementation of the shim logic itself, is contained in single file with 200 lines of code, with big portion of it being comments actually. However, it might be intimidating to accept such change, as although there is not much code, the logic is rather tricky and it has to deal with concurrency, which is always error prone.

First, take a look at other files (not alloc_shims.cpp) to make sure that this change won't make any effect on programs that could be profiled without it. As it is contained in a separate library that is not preloaded by default, we can be safe, and sure that nothing will break.

Then take a look at alloc_shims.cpp.

What I have tested?

It is reasonable to assume that there is some situation I haven't thought about, or failed to get it right. I've made all the reasonable efforts to make sure this works even for programs that don't require it and usually will not load this library. I've tested it on couple of programs using tcmalloc, partition alloc from google, jemalloc, and also with standard malloc with preloading shim library. I've made all the assumptions I could think of, about situations that I haven't encountered, but which could happen in production. I've spent days, tweaking and adjusting code, to make it as safe, simple, and reliable as possible. It doesn't effect programs that don't need it (by default), and it does make programs that use tcmalloc or jemalloc, able to be profiled (with --with-alloc-shims), which wasn't possible before.

I've tested make install, and running installed version.

kormang commented 2 years ago

@ccurtsinger Hello. Since you're assigned to the related bug, and since it seems to be a common problem, could you please take a look? At least so that I know it will be considered. Thanks.

VGubarev commented 2 years ago

Hello, @kormang! I could not deny that your approach solves issue. And that it is more precise and safe for issue that bothers both of us. I'd be just ok to have your solution in COZ. Though engineer' curiosity will not let ignore critique of my own PR :) It's not for arguing but pure interest.

It's been few weeks since I revisited it last time but I'll try:

AFAIR, COZ is initialized right before user main. By this time, all the statics (including allocator) are initialized by runtime. So dummy implementaton will be disabled by the time program enters its own main.
I assumed only allocations matter and must be preserved. Stub deallocation does not affect program in any way.
Didn't really get your point on bug in dummy allocator as your program has similar implementation.
I'm really obsessed of talks about (non)predictable indirrect call vs (non)predictable branch but when we're talking about malloc it just does not make any sense to discuss because malloc itself is extremely expensive when compared to our subject.

emeryberger commented 2 years ago

Please take a look at https://github.com/emeryberger/Heap-Layers/blob/master/wrappers/heapredirect.h to see if we can use that instead (I'd prefer to use an existing, time-tested shim).

kormang commented 2 years ago

@VGubarev Hello, I'll share my thoughts with you in the comments in your PR.

@emeryberger Thanks for suggestion. I've only reacted with thumbs up on your comments when I've read it, didn't have time for much more than that. :) I'll try today to figure out what it is, and how to use it, and get back to you.

kormang commented 2 years ago

@emeryberger I have looked into https://github.com/emeryberger/Heap-Layers/blob/master/wrappers/heapredirect.h. But first, I have read all 3 papers related to heap layers. I find it very interesting, and elegant. Few years ago, I implemented my own allocation algorithm for extremely resource constrained embedded system. It was huge win in terms of performance and fragmentation. But it ended up having obvious layers for different allocations sizes and different strategies, but still being monolithic piece code. I wish I knew about heap layers back then.

Then I continued to study source code of heap layers, and I started from example of usage in scelene. I've learned a lot by reading the source code. Your github profile is such a treasure. I've learned a lot be reading comments too. For example, I didn't know that Double checked locking has a name, and even dedicated wikipedia article, I thought it just a logical but small and not that important trick.

The most important thing I've learned from the source code is that we can just redefine calloc that would return nullptr to dlsym. If it works (well obviously it works, I'm just too surprised and can't believe it), I would like to make another PR, where I would use code from https://github.com/emeryberger/Heap-Layers/blob/master/wrappers/wrapper.cpp, with your permission. We might need to change some things, since, we don't need anything else but your my_dlsym and calloc.

So let me explain why we only need code from wrapper.cpp and not from heapredirect.h:

Creating a wrapper around allocation functions is not our goal at all (it was a way of solving another problem). Code from heapredirect.h is used to accomplish that goal. Our goal, however, is to resolve infinite recursion between dlsym and calloc. That is the same problem that heapredirect.h solves by using tricks from wrapper.cpp. So we should use it too.

So, I'd like to copy some pieces of wrapper.cpp, and make another PR.

kormang commented 2 years ago

Well, unfortunately it does not work. I completely forgot that we need malloc too. During dlopen. And in such case, the trick from the heap layers does not work, we need to return real address. So what ever I tried I end up writing a ton of custom code. I wouldn't say that there is a bug in the heap layers. I would say that coz creates really specific conditions when profiling programs that use jemalloc, tcmalloc, partition alloc and similar.

One thing I could do, is, simplify code by using thread local, like in wrapper.cpp, and also atomic pointer to free space in dummy_malloc, like VGubarev did in his PR. The reason why I didn't do it, is, because I wanted to minimize effect on program being profiled. If you look at the code in this PR, you'll see that malloc calls its implemention through function pointer. Nothing else. As little interference as possible. And only at most 35 times during run time, it will point to dummy malloc, all other time it will call directly real malloc. It is not just meant for performance, but to effectively cut off coz's code from interfering with program when unnecessary.

One reason, why I don't switch to using this approach with thread local and atomic pointer, now, is because I don't think I'll gain anything. I've already spent a lot of time testing current solution, I'll have to do it all over again. However I will try to simplify my code a little bit, and use different naming where possible, to make it clear that it is in fact really simple.

kormang commented 2 years ago

I think it is easier to comprehend now.

Any way, my reasoning is like this: We have a class of programs profilable (I'm not sure if it a word, but I'll use it anyway) by coz, and we have a class of programs that, due to their alloc functions, are nonprofilable by coz. I don't want to make a change that would in any way affect profilable programs, to become nonprofilable because of some bug introduced by my patch. I have accomplished that. On the other hand, if some of nonprofilable programs are still nonprofilable, because I have a bug in my code - well, they were nonprofilable already. I hope that makes sense.

From my own testing experience the code works well. And I think I have pretty representative cases, because I've tested it on programs with 4 different allocation algorithms, and I have already encountered almost all other bugs in the issue list, and few that haven't been reported yet. Next, I'd like to move on to next bug that prevents me from effectively use coz - which is reporting lines that were never accessed (if it is even possible to solve it).

..but I'm still open for suggestion, the most important thing is to get this done and make everybody happy.

plasma-umass / coz

Add alloc shims #191

The problem

The solution

Implementation details

How to review this PR?

What I have tested?