raedwulf commented 8 years ago

Forward

I've been working on this for 3-4 days, so this text might be slightly scrambled as I've been reworking/research/implementing different parts of libdill. I initially looked into implemented segmented stacks but have come up with a faster, more refined version of simply stack switching.

Intro

What started with some minor tweaks to the libdill code base, ended up being a significant optimisation effort within libdill's core coroutine code. My initial non-intrusive changes are available here: https://github.com/raedwulf/libdill/tree/optimized-limit

Martin, this can be merged if you want, although it might be best once I've actually completed the rest of my changes.

Some micro-bench results

On a i7 4770k:

Libdill (current master) performs a context switch in 16ns (for some reason my values earlier in the year were 22ns, but perhaps compiler changes/other codebase changes improved it)
Libdill (optimised-limit) performs a context switch in 13ns
Libdill (current master) performs a creation/termination of coroutine in 44ns
Libdill (optimised-limit) performs a creation/termination of coroutine in 25ns

More noticeable on a old Core 2 Duo:

Libdill (current master) performs a context switch in 46ns
Libdill (optimised-limit) performs a context switch in 37ns
Libdill (current master) performs a creation/termination of coroutine in 131ns
Libdill (optimised-limit) performs a creation/termination of coroutine in 100ns

This is probably the limit of optimisation though inlining (I shaved off one more nanosecond with an intrusive change for my i7 4770k but it isn't worth the effort/code clarity to keep).

As a comparison, I fired up go and scraped out some direct translations of our perf programs: https://gist.github.com/raedwulf/e6bbb76a3becb5fa14eb27e670260b1b https://gist.github.com/raedwulf/e11f7bf6a25eabad3ed3e439f14b262d https://gist.github.com/raedwulf/4a2c6c5b6209c0dd839fb95ec46378aa

This is far better than on Go (on i7 4770k): context switches take 331ns and creation/termination: 237ns gccgo is even worse: 825 ns context switching and 3652 ns creation/termination ... OUCH

As I have very little experience with Go, check those gists in case I've actually missed something which causes such poor performance!

The Problem(s)

Signal Handling

There is no clear way to deal with signals within libdill. The signal mask belongs to the original context of the process and if that is changed within a coroutine, it applies everywhere. Furthermore, the signal handler for a trapped signal will be executed in context of a coroutine. This may be an issue for future more complex applications.

This may also be the reason why libmill/dill performs so well vs Go! To preserve the signal masks across different coroutines, the syscall SYS_rt_sigprocmask needs to be made. This kills performance - to try yourself, comment out the libdill's assembly routines and use sigsetjmp(ctx, 1).

To resolve this problem, I initially thought that optimising the context switching mechanism so that the signal mask is only set when the mask is different between coroutines (using cached values). In an implementation, that would involve wrapping each of the sigproc* commands with dill equivalents (like proc for fork).

The pathological case would be thrashing between two or more coroutines that have different signal masks. Since, if anywhere, the signal handler should probably be done in the same context as the scheduler and control always returns there, we still have a problem of thrashing.

The best tradeoff between semantics and performance I've come up with is to have signals handled in two different ways. Signals that need to be handled immediately (e.g. SIGSEGV, SIGKILL, etc...) are handled in all contexts (thus the signal mask is the same in the scheduler and coroutines) and signals that are used to communicate behaviour (SIGINT, SIGUSR1, SIGUSR2 etc.) that are handled in dill_wait using epoll_pwait or kqueue's EVFILT_SIGNAL.

From there, it's a design choice whether to simulate a read-write based interface for signals, implementing a signal handler callback within the coroutine and/or emulate sigpending, sigwait, and sigtimedwait.

Coroutine Limits In Performance

What happens right now

However, in digging, one of the limitations of libmill/dill is that the system runs out of mmaps on 32k coroutines - whispers is a perfect example. For networking servers with many connections, this can be an issue, although you can adjust the number of mmaps using:

sysctl -w vm.max_map_count=65530 # default

However, the performance of libdill degrades to that below of Go and increases with each addition coroutine. In my tests, libdill was averaging at around 4000-4500 ns per "whisper" on the whisper benchmark and Go was speeding up, till it averaged around 2200 ns per whisper @ 100000 coroutines.

Now, disabling the guard page (and thus no mmap through mprotect), using DILL_NOGUARD, improves the performance of libdill into the same order of magnitude of Go (actually slightly faster). However, we do lose the ability to check for stack overflows.

Memory overhead is not as bad the allocated size suggests as the majority of the memory in a stack is not resident until it is touched. This translates to being a single page of memory (either 4096 or 8192 bytes). On my system with 32gb RAM, I can handle around 6 million resident coroutines in the whisper benchmark before it gives up and starts paging like crazy. In a real application, I suspect this would be closer to 4 - 5 million in-memory.

What can happen?

I've currently updated my unpushed code base with an extended go() interface that allows optional parameters (through some standards compliant C macro magic). In particular, you can now specify,

if(stk(dstkbest | dstkmalloc | dstkmalign))
    return -1;
go(worker(), dstkguard, 8192);

which means, it'll choose the best memory allocation mechanism between malloc & posix_memalign, use page guards, and allocate a 8192 byte stack.

Omitting parameters will give defaults (currently set to dstkbest and 256kb stack) so that the original libdill interface is possible.

/* Hints - these don't end up in the final result of the mode, but they choose
   out of the following */
#define dstkbest     0x0000 /* Use this if you don't link with another library */
#define dstkcompat   0x0001 /* Use this if you do link with another library */
/* Memory Allocation methods are selected in two ways:
   - If a hint, dstkbest or dstkcompat, are chosen, then choose the best out of
     those selected here.
   - If no hints is chosen, then these flags are mutually exclusive. stk() will
     error with -1 and errno=ENOTSUP */
#define dstkdefault  0x0000
#define dstkmalloc   0x0010 /* Fallback */
#define dstkmalign   0x0020
#define dstkmmap     0x0040
/* Memory Allocation attributes only work with specifically chosen allocation
   methods.  dstkhuge is a hint, and will only apply if dstkmmap is chosen. */
#define dstkseg      0x0100 /* Only works with -fsplit-stack */
#define dstkhuge     0x0200 /* Huge pages, only works with dstkmmap */
#define dstkpool     0x0400
/* Runtime choices to be passed to go */
#define dstkfixed    0x0400 /* Fixed sized stacks, works with everything */
#define dstkguard    0x0800 /* Only works with dstkmmap and dstkmalign */

The amount of additional code required is minimal, as most of this is already implemented. The main code added is to expose the choice out at the API level.

The features I currently want to add:

Huge page support simply requires OS support and a flag to mmap (and documentation on how to do it. So implementing it is trivial, although the smallest huge page size is typically 2Mb.
Memory pool support
Segmented stack support (actually I've changed my mind, stack switching is better!)
Memory Pool Support

Given the possibility that millions of coroutines are possible, it would be nice allocate smaller sized stacks if it is known that the stack won't exceed a certain size. For example, if the call stack within the coroutine is reasonably flat - i.e. it only travels through libdill and dsock in it's processing. Some analysis would need to be done here, but I would assume this is reasonable and thus stacks could be as small as 1024 bytes.

This alone would be useful and my implementation is around 90% there for supporting this. Passing (dstkfixed | dstkhuge | dstkmmap) would give you a fixed sized allocated coroutine using a

Segmented stack support (controversial)

The Problem

The ideal situation is:

Coroutines can be made arbitrarily tiny (almost zero overhead!).
Coroutines have some stack protection (e.g. page guards).
Coroutines have very little allocation/deallocation overhead.
Coroutines have no limits except memory limits.

What we currently have is;

Coroutines can be made small - down to roughly a single page size of resident memory (4k or 8k).
Coroutines can have stack protection with mprotect but will exhaust OS limits unless they are changed. However, this increases allocation overhead as well.
Without stack protection, there is very little allocation/deallocation overhead.

Segmented stacks have problems. There's no denying that. But segmented stacks also have solutions.

Quick Primer

Stacks are usually big contiguous blocks of memory pre-allocated up front. This means they're fixed and overflowing can be easily triggered. Segmented stacks are non-contiguous. A small stack is initially allocated and when an overflow is detected, another block is then allocated.

The original gcc -fsplit-stack design was presented was presented here.

The Problems Go and Rust have found

The hot-split problem or stack-thrashing performance problem

Imagine a function call happening when the stack is almost full. The call will force the stack to grow and a new segment will be allocated. When that function returns, the segment will be freed and the stack will shrink again. Reference

Code without segmented stack support requires a large stack to be allocated. This defeats the purpose of segmented stacks. Reference

Rust has stricter memory rules, and I think they found issues with segmented stacks preventing them optimising memory usage.
And how we can handle them

In Rust and Go, it's assumed that any function can be a coroutine and can call any other function and can be interrupted at any time. This makes segmented stacks problematic.

However, in libdill, we can make a very important assumption that when calling a function that does not use stack splitting support, we don't have context switch nor callbacks (unless you have some code that links with libdill that calls the scheduler) which would have a lifetime that extends after a libdill context switch. This fact means we can make coroutines split their stack any time to a single/global scratch buffer/stack. This scratch buffer/stack could actually be the application stack as it would not be used at that time.

Implementation Issues

Using -fsplit-stack

Both GCC >4.6 and LLVM > 3.4 has segmented stacks implemented with -fsplit-stack. However, to get them working, the interface provided by libgcc is woefully painful to use. One could override __morestack and __morestack_allocate_stack_space to support libdill, and it would very likely work pretty well. However, -fsplit-stack is a global flag which applies to all functions compiled in a *.c file. This makes it cumbersome in the sense you have to explicitly mark all other functions as __attribute__((no_split_stack)) and leave the coroutines unmarked, ironically.

There's ways around it, like submitting a patch or creating a plugin for GCC/LLVM to have a flag like -fsplit-stack-explicit but the cost/benefit of developing this approach is currently on the fence.

Recommended solution (stack switching and not stack segmenting!)

However, given that this is an optimisation for coroutines, a neater solution would be to set a small size for coroutines and then wrap all functions that are likely to trigger a stack overflow with a "stack switcher". Is it safe? As safe as our coroutines, because it does a perfect tiny subset of what our context switching does; it keeps all the registers but changes the stack into a scratch buffer. Thus is is super cheap - much much cheaper than context switching.

In fact, I've just realised that I've already written a generic macro that switches the stackpointer... so I actually do support this in my current code base already. I'll wrap this up with some of my memory pool optimisation and show off this technique in a test.

The TLDR example:

coroutine helloworld() {
    /* I only use small buffers here */
   int buffer[128]; int world; int hello;
   .... /* do interesting stuff like prepare and loop buffers */
   scratch(bsend(s, buffer, sizeof(buffer), -1)); /* do something with dsock */
   scratch(externalcall("president"));  /* call some external library which uses lots of stack space */
   /* tidyup */
}

So for a constrained memory coroutine, the scratch macro gives the control for the application writer to optimise stack usage without running into any problems (except still allowing them to shoot themselves in the foot). The one advantage of -fsplit-stack is that it does this automatically at a runtime cost, making sure that it is a lot harder for the application writer to shoot themselves in the foot, but it moves the fight against the compilers instead.

sustrik commented 7 years ago

exactly, why not just pass a dummy value as the first param

raedwulf commented 7 years ago

The first argument needs to also be set in the coroutine function e.g.

void work(void *dummy, int param0) {}

It can be worked around to avoid adding the void *dummy but it becomes exceedingly difficult:

In the simple case on most platforms, it just requires copying each argument along in each register, and overflowing onto the stack.
However, on x86-64, if some of the arguments are structs, fixed sized arrays or floating point arguments, the semantics of passing them is very complex and that's not detectable/solvable at runtime.

raedwulf commented 7 years ago

Having said that, normally passing struct and fixed-sized arrays as arguments (by value) is bad practice and pointless as the compiler can't inline them with coroutines. void * is a good compromise and if needs be I can add floating point support using C11 _Generic.

The only specific circumstances I can think of using these sorts of arguments is vector types in older games. I've not seen struct-by-value used anywhere else.

raedwulf commented 7 years ago

The coroutine specifier might be required on non-x86 platforms for other reasons. On ARM, it appears that it is necessary to make sure the function uses the aapcs calling convention. In the general case it will use aapcs but if floating point arguments are passed, it can use aapcs-vfp if the target platform supports it.

sustrik commented 7 years ago

Hm, so it looks like that trampolines basically replace one set of constraints by another set of constraints. Instead of "don't do function calls in args" we'll have "first arg cannot by a struct" or somesuch.

raedwulf commented 7 years ago

That's true.

For trampoline function:

Arguably the constraints of the trampolines are fixed and can be reliably specified as opposed to the current method where a compiler hiccup could potentially cause a segfault when arguments are loaded. (EDIT: I'm exaggerating this, this is in theory provided that compilers start getting too intelligent and ignore semantics of inline assembly clobbering).
The implementation detail of context switch is hidden inside libdill rather than in the header libdill.h . Hopefully after this change, the header remains static and won't increase in size when more architectures are added as inline assembly.

The current version does have the advantage that it is simpler from a technical standpoint:

Context switching using setjmp/longjmp operations that are normally implemented on modern GCC/clang systems. These are pretty reliable.
The specific implementation of dill_trampoline requires research on each platform.

After a shower and some thinking, I think I have an interface for thread support that no longer requires dill_trampoline. In particular, if threads get constructed from coroutine handles, it doesn't matter how they are initialised: their startup process is sequential from the caller in the current method.

raedwulf commented 7 years ago

I don't think this should hold back the next release, but it is a viable proof of concept :)

sustrik commented 7 years ago

Great work though. I think you should keep the branch. One never knows.

raedwulf commented 7 years ago

I just noticed you put the reference count into hvfs - it caused a performance drop with having the reference count stored in the vfs structure on my systems (in specific cases). It seems that GCC doesn't know how to optimise so well (maybe because the vfs structure > 16 bytes).

It's not a huge issue, but I was wondering if there was a reason to use the reference count in the the hvfs structure as opposed to internal structure? Is this a remnant of the virtualised hdup?

sustrik commented 7 years ago

The background:

In dsock I want protocol being used by another protocol to be unusable from outside. For example:

s1 = tcp_connect(...);
s2 = crlf_start(s1);
bsend(s1, "ABC", 3); // Error!

To achieve that there have to be two distinct handles pointing to the same object, one owned by the user, the other one owned CRLF protocol instance. The former will be "locked", the latter will be usable in a standard way.

That being said, how much does the performance drop?

raedwulf commented 7 years ago

Oh okay! That makes sense.

Actually, I've double checked the values and it doesn't seem to affect the standard build that much.

I'll investigate further, but for some strange reason, when compiling gcc -O3 *.c perf/go.c the resulting executable ends up being roughly 33% slower, whereas it used to be the same as the shared library build.

On my laptop:

cbf96ca8838ab6ebc928eb7f600b154e783ddbe6 (gcc -O3 *.c perf/go.c): 48ns
cbf96ca8838ab6ebc928eb7f600b154e783ddbe6 (shared): 46ns
master (gcc -O3 *.c perf/go.c): 77ns
master (shared): 46ns

It's strange why it should make such a difference!

raedwulf commented 7 years ago

I worked out what the problem was: https://github.com/sustrik/libdill/blob/master/cr.c#L104

Since hvfs is part of that structure and no longer 16-bytes, the packed misaligns the elements within the cr structure. Removing packed fixes the problem.

sustrik commented 7 years ago

Ok, thanks. Fixed.

raedwulf commented 7 years ago

Side-note, I have managed to improve valgrind support with my pool allocator so that it works without spitting out a lot of valgrind errors and warnings now (my valgrind handling code was to blame as opposed to the actual allocator). I'll test it a bit more and look to integrating it with our most recent changes.

yarf commented 7 years ago

regarding vm.max_map_count etc ...

Couldn't you instead allocate many coro-stacks as a single contiguous allocation, with memalign() or whatever, and then protect a region (readonly) between each carved-out subsection via mprotect() ?

Just a thought

sustrik commented 7 years ago

I have no idea of how mprotect() works under the hood, but given that in the end we are going to get same number of memory regions with different protection flags, wouldn't the total number of maps be the same?

raedwulf commented 7 years ago

Couldn't you instead allocate many coro-stacks as a single contiguous allocation, with memalign() or whatever, and then protect a region (readonly) between each carved-out subsection via mprotect() ?

Like @sustrik says, the limit is on mprotect regions and not the number of block allocations.

sustrik commented 7 years ago

I think the optimizations were implemented by now. I'm closing the ticket.

sustrik / libdill

Libdill performance analysis (and muddled thoughts) #30

Forward

Intro

Some micro-bench results

The Problem(s)

Signal Handling

Coroutine Limits In Performance

What happens right now

What can happen?

Memory Pool Support

Segmented stack support (controversial)

The Problem

Quick Primer

The Problems Go and Rust have found

And how we can handle them

Implementation Issues

Using -fsplit-stack

Recommended solution (stack switching and not stack segmenting!)