Exactly, why not just pass a dummy value as the first param?
The first argument also needs to be set in the coroutine function, e.g. `void work(void *dummy, int param0) {}`. It can be worked around to avoid adding the `void *dummy`, but it becomes exceedingly difficult.
Having said that, normally passing structs and fixed-size arrays as arguments (by value) is bad practice and pointless, as the compiler can't inline them with coroutines. `void *` is a good compromise and, if need be, I can add floating-point support using C11 `_Generic`. The only specific circumstance I can think of for using these sorts of arguments is vector types in older games; I've not seen struct-by-value used anywhere else.
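A minimal sketch of how such a C11 `_Generic` dispatch could look (the helper names `dill_arg_ptr`/`dill_arg_fp` are made up for illustration and are not part of libdill):

```c
#include <stdio.h>

static void dill_arg_ptr(void *p) { printf("pointer argument: %p\n", p); }
static void dill_arg_fp(double d) { printf("floating-point argument: %f\n", d); }

/* Pick the handler based on the static type of the argument. */
#define dill_arg(x) _Generic((x), \
    float: dill_arg_fp,           \
    double: dill_arg_fp,          \
    default: dill_arg_ptr)(x)

int main(void) {
    int i = 42;
    dill_arg(&i);    /* default branch: passed as void * */
    dill_arg(3.14);  /* double branch: passed as double  */
    return 0;
}
```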
The `coroutine` specifier might be required on non-x86 platforms for other reasons. On ARM, it appears that it is necessary to make sure the function uses the `aapcs` calling convention. In the general case it will use `aapcs`, but if floating-point arguments are passed, it can use `aapcs-vfp` if the target platform supports it.
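For instance (an untested assumption, not libdill's actual definition of `coroutine`), the specifier could pin the base convention on ARM using GCC's `pcs` function attribute:

```c
/* Hypothetical sketch: force the base AAPCS calling convention on ARM so
   floating-point arguments travel in core registers even on a hard-float
   (aapcs-vfp) target. Not how libdill currently defines "coroutine". */
#if defined(__arm__) && defined(__GNUC__)
#define coroutine __attribute__((pcs("aapcs")))
#else
#define coroutine
#endif

coroutine void worker(float x, int n) {
    /* With pcs("aapcs"), x arrives in a core register rather than in s0. */
    (void)x;
    (void)n;
}
```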
Hm, so it looks like trampolines basically replace one set of constraints with another. Instead of "don't do function calls in args" we'll have "the first arg cannot be a struct" or some such.
That's true. For the trampoline function: the current version does have the advantage of being simpler from a technical standpoint, since `dill_trampoline` requires research on each platform.

After a shower and some thinking, I think I have an interface for thread support that no longer requires `dill_trampoline`. In particular, if threads get constructed from coroutine handles, it doesn't matter how they are initialised: their startup process is sequential from the caller in the current method. I don't think this should hold back the next release, but it is a viable proof of concept :)
Great work though. I think you should keep the branch. One never knows.
I just noticed you put the reference count into `hvfs` - on my systems it caused a performance drop (in specific cases) to have the reference count stored in the vfs structure. It seems that GCC doesn't know how to optimise it so well (maybe because the vfs structure is > 16 bytes). It's not a huge issue, but I was wondering if there was a reason to put the reference count in the `hvfs` structure as opposed to the internal structure? Is this a remnant of the virtualised `hdup`?
The background:
In dsock I want a protocol that is being used by another protocol to be unusable from the outside. For example:
```c
s1 = tcp_connect(...);
s2 = crlf_start(s1);
bsend(s1, "ABC", 3); // Error!
```
To achieve that, there have to be two distinct handles pointing to the same object: one owned by the user, the other owned by the CRLF protocol instance. The former will be "locked", the latter will be usable in the standard way.
That being said, how much does the performance drop?
Oh okay! That makes sense.
Actually, I've double-checked the values and it doesn't seem to affect the standard build that much. I'll investigate further, but for some strange reason, when compiling with `gcc -O3 *.c perf/go.c` the resulting executable ends up being roughly 33% slower, whereas it used to be the same as the shared library build. On my laptop (`gcc -O3 *.c perf/go.c`): 48ns vs 77ns. It's strange why it should make such a difference!
I worked out what the problem was: https://github.com/sustrik/libdill/blob/master/cr.c#L104
Since `hvfs` is part of that structure and no longer 16 bytes, the `packed` misaligns the elements within the `cr` structure. Removing `packed` fixes the problem.
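For reference, a small standalone demonstration of the mechanism (simplified field names, not the actual `dill_cr` layout): once the prefix of the struct is no longer a multiple of 8 bytes, `packed` leaves the following 8-byte field at a misaligned offset.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct cr_packed {
    void *query;
    void *close;
    int refcount;       /* 20 bytes so far on a 64-bit target */
    uint64_t counter;   /* offset 20: misaligned 8-byte field  */
} __attribute__((packed));

struct cr_plain {
    void *query;
    void *close;
    int refcount;       /* compiler pads to offset 24 below */
    uint64_t counter;   /* offset 24: naturally aligned     */
};

int main(void) {
    printf("counter offset: packed %zu, plain %zu\n",
           offsetof(struct cr_packed, counter),
           offsetof(struct cr_plain, counter));
    return 0;
}
```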
Ok, thanks. Fixed.
Side note: I have managed to improve valgrind support with my pool allocator so that it now works without spitting out a lot of valgrind errors and warnings (my valgrind handling code was to blame rather than the actual allocator). I'll test it a bit more and look into integrating it with our most recent changes.
Regarding `vm.max_map_count` etc...
Couldn't you instead allocate many coro-stacks as a single contiguous allocation, with memalign() or whatever, and then protect a region (read-only) between each carved-out subsection via mprotect()?
Just a thought
I have no idea how mprotect() works under the hood, but given that in the end we are going to get the same number of memory regions with different protection flags, wouldn't the total number of maps be the same?
> Couldn't you instead allocate many coro-stacks as a single contiguous allocation, with memalign() or whatever, and then protect a region (readonly) between each carved-out subsection via mprotect()?
Like @sustrik says, the limit is on `mprotect` regions and not on the number of block allocations.
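To make that concrete, here's a small Linux-only check (a sketch, not part of libdill) that counts lines in /proc/self/maps before and after carving guard pages out of one big mapping - each PROT_NONE hole splits the mapping into more VMAs, so the map count grows just as it would with per-stack allocations.

```c
#include <stdio.h>
#include <sys/mman.h>

static int count_maps(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    int lines = 0, ch;
    if (!f) return -1;
    while ((ch = fgetc(f)) != EOF)
        if (ch == '\n') lines++;
    fclose(f);
    return lines;
}

int main(void) {
    const size_t page = 4096, nstacks = 8, stack = 16 * page;
    size_t total = nstacks * (stack + page);
    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return 1;
    int before = count_maps();
    for (size_t i = 0; i < nstacks; i++)          /* one guard page per stack */
        mprotect(base + i * (stack + page), page, PROT_NONE);
    printf("maps before: %d, after: %d\n", before, count_maps());
    return 0;
}
```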
I think the optimizations have been implemented by now. I'm closing the ticket.
Foreword
I've been working on this for 3-4 days, so this text might be slightly scrambled as I've been reworking, researching, and implementing different parts of libdill. I initially looked into implementing segmented stacks but have come up with a faster, more refined approach: simply switching stacks.
Intro
What started as some minor tweaks to the libdill code base ended up being a significant optimisation effort within libdill's core coroutine code. My initial non-intrusive changes are available here: https://github.com/raedwulf/libdill/tree/optimized-limit
Martin, this can be merged if you want, although it might be best to wait until I've actually completed the rest of my changes.
Some micro-bench results
On an i7 4770k:
More noticeable on an old Core 2 Duo:
This is probably the limit of optimisation through inlining (I shaved off one more nanosecond with an intrusive change on my i7 4770k, but it isn't worth the effort/code clarity to keep).
As a comparison, I fired up Go and scraped out some direct translations of our perf programs: https://gist.github.com/raedwulf/e6bbb76a3becb5fa14eb27e670260b1b https://gist.github.com/raedwulf/e11f7bf6a25eabad3ed3e439f14b262d https://gist.github.com/raedwulf/4a2c6c5b6209c0dd839fb95ec46378aa
This is far better than Go (on the i7 4770k), where context switches take 331ns and creation/termination takes 237ns. gccgo is even worse: 825ns context switching and 3652ns creation/termination... OUCH.
As I have very little experience with Go, check those gists in case I've actually missed something that causes such poor performance!
The Problem(s)
Signal Handling
There is no clear way to deal with signals within libdill. The signal mask belongs to the original context of the process, and if it is changed within a coroutine, the change applies everywhere. Furthermore, the signal handler for a trapped signal will be executed in the context of a coroutine. This may be an issue for future, more complex applications.
This may also be the reason why libmill/dill performs so well vs Go! To preserve the signal masks across different coroutines, the `SYS_rt_sigprocmask` syscall needs to be made. This kills performance - to try it yourself, comment out libdill's assembly routines and use `sigsetjmp(ctx, 1)`.

To resolve this problem, I initially thought of optimising the context-switching mechanism so that the signal mask is only set when it differs between coroutines (using cached values). In an implementation, that would involve wrapping each of the `sigproc*` calls with dill equivalents (like `proc` for `fork`). The pathological case would be thrashing between two or more coroutines that have different signal masks. Since the signal handler, if run anywhere, should probably run in the same context as the scheduler, and control always returns there, we still have a problem of thrashing.
The best tradeoff between semantics and performance I've come up with is to have signals handled in two different ways. Signals that need to be handled immediately (e.g. `SIGSEGV`, `SIGKILL`, etc.) are handled in all contexts (thus the signal mask is the same in the scheduler and the coroutines), while signals that are used to communicate behaviour (`SIGINT`, `SIGUSR1`, `SIGUSR2`, etc.) are handled in `dill_wait` using `epoll_pwait` or kqueue's `EVFILT_SIGNAL`. From there, it's a design choice whether to simulate a read/write based interface for signals, implement a signal handler callback within the coroutine, and/or emulate `sigpending`, `sigwait`, and `sigtimedwait`.
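A minimal sketch of that second category (Linux-only, an illustration rather than libdill's actual code): block the "communication" signals once up front, and only let them through atomically while the scheduler is parked in `epoll_pwait`.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>

static void on_signal(int sig) { (void)sig; }   /* handler makes EINTR possible */

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGUSR1, &sa, NULL);

    /* Block the signals for normal execution; no per-switch sigprocmask. */
    sigset_t blocked;
    sigemptyset(&blocked);
    sigaddset(&blocked, SIGINT);
    sigaddset(&blocked, SIGUSR1);
    sigprocmask(SIG_BLOCK, &blocked, NULL);

    int epfd = epoll_create1(0);
    struct epoll_event evs[16];

    /* While waiting, atomically swap in an empty mask so the signals can be
       delivered; epoll_pwait then returns -1 with errno == EINTR. */
    sigset_t waitmask;
    sigemptyset(&waitmask);
    int n = epoll_pwait(epfd, evs, 16, 1000 /* ms */, &waitmask);
    printf("epoll_pwait returned %d\n", n);
    return 0;
}
```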
Coroutine Limits In Performance
What happens right now
However, while digging, I found that one of the limitations of libmill/dill is that the system runs out of mmaps at 32k coroutines - the whispers benchmark is a perfect example. For networking servers with many connections, this can be an issue, although you can adjust the number of mmaps using:
```
sysctl -w vm.max_map_count=65530 # default
```
However, the performance of libdill degrades to below that of Go and worsens with each additional coroutine. In my tests, libdill was averaging around 4000-4500 ns per "whisper" on the whisper benchmark, while Go was speeding up until it averaged around 2200 ns per whisper at 100,000 coroutines.
Now, disabling the guard page (and thus no `mmap` through `mprotect`) using `DILL_NOGUARD` improves the performance of libdill into the same order of magnitude as Go (actually slightly faster). However, we do lose the ability to check for stack overflows.

Memory overhead is not as bad as the allocated size suggests, because the majority of the memory in a stack is not resident until it is touched. This translates to a single page of memory (either 4096 or 8192 bytes). On my system with 32GB of RAM, I can handle around 6 million resident coroutines in the whisper benchmark before it gives up and starts paging like crazy. In a real application, I suspect this would be closer to 4-5 million in memory.
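For context, a rough sketch of the guard-page scheme that `DILL_NOGUARD` switches off (an illustration, not libdill's allocator): map the stack plus one extra page and make the bottom page inaccessible, so overflowing faults instead of silently corrupting the neighbouring allocation. Each such protected region is what counts against `vm.max_map_count`.

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns a pointer to the top of a usable stack of stack_size bytes,
   with an inaccessible guard page below it (stacks grow downwards). */
static void *alloc_guarded_stack(size_t stack_size) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = stack_size + page;
    char *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    if (mprotect(p, page, PROT_NONE) != 0) {   /* lowest page = guard */
        munmap(p, total);
        return NULL;
    }
    return p + total;
}

int main(void) {
    void *top = alloc_guarded_stack(256 * 1024);
    printf("usable stack top: %p\n", top);
    return 0;
}
```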
What can happen?
I've currently updated my unpushed code base with an extended `go()` interface that allows optional parameters (through some standards-compliant C macro magic). In particular, you can now specify, for example, a combination of options which means it'll choose the best memory allocation mechanism between `malloc` and `posix_memalign`, use page guards, and allocate an 8192-byte stack.

Omitting parameters will give the defaults (currently set to `dstkbest` and a 256kb stack) so that the original libdill interface is still possible.

The amount of additional code required is minimal, as most of this is already implemented. The main code added is to expose the choice at the API level.
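As a purely hypothetical sketch of that kind of macro magic (the names `go_opt`, `go_impl`, `DSTK_DEFAULT` and the numbers are illustrative, not the actual unpushed interface): count `__VA_ARGS__` to fill in defaults for omitted parameters.

```c
#include <stddef.h>
#include <stdio.h>

#define DSTK_DEFAULT 0
#define DEFAULT_STACK_SIZE (256 * 1024)

static void go_impl(void (*fn)(void), int flags, size_t stacklen) {
    /* A real implementation would allocate the stack and launch fn here. */
    printf("launching coroutine: flags=%d stack=%zu bytes\n", flags, stacklen);
    fn();
}

/* Select GO_1/GO_2/GO_3 based on how many arguments were supplied. */
#define GO_SELECT(_1, _2, _3, NAME, ...) NAME
#define go_opt(...) GO_SELECT(__VA_ARGS__, GO_3, GO_2, GO_1, )(__VA_ARGS__)
#define GO_1(fn)             go_impl(fn, DSTK_DEFAULT, DEFAULT_STACK_SIZE)
#define GO_2(fn, flags)      go_impl(fn, flags, DEFAULT_STACK_SIZE)
#define GO_3(fn, flags, len) go_impl(fn, flags, len)

static void worker(void) { printf("hello from worker\n"); }

int main(void) {
    go_opt(worker);            /* all defaults */
    go_opt(worker, 1, 8192);   /* explicit flags and an 8192-byte stack */
    return 0;
}
```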
The features I currently want to add:
Memory Pool Support
Given that millions of coroutines are possible, it would be nice to allocate smaller stacks when it is known that the stack won't exceed a certain size - for example, if the call stack within the coroutine is reasonably flat, i.e. it only travels through libdill and dsock in its processing. Some analysis would need to be done here, but I would assume this is reasonable, and thus stacks could be as small as 1024 bytes.
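As a rough sketch of the direction (names and sizes are illustrative, not the actual pool allocator): carve a fixed arena into 1024-byte stacks once and hand them out from an intrusive free list, so launching a coroutine never touches malloc or mmap.

```c
#include <stdio.h>

#define POOL_STACK_SIZE 1024
#define POOL_COUNT      1024

static _Alignas(16) char pool_arena[POOL_COUNT][POOL_STACK_SIZE];
static void *pool_free_list;     /* next pointer stored inside each free block */

static void pool_init(void) {
    for (int i = 0; i < POOL_COUNT; i++) {
        *(void **)pool_arena[i] = pool_free_list;
        pool_free_list = pool_arena[i];
    }
}

static void *pool_get(void) {
    void *p = pool_free_list;
    if (p) pool_free_list = *(void **)p;
    return p;                    /* NULL when the pool is exhausted */
}

static void pool_put(void *p) {
    *(void **)p = pool_free_list;
    pool_free_list = p;
}

int main(void) {
    pool_init();
    void *stk = pool_get();
    printf("got a %d-byte stack at %p\n", POOL_STACK_SIZE, stk);
    pool_put(stk);
    return 0;
}
```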
This alone would be useful and my implementation is around 90% there for supporting this. Passing `(dstkfixed | dstkhuge | dstkmmap)` would give you a coroutine with a fixed-size, huge-page, mmap-backed stack allocation.

Segmented stack support (controversial)
The Problem
The ideal situation is:
What we currently have is:
Segmented stacks have problems. There's no denying that. But segmented stacks also have solutions.
Quick Primer
Stacks are usually big contiguous blocks of memory pre-allocated up front. This means they're fixed in size and overflow can easily be triggered. Segmented stacks are non-contiguous: a small stack is initially allocated and, when an overflow is detected, another block is allocated.
The original gcc -fsplit-stack design was presented here.
The Problems Go and Rust have found
In Rust and Go, it's assumed that any function can be a coroutine, can call any other function, and can be interrupted at any time. This makes segmented stacks problematic.
However, in libdill we can make a very important assumption: when calling a function that does not use stack-splitting support, there is no context switch and there are no callbacks (unless you have some code linked with libdill that calls the scheduler) whose lifetime extends beyond a libdill context switch. This means we can make coroutines split their stack at any time onto a single, global scratch buffer/stack. This scratch buffer/stack could actually be the application stack, as it would not be in use at that time.
Implementation Issues
Using -fsplit-stack
Both GCC > 4.6 and LLVM > 3.4 have segmented stacks implemented via `-fsplit-stack`. However, to get them working, the interface provided by libgcc is woefully painful to use. One could override `__morestack` and `__morestack_allocate_stack_space` to support libdill, and it would very likely work pretty well. However, `-fsplit-stack` is a global flag which applies to all functions compiled in a *.c file. This makes it cumbersome, in the sense that you have to explicitly mark all other functions with `__attribute__((no_split_stack))` and, ironically, leave the coroutines unmarked.

There are ways around it, like submitting a patch or creating a plugin for GCC/LLVM to add a flag like `-fsplit-stack-explicit`, but the cost/benefit of developing this approach is currently on the fence.

Recommended solution (stack switching, not stack segmenting!)
However, given that this is an optimisation for coroutines, a neater solution would be to set a small stack size for coroutines and then wrap all functions that are likely to trigger a stack overflow with a "stack switcher". Is it safe? As safe as our coroutines, because it does a tiny subset of what our context switching does: it keeps all the registers but changes the stack to a scratch buffer. Thus it is super cheap - much, much cheaper than context switching.
In fact, I've just realised that I've already written a generic macro that switches the stack pointer... so I actually already support this in my current code base. I'll wrap this up with some of my memory pool optimisations and show off this technique in a test.
The TLDR example:
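A minimal stand-in for the idea (the real `scratch` macro swaps only the stack pointer and is far cheaper; this version uses portable `ucontext` purely for illustration):

```c
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

static char scratch_stack[256 * 1024];      /* one big shared scratch buffer */
static ucontext_t caller_ctx, scratch_ctx;

/* A call that needs far more stack than a tiny coroutine stack provides. */
static void stack_hungry(void) {
    char big[64 * 1024];
    memset(big, 1, sizeof(big));
    printf("used %zu bytes of locals on the scratch stack\n", sizeof(big));
}

/* Run fn() on the scratch stack, then resume on the caller's own stack. */
static void scratch_call(void (*fn)(void)) {
    getcontext(&scratch_ctx);
    scratch_ctx.uc_stack.ss_sp = scratch_stack;
    scratch_ctx.uc_stack.ss_size = sizeof(scratch_stack);
    scratch_ctx.uc_link = &caller_ctx;       /* come back here when fn returns */
    makecontext(&scratch_ctx, fn, 0);
    swapcontext(&caller_ctx, &scratch_ctx);
}

int main(void) {
    scratch_call(stack_hungry);
    return 0;
}
```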
So for a memory-constrained coroutine, the `scratch` macro gives the application writer control to optimise stack usage without running into any problems (while still allowing them to shoot themselves in the foot). The one advantage of `-fsplit-stack` is that it does this automatically, at a runtime cost, making it a lot harder for the application writer to shoot themselves in the foot - but it moves the fight to the compilers instead.