raedwulf commented 8 years ago

Forward

I've been working on this for 3-4 days, so this text might be slightly scrambled as I've been reworking/research/implementing different parts of libdill. I initially looked into implemented segmented stacks but have come up with a faster, more refined version of simply stack switching.

Intro

What started with some minor tweaks to the libdill code base, ended up being a significant optimisation effort within libdill's core coroutine code. My initial non-intrusive changes are available here: https://github.com/raedwulf/libdill/tree/optimized-limit

Martin, this can be merged if you want, although it might be best once I've actually completed the rest of my changes.

Some micro-bench results

On a i7 4770k:

Libdill (current master) performs a context switch in 16ns (for some reason my values earlier in the year were 22ns, but perhaps compiler changes/other codebase changes improved it)
Libdill (optimised-limit) performs a context switch in 13ns
Libdill (current master) performs a creation/termination of coroutine in 44ns
Libdill (optimised-limit) performs a creation/termination of coroutine in 25ns

More noticeable on a old Core 2 Duo:

Libdill (current master) performs a context switch in 46ns
Libdill (optimised-limit) performs a context switch in 37ns
Libdill (current master) performs a creation/termination of coroutine in 131ns
Libdill (optimised-limit) performs a creation/termination of coroutine in 100ns

This is probably the limit of optimisation though inlining (I shaved off one more nanosecond with an intrusive change for my i7 4770k but it isn't worth the effort/code clarity to keep).

As a comparison, I fired up go and scraped out some direct translations of our perf programs: https://gist.github.com/raedwulf/e6bbb76a3becb5fa14eb27e670260b1b https://gist.github.com/raedwulf/e11f7bf6a25eabad3ed3e439f14b262d https://gist.github.com/raedwulf/4a2c6c5b6209c0dd839fb95ec46378aa

This is far better than on Go (on i7 4770k): context switches take 331ns and creation/termination: 237ns gccgo is even worse: 825 ns context switching and 3652 ns creation/termination ... OUCH

As I have very little experience with Go, check those gists in case I've actually missed something which causes such poor performance!

The Problem(s)

Signal Handling

There is no clear way to deal with signals within libdill. The signal mask belongs to the original context of the process and if that is changed within a coroutine, it applies everywhere. Furthermore, the signal handler for a trapped signal will be executed in context of a coroutine. This may be an issue for future more complex applications.

This may also be the reason why libmill/dill performs so well vs Go! To preserve the signal masks across different coroutines, the syscall SYS_rt_sigprocmask needs to be made. This kills performance - to try yourself, comment out the libdill's assembly routines and use sigsetjmp(ctx, 1).

To resolve this problem, I initially thought that optimising the context switching mechanism so that the signal mask is only set when the mask is different between coroutines (using cached values). In an implementation, that would involve wrapping each of the sigproc* commands with dill equivalents (like proc for fork).

The pathological case would be thrashing between two or more coroutines that have different signal masks. Since, if anywhere, the signal handler should probably be done in the same context as the scheduler and control always returns there, we still have a problem of thrashing.

The best tradeoff between semantics and performance I've come up with is to have signals handled in two different ways. Signals that need to be handled immediately (e.g. SIGSEGV, SIGKILL, etc...) are handled in all contexts (thus the signal mask is the same in the scheduler and coroutines) and signals that are used to communicate behaviour (SIGINT, SIGUSR1, SIGUSR2 etc.) that are handled in dill_wait using epoll_pwait or kqueue's EVFILT_SIGNAL.

From there, it's a design choice whether to simulate a read-write based interface for signals, implementing a signal handler callback within the coroutine and/or emulate sigpending, sigwait, and sigtimedwait.

Coroutine Limits In Performance

What happens right now

However, in digging, one of the limitations of libmill/dill is that the system runs out of mmaps on 32k coroutines - whispers is a perfect example. For networking servers with many connections, this can be an issue, although you can adjust the number of mmaps using:

sysctl -w vm.max_map_count=65530 # default

However, the performance of libdill degrades to that below of Go and increases with each addition coroutine. In my tests, libdill was averaging at around 4000-4500 ns per "whisper" on the whisper benchmark and Go was speeding up, till it averaged around 2200 ns per whisper @ 100000 coroutines.

Now, disabling the guard page (and thus no mmap through mprotect), using DILL_NOGUARD, improves the performance of libdill into the same order of magnitude of Go (actually slightly faster). However, we do lose the ability to check for stack overflows.

Memory overhead is not as bad the allocated size suggests as the majority of the memory in a stack is not resident until it is touched. This translates to being a single page of memory (either 4096 or 8192 bytes). On my system with 32gb RAM, I can handle around 6 million resident coroutines in the whisper benchmark before it gives up and starts paging like crazy. In a real application, I suspect this would be closer to 4 - 5 million in-memory.

What can happen?

I've currently updated my unpushed code base with an extended go() interface that allows optional parameters (through some standards compliant C macro magic). In particular, you can now specify,

if(stk(dstkbest | dstkmalloc | dstkmalign))
    return -1;
go(worker(), dstkguard, 8192);

which means, it'll choose the best memory allocation mechanism between malloc & posix_memalign, use page guards, and allocate a 8192 byte stack.

Omitting parameters will give defaults (currently set to dstkbest and 256kb stack) so that the original libdill interface is possible.

/* Hints - these don't end up in the final result of the mode, but they choose
   out of the following */
#define dstkbest     0x0000 /* Use this if you don't link with another library */
#define dstkcompat   0x0001 /* Use this if you do link with another library */
/* Memory Allocation methods are selected in two ways:
   - If a hint, dstkbest or dstkcompat, are chosen, then choose the best out of
     those selected here.
   - If no hints is chosen, then these flags are mutually exclusive. stk() will
     error with -1 and errno=ENOTSUP */
#define dstkdefault  0x0000
#define dstkmalloc   0x0010 /* Fallback */
#define dstkmalign   0x0020
#define dstkmmap     0x0040
/* Memory Allocation attributes only work with specifically chosen allocation
   methods.  dstkhuge is a hint, and will only apply if dstkmmap is chosen. */
#define dstkseg      0x0100 /* Only works with -fsplit-stack */
#define dstkhuge     0x0200 /* Huge pages, only works with dstkmmap */
#define dstkpool     0x0400
/* Runtime choices to be passed to go */
#define dstkfixed    0x0400 /* Fixed sized stacks, works with everything */
#define dstkguard    0x0800 /* Only works with dstkmmap and dstkmalign */

The amount of additional code required is minimal, as most of this is already implemented. The main code added is to expose the choice out at the API level.

The features I currently want to add:

Huge page support simply requires OS support and a flag to mmap (and documentation on how to do it. So implementing it is trivial, although the smallest huge page size is typically 2Mb.
Memory pool support
Segmented stack support (actually I've changed my mind, stack switching is better!)
Memory Pool Support

Given the possibility that millions of coroutines are possible, it would be nice allocate smaller sized stacks if it is known that the stack won't exceed a certain size. For example, if the call stack within the coroutine is reasonably flat - i.e. it only travels through libdill and dsock in it's processing. Some analysis would need to be done here, but I would assume this is reasonable and thus stacks could be as small as 1024 bytes.

This alone would be useful and my implementation is around 90% there for supporting this. Passing (dstkfixed | dstkhuge | dstkmmap) would give you a fixed sized allocated coroutine using a

Segmented stack support (controversial)

The Problem

The ideal situation is:

Coroutines can be made arbitrarily tiny (almost zero overhead!).
Coroutines have some stack protection (e.g. page guards).
Coroutines have very little allocation/deallocation overhead.
Coroutines have no limits except memory limits.

What we currently have is;

Coroutines can be made small - down to roughly a single page size of resident memory (4k or 8k).
Coroutines can have stack protection with mprotect but will exhaust OS limits unless they are changed. However, this increases allocation overhead as well.
Without stack protection, there is very little allocation/deallocation overhead.

Segmented stacks have problems. There's no denying that. But segmented stacks also have solutions.

Quick Primer

Stacks are usually big contiguous blocks of memory pre-allocated up front. This means they're fixed and overflowing can be easily triggered. Segmented stacks are non-contiguous. A small stack is initially allocated and when an overflow is detected, another block is then allocated.

The original gcc -fsplit-stack design was presented was presented here.

The Problems Go and Rust have found

The hot-split problem or stack-thrashing performance problem

Imagine a function call happening when the stack is almost full. The call will force the stack to grow and a new segment will be allocated. When that function returns, the segment will be freed and the stack will shrink again. Reference

Code without segmented stack support requires a large stack to be allocated. This defeats the purpose of segmented stacks. Reference

Rust has stricter memory rules, and I think they found issues with segmented stacks preventing them optimising memory usage.
And how we can handle them

In Rust and Go, it's assumed that any function can be a coroutine and can call any other function and can be interrupted at any time. This makes segmented stacks problematic.

However, in libdill, we can make a very important assumption that when calling a function that does not use stack splitting support, we don't have context switch nor callbacks (unless you have some code that links with libdill that calls the scheduler) which would have a lifetime that extends after a libdill context switch. This fact means we can make coroutines split their stack any time to a single/global scratch buffer/stack. This scratch buffer/stack could actually be the application stack as it would not be used at that time.

Implementation Issues

Using -fsplit-stack

Both GCC >4.6 and LLVM > 3.4 has segmented stacks implemented with -fsplit-stack. However, to get them working, the interface provided by libgcc is woefully painful to use. One could override __morestack and __morestack_allocate_stack_space to support libdill, and it would very likely work pretty well. However, -fsplit-stack is a global flag which applies to all functions compiled in a *.c file. This makes it cumbersome in the sense you have to explicitly mark all other functions as __attribute__((no_split_stack)) and leave the coroutines unmarked, ironically.

There's ways around it, like submitting a patch or creating a plugin for GCC/LLVM to have a flag like -fsplit-stack-explicit but the cost/benefit of developing this approach is currently on the fence.

Recommended solution (stack switching and not stack segmenting!)

However, given that this is an optimisation for coroutines, a neater solution would be to set a small size for coroutines and then wrap all functions that are likely to trigger a stack overflow with a "stack switcher". Is it safe? As safe as our coroutines, because it does a perfect tiny subset of what our context switching does; it keeps all the registers but changes the stack into a scratch buffer. Thus is is super cheap - much much cheaper than context switching.

In fact, I've just realised that I've already written a generic macro that switches the stackpointer... so I actually do support this in my current code base already. I'll wrap this up with some of my memory pool optimisation and show off this technique in a test.

The TLDR example:

coroutine helloworld() {
    /* I only use small buffers here */
   int buffer[128]; int world; int hello;
   .... /* do interesting stuff like prepare and loop buffers */
   scratch(bsend(s, buffer, sizeof(buffer), -1)); /* do something with dsock */
   scratch(externalcall("president"));  /* call some external library which uses lots of stack space */
   /* tidyup */
}

So for a constrained memory coroutine, the scratch macro gives the control for the application writer to optimise stack usage without running into any problems (except still allowing them to shoot themselves in the foot). The one advantage of -fsplit-stack is that it does this automatically at a runtime cost, making sure that it is a lot harder for the application writer to shoot themselves in the foot, but it moves the fight against the compilers instead.

sustrik commented 8 years ago

Wow. Impressive.

Some comments, in no particular order:

Libdill is not a language runtime and thus we are not fully in control. It's not reasonable to expect us to have full feature parity with Go or Rust. That being said, we should at least try.
Signal masks: Yes, the original implementation saved the sigmask but the performance improvement of not doing so was so big I've removed it.
One option you haven't mentioned is to assume that all signals are blocked in the user thread and create a dedicated thread to receive signals and push them into a channel. Not nice but still an option.
As for memory allocation what having an "allocator" interface and passing that to go() macro? That way we can plug in any allocator implementations without a need to change the API.

a = myallocator(...)
go(foo(), a);

Some allocators could be exported as global variables:

go(foo(), default_stack_allocator);

It would be also nice to have an option to switch stack size measurement on. Coroutine prologue would then fill the stack with, say, 0xAA, run the coroutine and epilogue would look for the first non-AA byte and report the stack size.
I've looked into split stack myself some time ago but having to do even more compiler fighting scared me away. Also -- without stack guards and with memory overcommit -- we are basically allocating stack in 4kB chunks which is not very different from Go's 2kB chunk.
The scratch() macro looks awesome. It adheres to C's philosophy of putting the user in control. If we add an allocator parameter (see pt.4) the solution will be super-generic.
Basically, scratch would be a coroutine invocation that waits till the coroutine finishes. That being said: would it require all scratched functions to be declared with "coroutine" specifier? Even more generally: With you new stack-switching code, is there a need for coroutine specifier at all?

raedwulf commented 8 years ago

Quick replies following your numbering:

With -fsplit-stack support, we can be almost fully general. Instead of compiler-fighting, we could probably try and get a patch accepted into gcc/clang to make our lives easier by having the coroutine being declared as #define coroutine __attribute__((split_stack)) rather having the opposite problem of declaring non-coroutines everywhere. The alternative I had in mind was having the user create a program that compiles -fsplit-stack on a C file with coroutines and omitting that flag without. In either case, it's burdensome, and hence, compiler support would be desired.
Yeah, there's not much we can do with the performance. I could probably make it faster than sigsetjmp but it will be an order of magnitude slower than what we have currently due to that syscall.
Thankfully only some signals need to be handled immediately and the non-critical signals will be dealt with eventually by the scheduler. Having user-thread signal interface could be an option (like proc is for fork). We could register a user-thread specific signal handler that is called by the scheduler.
Yes, that could be a good choice (and won't impact on performance if done correctly). My earlier example could be:

go(worker(), memalign_allocator(dstkguard, 8192));

This could work, but it won't in all cases. An alternative is to have a macro that does diagnostic checks of the stack size (as well). The downside is that the user would have to insert stkchk() manually in the coroutine.
Interestingly, although almost all systems use 4kb as the page size now. Newer systems (depending on distribution/hardware support) sometimes use 8kb as default. I think it might depend on how much RAM you have installed as well. With the memory pool option, we can allocate smaller than that (and faster because we don't need a syscall) - which might be useful if someone wants to play around with tiny coroutines.
Yes, I agree!
I'm currently checking this now, because the coroutine specifier basically forces the function to be non-inlined. If I can force that same thing to occur at the call-site (I suspect with the C rules of volatile and asm), then the coroutine specifier can actually be dropped completely. That being said, calling the function through its pointer will also work (although, again, will need to check the performance implications of that).

raedwulf commented 8 years ago

Interestingly, my initial approach to stack segments was to manually insert a new macro, stkchk(), but I shied away when thinking about external functions - but with the scratch buffer, stkchk() becomes a viable approach to protecting the stack within coroutine functions.

In short, libdill-based coroutines do not normally grow their stack excessively before being context-switched out. The only cases are:

The coroutine calls alloca/VLAs to inside before libdill scheduler.
The coroutine calls another function which grows the stack and then calls the libdill scheduler (externally from the coroutine).
The coroutine calls a libdill/dsock function that then context switches out.
The coroutine makes one big static buffer which triggers one of 1,2,3 when they occur.

Adding stkchk() to each one of these cases would help prevent/detect an overflow. The only ones that make an eyesore are VLAs. If we assume that we cover all of 1,2,3, only code local to coroutine can trigger 4 to stack overflow provided none of the other calls are present (making it exceeding rare and obvious to the programmer).

volatile int somevalue = 123;
coroutine void helloworld() {
     int bigbuffer[1024];
     stkchkbv(dostuffonsamebuffer()); /* need to check the stack before and after call */
     int stuff = stkchkb(dostuffonsamebufferreturnint()); /* same but returns a value */
     scratch(externalfunction()); /* scratch buffer avoids the need completely */
     void *dynamem = dalloca(1024); /* we wrap alloca so it checks the stack */
     int vlarray[somevalue];
     stkchk(); /* you can also use stkchkbv() but that does two checks back to back */
}

Sadly, due to the semantics of C's typeof and GNU's __auto_type, we can't detect void functions in macros (apart from a compiler error).

That being said. We/A user can make these look a lot nicer by wrapping everything in their own macros if they wanted to.

volatile int somevalue = 123;
#define safe_dostuffonsamebuffer() stkchkbv(dostuffonsamebuffer)
#define scratch_externalfunction(a) scratch(externalfunction(a))
#define safe_dostuffonsamebufferreturnint() stkchkb(dostuffonsamebufferreturnint)
#define vla(t,n,s) t n[s]; stkchk();

coroutine void helloworld() {
     int bigbuffer[1024];
     safe_dostuffonsamebuffer(); /* need to check the stack before and after call */
     int stuff = safe_dostuffonsamebufferreturnint(); /* same but returns a value */
     scratch_externalfunction(a); /* scratch buffer avoids the need completely */
     void *dynamem = dalloca(1024); /* we wrap alloca so it checks the stack */
     vla(int, vlarray, somevalue);
}

sustrik commented 8 years ago

In which cases wouldn't 0xAA-ing the stack work?
Assuming that the guard page is switched off, can we check for buffer overflow during the context switch? It would not be fool-proof but still better than nothing.
We can also make libdill/dsock functions (msleep, fdin, choose, bsend etc.) do stack checking. That could improve it a bit more.
As for gcc/clang work, I definitely don't have enough free time to engage in that. So you'd be on your own there. Also, it's a long-distance run. It takes years till the a compiler version becomes the standard.

sustrik commented 8 years ago

As for the patches, I can mere the one that inlines list/slist functions. The other one (jump buffer size) can be left unmerged until you feel happy about the whole stack switching business. Up to you.

raedwulf commented 8 years ago

It assumes that the error'ing code doesn't write 0xAA by accident. Admittedly, rare, but just the worse-case scenario. So the outcome would be false reports (not critical, but not useful if the stack still overflows and you don't know why). An example easily triggering this behaviour would be to say memcpy part of the 0xAA stack too early. Or simply have a variable that contains the byte 0xAA somewhere earlier on stack, will under-report the size.
Yes we can, the stkchk approach is to catch most cases (if not all).
Yes, although that gives a slight runtime overhead when not needing the stack checking. It is probable that this is not noticeable, however (considering that all the libdill/dsock functions are likely to be triggering syscalls).
It is possible to do with clang >= 3.4 and gcc >=4.6 but with the inconvenience of having to tag non-coroutine functions. Implementing a GCC/Clang plugin would bring it closer to reality but it does complicate the build process (you need to pass -fsplit-stack-explicit -fuse-ld=gold and the plugin loading flag... can't quite recall what it is) to make it work properly. Then we'd need to override __morestack and related functions to provide our own implementation. The main gains would be that we don't need a guard page & stkchk, the compiler does the work for us.

Yes, that sounds good with just the inlining - the jump buffer size and new changes will break the binary ABI so that would need updating. If we implement signal handling, that will also need a binary ABI bump to copy the check/signal. Part of the reason that it's super inefficient with sigsetjmp is that glibc overallocates for the signal mask when, say, the linux kernel only needs 8-bytes. For a quick googling, apparently they have such a big sigset_t size because of GNU Hurd (???).

sustrik commented 8 years ago

Ok, I see. We can also use a block of random data (say 251 bytes -- 251 is a prime!) instead of 0xAA. That would make it very unlikely to underrport the stack size by more than 2-3 bytes. I think this would be appreciated by embedded devs who are facing severe memory limitations and often have no memory overcommit and so they want to set their stack sizes by hand.
I think this could be a simple comparison of SP with a bottom-of-the-stack pointer stored in 'struct cr'. Unlikely to be very expensive given that all those calls are using struct cr anyway, so it'll be in L1 cache.

LoL HURD.

sustrik commented 8 years ago

Btw, can you please review the patch b6a9663 in the master? I've removed some unnecessary complexity there.

raedwulf commented 8 years ago

Commented on the commit :) It looks good.

I'm currently testing still, but it appears that GCC happily inlines coroutines without attribute((noinline)) and they still work! It appears that even for inline functions they automatically preserve the semantics of the call pretty well! Will test on clang and some more evil cases.

sustrik commented 8 years ago

What I've seen in the past (clang, -O2) is that function is inlined, then optimiser rearranges layout of local variables putting local variables of the coroutine before the filler VLA, thus effectively putting them on the stack of the parent coroutine.

raedwulf commented 8 years ago

As I'm doing some quite intrusive changes and will need to check that things don't break unexpectedly on different build configurations - I'm working on a new travis build script (generated with travisgen.c because travis has weird limitations if you try and let it try and build a matrix itself) - it looks very promising already as I'm burning electricity for free at some random data centre:

https://travis-ci.org/raedwulf/libdill/builds/170873214 https://gist.github.com/raedwulf/be27fbe98a22919381bfef8329a4dd46

sustrik commented 8 years ago

To speed it up, you can get rid of the "poll" dimension. It's enough to test poll once per OS. No need to do it for every microarchitecture, compiler etc.

raedwulf commented 8 years ago

I'm observing the same behaviour with clang as well unfortunately. We can either force the functions to be non-inlined by requiring the function to be noinline decorated or wrap the functions differently go(fn, param1, param2) where then we can enforce function pointer based calls or wrap the function in assembly entirely. noinline decorations do not change the abi... So it would only need to be performed with file-local functions or project-wide functions (with link time optimisation). In the long run, I think using the different parameterisation of go is more robust against compiler optimisation problems. Although it does mean I'll have to come up with a different interface to pass the custom allocator to coroutine creation.

raedwulf commented 8 years ago

#define goparams(...) (__VA_ARGS__);dill_epilogue();}}h;})
#define go(fn)({\
        sigjmp_buf *ctx;\
        int h = dill_prologue(&ctx, __FILE__, __LINE__);\
        if(h >= 0) {\
            if(!dill_setjmp(*ctx)) {\
                dill_setsp(hquery(h, NULL));\
                dill_unoptimisable2 = &fn;\
                ((__typeof__(fn) *)(dill_unoptimisable2))goparams

Actually, this works as well and gives the following syntax:

int h = go(worker)(p0,p1,p2);

sustrik commented 8 years ago

I don't think changing the syntax would get us much. On one hand it makes the library less intuitive and elegant, on the other hand the only thing it provides is just one more way to fight the compiler. That being said: 1.) We already have a way to fight the compiler with current syntax. 2.) A clean, long-term solution would be to add noinline-at-call-site attribute to the compiler.

raedwulf commented 8 years ago

The trouble with the current syntax is that there is no control over the function symbol in the macro because the function name and parameters are passed as a single token. If it were possible to capture the function symbol then it is simple to force the function to not be inlined at the callsite as a reference to the function's pointer can be obtained.

raedwulf commented 8 years ago

That being said GCC nested functions and Clang blocks may also help us achieve the same thing with the current syntax.

raedwulf commented 8 years ago

https://github.com/raedwulf/libdill/tree/optimized-limit

Some work in progress on implementing pluggable stack allocators; it hasn't been rebased on the latest HEAD version yet (and contains incomplete changes too).

The basic API works like this:

        int ap = apage(DILL_ALLOC_FLAGS_GUARD, 256 * 1024);
        int ac = acache(ap, DILL_ALLOC_FLAGS_DEFAULT, 256 * 1024, 64);
        setstk(ac);

That sets the default allocator, and mimics the behaviour of the old stack allocator.

Without the pool allocator, users can now allocate custom sized pools with apage and amalloc but not with acache and apool which are both fixed sized to their default allocation value. Example:

int am = amalloc(DILL_ALLOC_FLAGS_DEFAULT, 512); /* 512 is way too small normally */
int cr1 = go(worker(), am, 8192);

Currently, the default allocators are not cleaned up. My current thoughts are to:

if setstk is not called (or called with -1), the default allocator is managed by libdill and closed atexit
if setstk is called with another allocator, the default allocator is marked "to-be-freed" and is closed atexit
any allocator passed to setstk is managed by the user, so the user is responsible for closing them.

Some points in no particular order:

Performance-wise, I lost a little, gained a little - the overhead of adding pluggable allocators balances with the slight performance increase of using the memory pool allocator.
The main benefit of the pool allocator is that all the memory can be allocated before hand, although my implementation doesn't crash and burn if the pool space runs out - it extends the pool if required using the underlying/base allocator.
The code base is not hugely more complex; I removed stack.c because its functionality is now covered by apage.c amalloc.c acache.c.

sustrik commented 8 years ago

Yes, looks good.

An alternative to consider is passing an allocated buffer to go:

void *stk = malloc(1024);
int h = go(worker(), stk, 1024);
...
hclose(h);
free(stk);

You could do some interesting things with that kind of a system:

char stk[1024];
int h = go(worker(), stk, 1024);
...
hclose(h);

raedwulf commented 8 years ago

That makes it more flexible and I'll be able to regain the performance lost by having to query the allocator. I suppose having an explicit note in the documentation to warn off users passing a buffer allocated on the stack because the parent function can prematurely unwind.

However, for nested coroutines, the concurrency is structured... so shouldn't this work?

coroutine void parent() {
    char childstack[8192];
    go(child(), childstack, 8192);
}
coroutine void child() {
// do stuff on child stack
}

int main() {
    go(parent());
}

raedwulf commented 8 years ago

Another thought, I can support both, if you are willing to require apps to have C11 _Generic support to use the headers. It means client applications need to use GCC >= 4.9 and clang >= 3.3. The go macro would detect if the stack is a char/uint/void pointer and then pass that stack directly or if it is a 'int' it knows it's a handle to an allocator.

Half-support is possible for compilers below those listed versions, for example it could fallback to separating the macro into two different macros; Either having the allocator as default go and for the raw buffer version, gob (or some similar naming convention).

sustrik commented 8 years ago

Structured concurrency is done by hand. Thus, you need to hclose() the child before parent exits. If you do so, it should just work.

Re the second comment, I'd go for C-idiomatic API, i.e. no polymorphism. We can have go(), gostack(), etc.

sustrik commented 8 years ago

I've implemented go_stack(fd, ptr, len) on a branch here: https://github.com/sustrik/libdill/tree/stack

Would you like to give it a code review?

raedwulf commented 8 years ago

It looks good. One of the things that I did in my branch was that dill_prologue returned the top of the stack so that hquery did not need to be called - this saved quite a few cycles (because there's no virtual function call).

 /* Stack is supplied by the user.
           Align top of the stack to 8-byte boundary. */
        uintptr_t top = (uintptr_t)ptr;
        top += len;
        top &= ~(uintptr_t)7;
        stacksz = top - (uintptr_t)ptr;
        cr = (struct dill_cr*)top;

We should align to 16-bytes because on x86-64, clang sometimes generates aligned SIMD instructions for moving, which only work on 16-byte alignment. It makes that assumption for stacks.

sustrik commented 8 years ago

16-byte alignment fixed.

As for hquery optimisation, feel free to send a pull request my way. Also any other stuff you would like to get merged before next release.

raedwulf commented 8 years ago

Sure :) I think I'll have some time tomorrow to get it done :)

sustrik commented 7 years ago

Btw, I've spoke with a person from C++ standardization committee yesterday. He mentioned there's a coroutine proposal in works, still in quite an early stage of development. What about speaking to them about supporting go() construct in the standard? It would make all that compiler-fighting headache go away. (C standard would be better of course, but C committee seems to be much more conservartive that C++ one. Also, having it in C++ would make arguing for adding it to C easier.)

raedwulf commented 7 years ago

Interesting! I just noticed there was also a Boost.Coroutine. Btw check http://github.com/raedwulf/libdill/tree/trampoline where I have a different implementation of coroutines that does not require contaminating libdill.h with assembler. It has better semantics - go2 evaluates to two function calls; one prepares for a coroutine and the second performs the stack-switching and jumps. The downside is that it doesn't have the same syntax as our inline go, but it does have the advantage that the parameters are evaluated in the same order as a normal function call and does not rely on having inline assembly to behave correctly in the presence of compiler optimisations. The complex macros there are because the C standard does not support the declaration void functioncall(...); i.e. no arguments to a variable-argument C function, so I use macros to emulate this.

raedwulf commented 7 years ago

I'll do some research, because Boost.Coroutine exists, and chances are they would likely have some links with the standards committee.

raedwulf commented 7 years ago

I will need to test this against our implementation, but this seems fast for such old hardware!!! http://www.boost.org/doc/libs/1_62_0/libs/coroutine/doc/html/coroutine/performance.html

By Boost terminology, libdill has symmetric coroutines... I'm assuming asymmetric coroutines are those that appear in something like python.

This is probably the current proposal: https://github.com/CppCon/CppCon2016/blob/master/Presentations/C%2B%2B%20Coroutines%20-%20Under%20The%20Covers/C%2B%2B%20Coroutines%20-%20Under%20The%20Covers%20-%20Gor%20Nishanov%20-%20CppCon%202016.pdf

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0057r5.pdf

It resembles our coroutines remarkably!

sustrik commented 7 years ago

Ok, they seem to have all the pieces in place. No need to interfere.

As for the trampolines: sane arg evaluation, yay! different syntax, boo! It's a hard trade-off but in the end I would rather go with nice syntax and treat the arg problem as an implementation detail that we may be able to solve in the future (e.g. patching the compilers or the standard).

raedwulf commented 7 years ago

I have a few thoughts here:

Yep the standards committee seem to be on top of it. Boost's version is different but follows even more closely with libdill's style implementation (though ours is much simpler... or it might just be C++ syntax giving me a headache).
Performance-wise, I though Boost.Coroutine had magically made a super-fast implementation but I seemed to have been comparing apples to oranges because I found that it did not have a scheduler implemented in the coroutine library. In fact, they have it implemented in Boost.Fiber which appears much slower than our implementation (it builds on Boost.Coroutine2 and Boost.Context). http://www.boost.org/doc/libs/1_62_0/libs/fiber/doc/html/fiber/performance.html
Boost.Fiber has multi-threading support! A different approach to mine, but their performance differences between thread-safe and non-thread-safe versions makes it very promising that there'll be limited impact on our super-fast speeds.
I'll update #39 with my progress, because the trampoline support is somewhat related.

sustrik commented 7 years ago

Btw, I'll cut a release today. A lot of stuff have accumulated that should go out.

raedwulf commented 7 years ago

That sounds good, the stack allocation code can wait as it's possible to do it manually for now using go_stack.

raedwulf commented 7 years ago

Side note: the syntax for go can be preserved if we wrap the function declaration for coroutines... through a very ugly syntax :(

coroutine(worker, int, param0, int, param1) {
/* do things */
}

which would expand to:

__attribute__((noinline)) void __dill_worker(int param0, int param1) {/* do things */}
int worker(void *stk, int len, const char * file, int len, int param0, int param1) {
    int h = dill_prepare(__dill_worker, stk, len, file, len);
    if(h >= 0) dill_trampoline(param0, param1);
    return h;
}

So, now to launch a coroutine, all you have to do is (much much nicer syntax!):

worker(123);

And this has the same guarantees as my current macros (in fact it uses the same implementation, more or less)

Edit: with a bit more work, you can also achieve better syntax using:

coroutine((worker)((int) param0, (int) param1)) {
/* do things */
}

raedwulf commented 7 years ago

Mmm hangon... I might be able to make go(worker(param0, param1)) work still with dill_trampoline... I'll get back to you on this.

sustrik commented 7 years ago

Ok :)

But I'll make the release anyway. If it works it can get to the next one.

sustrik commented 7 years ago

Btw, the rule of thumb I am using is: Any technical problems we are experiencing today (compiler limitations etc.) will be gone in 10 years. However, any ugliness we add to the API will be with us forever. Thus, solving a technical limitation by uglifying the API is seldom worth it.

As for the API, the one we have now (go(fn(a,b,c))) is pretty close to ideal. The only remaining wart is the coroutine specifier. The good news is that's once the inlining problem is solved in some way we can defined coroutine to expand to nothing and carry on being backward compatible. Also note that this is the reasoning behind coroutines having return type even though it can't be used for anything -- once coroutine specifier is gone, there will be no difference between functions and coroutines.

raedwulf commented 7 years ago

Yep that does makes sense.

The worry I'm having with thread support and half the battle against the compiler at the moment is the fact that the coroutine is partially executed in the context of the caller (i.e. the parameter passing). If there's some way to extract the function token (that's all we need) from the parameters, all these issues go away.

The best I've come up with that allows us to extract the function name token is:

go((worker)(param0, param1));

When it comes to threads, it becomes a bit more difficult if you want to spawn using the same syntax because I think it could lead to race conditions if parameters are still being loaded from the caller's context.

In summary, what I'd like to have is the ability to split the function name worker from the parameters (param0, param1) so that I can do the parameter loading and context switching outside the caller's context.

raedwulf commented 7 years ago

EDIT: actually you said exactly this before in a different discussion! Sorry for repeating.

\ From older discussion on closures ** Another way of thinking about it is, the correct order for calling a function in go is:

evaluate function arguments
switch to new coroutine stack
coroutine function call

At the moment we do:

switch to new coroutine stack
evaluate function arguments (dangerous because we rely on the compiler to use the stack frame pointer to access the arguments and not the stack pointer itself)
coroutine function call

And we currently have no control over 2 and 3. (in the second list).

Note: There's no reason that the compiler should use the stack frame. We coax the compiler into using it by giving it challenges like VLAs, alloca, or in one of my branches I use __builtin_frame_address(0) which are difficult to implement without a stack frame. It doesn't force the compiler to always use the stack frame, it is just that the locality of using these features means that the compiler is more likely to make use of the frame pointer instead of the stack pointer in that instance.

sustrik commented 7 years ago

Ok, understood. So here's a question: Assuming that we can separate the function name from arguments we still have to store the arg values somewhere before we switch the new stack. How are we going to do that?

raedwulf commented 7 years ago

On x86-64, I simply pass the arguments as registers to the context-switched function. On stack-based argument passing systems (i386), the function takes variable arguments, so the arguments can be popped off the stack and pushed onto the new stack before the coroutine is called (which would be done in dill_trampoline).

sustrik commented 7 years ago

In the former case, does that mean we can use only functions with up to N args as coroutines?

raedwulf commented 7 years ago

Currently 6 on x86-64 - but after the register arguments are exhausted, it gets pushed onto the stack. I can detect that case easily and follow the same pattern as stack-based systems.

It requires support from dill_prepare to record the stack pointer before dill_trampoline and then copy that number of stack bytes to the new stack. Most the time on x86-64 this will be 0.

With compiler support, I guess that arguments could be evaluated and written directly to the new stack.

sustrik commented 7 years ago

Ok, let me summarize:

Forcing of the frame could be, in theory, done via a new compiler option.
Arg evaluation re-ordering could be done by compiler, but it would work only if compiler would implement go() syntax wholesale, with stack allocations etc.
If we use trampolines we can solve both problems in a single go with no help from the compiler.
Trampolines must be implemented for each architecture separately. For unsupported architectures we would still need a fallback that would probably fail to do arg evaluation properly.
We are able to handle functions with arbitrary number of arguments using trampolines.
There's no way to do trampolines using current go() syntax.

Q: What about go(fn)(args) syntax? Would that be a possibility?

raedwulf commented 7 years ago

Yes the summary seems correct.

Yes, go(fn)(args) is possible but makes the macro a little bit fragile (only if the user makes a syntax error):

Current macro:

#define go2(fn, ...) \
    ({\
        int h = dill_prepare(fn, NULL, 0, __FILE__, __LINE__);\
        if(h >= 0) dill_trampoline(__VA_ARGS__);\
        h;\
    })

New macro:

#define go2(fn) dill_prepare(fn, NULL, 0, __FILE__, __LINE__); dill_trampoline
//.....
int h = go2(fn)(args);

The macro would expand to:

int h = dill_prepare(fn, NULL, 0, __FILE__, __LINE__);
dill_trampoline(args);

h gets value from dill_prepare; dill_prepare stores handle in internal structure so that the check for a valid handle is performed within libdill.

Note: dill_prepare is just a wrapper round dill_prologue that stores the return values into a structure.

sustrik commented 7 years ago

one more question: would trampolines help us to get rid of thr coroutine specifier?

raedwulf commented 7 years ago

Yep, it can't inline the coroutine when it gets passed as a function pointer, so the specifier coroutine is not needed.

raedwulf commented 7 years ago

In reply to 1. earlier

Forcing of the frame could be, in theory, done via a new compiler option.

More specifically, evaluating N arguments and pushing them on the new stack minus some constant C. A copy incurs at most n extra reads and n writes to the stack, where n = N - min(N, C).

Platform	C
i386	0
i386 fast-call	2
x86-64	6
ppc32/64	8
ARM	4
ARM64	8
MIPS	8

So coroutines will be slower on i386-architectures and with no-noticeable differences on ARM/PowerPC where arguments are also passed mainly as registers. On i386, there is __attribute__((fastcall)) which could be marked on a coroutine (passing the first two arguments in registers), but the library would have to be told about it in order to be able to use such functions as coroutines.

raedwulf commented 7 years ago

Also there's one restriction with dill_trampoline - the first argument cannot be a struct or floating point value passed by value i.e. it must be an integral type (anything that can be cast to void * or size_t). Floating-point values can be supported; custom structs (by-value) are not possible, unfortunately. Note that it only ever affects the first argument passed. The rest are free from restriction.

The reason for this is ISO C's variadic functions must have at least one fixed argument. Unfortunately, that means we can't let the compiler figure out what the first argument is nor predict what type the user might use for the first argument.

EDIT: in my initial version, I reused the first parameter to pass information about the coroutine to the coroutine itself.

sustrik / libdill

Libdill performance analysis (and muddled thoughts) #30

Forward

Intro

Some micro-bench results

The Problem(s)

Signal Handling

Coroutine Limits In Performance

What happens right now

What can happen?

Memory Pool Support

Segmented stack support (controversial)

The Problem

Quick Primer

The Problems Go and Rust have found

And how we can handle them

Implementation Issues

Using -fsplit-stack

Recommended solution (stack switching and not stack segmenting!)