Wow. Impressive.
Some comments, in no particular order:
a = myallocator(...)
go(foo(), a);
Some allocators could be exported as global variables:
go(foo(), default_stack_allocator);
Quick replies following your numbering:
#define coroutine __attribute__((split_stack))
rather than having the opposite problem of declaring non-coroutines everywhere. The alternative I had in mind was having the user create a program that compiles -fsplit-stack on a C file with coroutines and omits that flag otherwise. In either case, it's burdensome, and hence compiler support would be desired.
(proc is for fork). We could register a user-thread-specific signal handler that is called by the scheduler.
go(worker(), memalign_allocator(dstkguard, 8192));
The coroutine specifier basically forces the function to be non-inlined. If I can force that same thing to occur at the call-site (I suspect with the C rules of volatile and asm), then the coroutine specifier can actually be dropped completely. That being said, calling the function through its pointer will also work (although, again, I will need to check the performance implications of that).
Interestingly, my initial approach to stack segments was to manually insert a new macro, stkchk(), but I shied away when thinking about external functions - with the scratch buffer, though, stkchk() becomes a viable approach to protecting the stack within coroutine functions.
In short, libdill-based coroutines do not normally grow their stack excessively before being context-switched out. The only cases are:
Adding stkchk() to each one of these cases would help prevent/detect an overflow. The only ones that make an eyesore are the VLAs. If we assume that we cover all of 1, 2 and 3, only code local to the coroutine can trigger case 4 to overflow the stack, provided none of the other calls are present (making it exceedingly rare and obvious to the programmer).
volatile int somevalue = 123;

coroutine void helloworld() {
    int bigbuffer[1024];
    stkchkbv(dostuffonsamebuffer()); /* need to check the stack before and after the call */
    int stuff = stkchkb(dostuffonsamebufferreturnint()); /* same but returns a value */
    scratch(externalfunction()); /* scratch buffer avoids the need completely */
    void *dynamem = dalloca(1024); /* we wrap alloca so it checks the stack */
    int vlarray[somevalue];
    stkchk(); /* you can also use stkchkbv() but that does two checks back to back */
}
Sadly, due to the semantics of C's typeof and GNU's __auto_type, we can't detect void functions in macros (apart from a compiler error).
That being said, a user can make these look a lot nicer by wrapping everything in their own macros if they want to.
volatile int somevalue = 123;

#define safe_dostuffonsamebuffer() stkchkbv(dostuffonsamebuffer)
#define scratch_externalfunction(a) scratch(externalfunction(a))
#define safe_dostuffonsamebufferreturnint() stkchkb(dostuffonsamebufferreturnint)
#define vla(t,n,s) t n[s]; stkchk();

coroutine void helloworld() {
    int bigbuffer[1024];
    safe_dostuffonsamebuffer(); /* need to check the stack before and after the call */
    int stuff = safe_dostuffonsamebufferreturnint(); /* same but returns a value */
    scratch_externalfunction(a); /* scratch buffer avoids the need completely */
    void *dynamem = dalloca(1024); /* we wrap alloca so it checks the stack */
    vla(int, vlarray, somevalue);
}
As for the patches, I can merge the one that inlines the list/slist functions. The other one (jump buffer size) can be left unmerged until you feel happy about the whole stack-switching business. Up to you.
Yes, that sounds good with just the inlining - the jump buffer size and the new changes will break the binary ABI, so that would need updating. If we implement signal handling, that will also need a binary ABI bump to copy the check/signal. Part of the reason it's super inefficient with sigsetjmp is that glibc overallocates for the signal mask when, say, the Linux kernel only needs 8 bytes. From a quick googling, apparently they have such a big sigset_t because of GNU Hurd (???).
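A tiny standalone check (not libdill code) makes the overallocation visible - glibc reserves 1024 bits for sigset_t while Linux's rt_sigprocmask only touches 64:

#include <signal.h>
#include <stdio.h>

int main(void) {
    /* glibc's sigset_t is 128 bytes (1024 bits); the Linux kernel's
       rt_sigprocmask syscall only reads/writes 8 bytes (64 signals). */
    printf("sizeof(sigset_t) = %zu bytes\n", sizeof(sigset_t));
    return 0;
}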
LoL HURD.
Btw, can you please review patch b6a9663 in master? I've removed some unnecessary complexity there.
Commented on the commit :) It looks good.
I'm still testing, but it appears that GCC happily inlines coroutines without __attribute__((noinline)) and they still work! It appears that even for inlined functions it automatically preserves the semantics of the call pretty well! Will test on clang and some more evil cases.
What I've seen in the past (clang, -O2) is that the function is inlined, and the optimiser then rearranges the layout of local variables, putting the coroutine's locals before the filler VLA and thus effectively placing them on the stack of the parent coroutine.
As I'm making some quite intrusive changes and will need to check that things don't break unexpectedly on different build configurations, I'm working on a new travis build script (generated with travisgen.c, because travis has weird limitations if you let it build a matrix itself). It looks very promising already, as I'm burning electricity for free at some random data centre:
https://travis-ci.org/raedwulf/libdill/builds/170873214 https://gist.github.com/raedwulf/be27fbe98a22919381bfef8329a4dd46
To speed it up, you can get rid of the "poll" dimension. It's enough to test poll once per OS. No need to do it for every microarchitecture, compiler etc.
I'm observing the same behaviour with clang as well, unfortunately. We can either force the functions to be non-inlined by requiring them to be decorated noinline, or wrap the functions differently, go(fn, param1, param2), where we can then enforce function-pointer-based calls or wrap the function in assembly entirely. noinline decorations do not change the ABI, so they would only need to be applied to file-local functions or project-wide functions (with link-time optimisation). In the long run, I think the different parameterisation of go is more robust against compiler optimisation problems, although it does mean I'll have to come up with a different interface for passing the custom allocator at coroutine creation.
#define goparams(...) (__VA_ARGS__);dill_epilogue();}}h;})
#define go(fn)({\
sigjmp_buf *ctx;\
int h = dill_prologue(&ctx, __FILE__, __LINE__);\
if(h >= 0) {\
if(!dill_setjmp(*ctx)) {\
dill_setsp(hquery(h, NULL));\
dill_unoptimisable2 = &fn;\
((__typeof__(fn) *)(dill_unoptimisable2))goparams
Actually, this works as well and gives the following syntax:
int h = go(worker)(p0,p1,p2);
I don't think changing the syntax would gain us much. On one hand it makes the library less intuitive and elegant; on the other hand the only thing it provides is one more way to fight the compiler. That being said: 1.) We already have a way to fight the compiler with the current syntax. 2.) A clean, long-term solution would be to add a noinline-at-call-site attribute to the compiler.
The trouble with the current syntax is that there is no control over the function symbol in the macro, because the function name and parameters are passed as a single token. If it were possible to capture the function symbol, it would be simple to force the function not to be inlined at the call site, since a reference to the function's pointer can be obtained.
That being said GCC nested functions and Clang blocks may also help us achieve the same thing with the current syntax.
https://github.com/raedwulf/libdill/tree/optimized-limit
Some work in progress on implementing pluggable stack allocators; it hasn't been rebased on the latest HEAD version yet (and contains incomplete changes too).
The basic API works like this:
int ap = apage(DILL_ALLOC_FLAGS_GUARD, 256 * 1024);
int ac = acache(ap, DILL_ALLOC_FLAGS_DEFAULT, 256 * 1024, 64);
setstk(ac);
That sets the default allocator, and mimics the behaviour of the old stack allocator.
Without the pool allocator, users can now allocate custom-sized pools with apage and amalloc, but not with acache and apool, which are both fixed to their default allocation size. Example:
int am = amalloc(DILL_ALLOC_FLAGS_DEFAULT, 512); /* 512 is way too small normally */
int cr1 = go(worker(), am, 8192);
Currently, the default allocators are not cleaned up. My current thoughts are:

- If setstk is not called (or called with -1), the default allocator is managed by libdill and closed atexit.
- If setstk is called with another allocator, the default allocator is marked "to-be-freed" and is closed atexit.
- Any allocator passed to setstk is managed by the user, so the user is responsible for closing them.

Some points in no particular order:

- stack.c, because its functionality is now covered by apage.c, amalloc.c and acache.c.

Yes, looks good.
An alternative to consider is passing an allocated buffer to go:
void *stk = malloc(1024);
int h = go(worker(), stk, 1024);
...
hclose(h);
free(stk);
You could do some interesting things with that kind of a system:
char stk[1024];
int h = go(worker(), stk, 1024);
...
hclose(h);
That makes it more flexible, and I'll be able to regain the performance lost by having to query the allocator. I suppose an explicit note in the documentation should warn users off passing a buffer allocated on the stack, because the parent function can unwind prematurely.
However, for nested coroutines, the concurrency is structured... so shouldn't this work?
coroutine void parent() {
    char childstack[8192];
    go(child(), childstack, 8192);
}

coroutine void child() {
    // do stuff on the child stack
}

int main() {
    go(parent());
}
Another thought: I can support both, if you are willing to require apps to have C11 _Generic support to use the headers. It means client applications need to use GCC >= 4.9 or clang >= 3.3. The go macro would detect whether the stack argument is a char/uint/void pointer, in which case it passes that stack directly, or whether it is an int, in which case it knows it's a handle to an allocator.
Half-support is possible for compilers below those listed versions; for example, it could fall back to splitting the macro into two different macros - the allocator version as the default go, and gob (or some similar naming convention) for the raw-buffer version.
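For illustration, a minimal self-contained sketch of that type dispatch (the launch_with_* helpers are hypothetical names, not libdill API):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical launch paths; in libdill these would end up in the go machinery. */
static int launch_with_allocator(int allocator, size_t stklen) {
    printf("allocator handle %d, %zu-byte stack\n", allocator, stklen);
    return 0;
}
static int launch_with_buffer(void *stk, size_t stklen) {
    printf("user buffer %p, %zu bytes\n", stk, stklen);
    return 0;
}

/* An int selects the allocator path; any pointer (including a decayed array)
   falls through to the raw-buffer path. */
#define launch(stk, len) \
    _Generic((stk), \
        int: launch_with_allocator, \
        default: launch_with_buffer)((stk), (len))

int main(void) {
    char buf[1024];
    int handle = 42;
    launch(buf, sizeof(buf)); /* raw-buffer path */
    launch(handle, 8192);     /* allocator-handle path */
    return 0;
}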
Structured concurrency is done by hand. Thus, you need to hclose() the child before parent exits. If you do so, it should just work.
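To make that concrete, a minimal sketch of the by-hand version, extending the nested example above (hypothetical buffer-passing syntax, per the proposal in this thread):

coroutine void child(void) {
    /* do stuff on the child's stack */
}

coroutine void parent(void) {
    char childstack[8192];
    int h = go(child(), childstack, sizeof(childstack));
    if(h < 0) return;
    /* ... do work ... */
    hclose(h); /* cancel the child before childstack goes out of scope */
}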
Re the second comment, I'd go for C-idiomatic API, i.e. no polymorphism. We can have go(), gostack(), etc.
I've implemented go_stack(fd, ptr, len) on a branch here: https://github.com/sustrik/libdill/tree/stack
Would you like to give it a code review?
It looks good. One of the things I did in my branch was that dill_prologue returned the top of the stack, so that hquery did not need to be called - this saved quite a few cycles (because there's no virtual function call).
/* Stack is supplied by the user.
Align top of the stack to 8-byte boundary. */
uintptr_t top = (uintptr_t)ptr;
top += len;
top &= ~(uintptr_t)7;
stacksz = top - (uintptr_t)ptr;
cr = (struct dill_cr*)top;
We should align to 16 bytes, because on x86-64 clang sometimes generates aligned SIMD instructions for moves, which only work with 16-byte alignment. It makes that assumption for stacks.
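For reference, a minimal sketch of that adjustment - only the mask needs to widen:

uintptr_t top = (uintptr_t)ptr;
top += len;
top &= ~(uintptr_t)15; /* align down to a 16-byte boundary */
stacksz = top - (uintptr_t)ptr;
cr = (struct dill_cr*)top;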
16-byte alignment fixed.
As for hquery optimisation, feel free to send a pull request my way. Also any other stuff you would like to get merged before next release.
Sure :) I think I'll have some time tomorrow to get it done :)
Btw, I spoke with a person from the C++ standardization committee yesterday. He mentioned there's a coroutine proposal in the works, still at quite an early stage of development. What about speaking to them about supporting the go() construct in the standard? It would make all that compiler-fighting headache go away. (The C standard would be better of course, but the C committee seems to be much more conservative than the C++ one. Also, having it in C++ would make arguing for adding it to C easier.)
Interesting! I just noticed there is also Boost.Coroutine. Btw, check http://github.com/raedwulf/libdill/tree/trampoline where I have a different implementation of coroutines that does not require contaminating libdill.h with assembler. It has better semantics - go2 evaluates to two function calls; one prepares the coroutine and the second performs the stack switch and jumps. The downside is that it doesn't have the same syntax as our inline go, but it does have the advantage that the parameters are evaluated in the same order as a normal function call, and it does not rely on inline assembly behaving correctly in the presence of compiler optimisations. The complex macros there exist because the C standard does not support the declaration void functioncall(...); i.e. a variadic C function with no named arguments, so I use macros to emulate this.
I'll do some research, because Boost.Coroutine exists, and chances are they would likely have some links with the standards committee.
I will need to test this against our implementation, but this seems fast for such old hardware!!! http://www.boost.org/doc/libs/1_62_0/libs/coroutine/doc/html/coroutine/performance.html
By Boost terminology, libdill has symmetric coroutines... I'm assuming asymmetric coroutines are those that appear in something like Python.
This is probably the current proposal: https://github.com/CppCon/CppCon2016/blob/master/Presentations/C%2B%2B%20Coroutines%20-%20Under%20The%20Covers/C%2B%2B%20Coroutines%20-%20Under%20The%20Covers%20-%20Gor%20Nishanov%20-%20CppCon%202016.pdf
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0057r5.pdf
It resembles our coroutines remarkably!
Ok, they seem to have all the pieces in place. No need to interfere.
As for the trampolines: sane arg evaluation, yay! different syntax, boo! It's a hard trade-off but in the end I would rather go with nice syntax and treat the arg problem as an implementation detail that we may be able to solve in the future (e.g. patching the compilers or the standard).
I have a few thoughts here:
Btw, I'll cut a release today. A lot of stuff has accumulated that should go out.
That sounds good; the stack allocation code can wait, as it's possible to do it manually for now using go_stack.
Side note: the syntax for go can be preserved if we wrap the function declaration for coroutines... through a very ugly syntax :(
coroutine(worker, int, param0, int, param1) {
    /* do things */
}

which would expand to:

__attribute__((noinline)) void __dill_worker(int param0, int param1) {/* do things */}

int worker(void *stk, int len, const char *file, int line, int param0, int param1) {
    int h = dill_prepare(__dill_worker, stk, len, file, line);
    if(h >= 0) dill_trampoline(param0, param1);
    return h;
}
So, now to launch a coroutine, all you have to do is (much much nicer syntax!):
worker(123);
And this has the same guarantees as my current macros (in fact it uses the same implementation, more or less)
Edit: with a bit more work, you can also achieve better syntax using:
coroutine((worker)((int) param0, (int) param1)) {
    /* do things */
}
Mmm, hang on... I might be able to make go(worker(param0, param1)) still work with dill_trampoline... I'll get back to you on this.
Ok :)
But I'll make the release anyway. If it works it can get to the next one.
Btw, the rule of thumb I am using is: Any technical problems we are experiencing today (compiler limitations etc.) will be gone in 10 years. However, any ugliness we add to the API will be with us forever. Thus, solving a technical limitation by uglifying the API is seldom worth it.
As for the API, the one we have now (go(fn(a,b,c))) is pretty close to ideal. The only remaining wart is the coroutine specifier. The good news is that once the inlining problem is solved in some way, we can define coroutine to expand to nothing and carry on being backward compatible. Also note that this is the reasoning behind coroutines having a return type even though it can't be used for anything - once the coroutine specifier is gone, there will be no difference between functions and coroutines.
Yep, that does make sense.
The worry I'm having with thread support, and half the battle against the compiler at the moment, is the fact that the coroutine is partially executed in the context of the caller (i.e. the parameter passing). If there's some way to extract the function token (that's all we need) from the parameters, all these issues go away.
The best I've come up with that allows us to extract the function name token is:
go((worker)(param0, param1));
When it comes to threads, it becomes a bit more difficult if you want to spawn using the same syntax because I think it could lead to race conditions if parameters are still being loaded from the caller's context.
In summary, what I'd like to have is the ability to split the function name worker from the parameters (param0, param1), so that I can do the parameter loading and context switching outside the caller's context.
EDIT: actually you said exactly this before in a different discussion! Sorry for repeating.
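For what it's worth, a small self-contained sketch of how the (worker)(param0, param1) form can be split with standard macro tricks (GO_NAME/GO_ARGS and friends are hypothetical helper names, not libdill API):

#include <stdio.h>

/* The parenthesised name "calls" a helper macro during rescanning, which
   either keeps the name and eats the argument list, or drops the name and
   leaves the argument list behind. */
#define GO_EAT_(...)
#define GO_NAME_HELPER_(fn) fn GO_EAT_
#define GO_NAME(x) GO_NAME_HELPER_ x /* (worker)(1, 2) -> worker */
#define GO_ARGS_HELPER_(fn)
#define GO_ARGS(x) GO_ARGS_HELPER_ x /* (worker)(1, 2) -> (1, 2) */

static void worker(int a, int b) { printf("%d %d\n", a, b); }

int main(void) {
    /* The function token alone - enough to take its address... */
    void (*fp)(int, int) = &GO_NAME((worker)(1, 2));
    /* ...and the argument list alone, forwarded as an ordinary call here. */
    fp GO_ARGS((worker)(1, 2));
    return 0;
}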
** From older discussion on closures **
Another way of thinking about it: the correct order for calling a function in go is:

At the moment we do:

And we currently have no control over 2 and 3 (in the second list).
Note: there's no reason that the compiler should use the stack frame. We coax the compiler into using it by giving it challenges like VLAs, alloca, or, in one of my branches, __builtin_frame_address(0), which are difficult to implement without a stack frame. It doesn't force the compiler to always use the stack frame; it is just that the locality of using these features means the compiler is more likely to use the frame pointer instead of the stack pointer in that instance.
Ok, understood. So here's a question: assuming that we can separate the function name from the arguments, we still have to store the argument values somewhere before we switch to the new stack. How are we going to do that?
On x86-64, I simply pass the arguments in registers to the context-switched function.
On stack-based argument-passing systems (i386), the function takes variable arguments, so the arguments can be popped off the caller's stack and pushed onto the new stack before the coroutine is called (which would be done in dill_trampoline).
In the former case, does that mean we can use only functions with up to N args as coroutines?
Currently 6 on x86-64 - but after the register arguments are exhausted, the rest get pushed onto the stack. I can detect that case easily and follow the same pattern as the stack-based systems. It requires support from dill_prepare to record the stack pointer before dill_trampoline, and then copy that number of stack bytes to the new stack. Most of the time on x86-64 this will be 0.
With compiler support, I guess that arguments could be evaluated and written directly to the new stack.
Ok, let me summarize:
Q: What about go(fn)(args) syntax? Would that be a possibility?
Yes the summary seems correct.
Yes, go(fn)(args) is possible, but it makes the macro a little bit fragile (only if the user makes a syntax error):
Current macro:
#define go2(fn, ...) \
({\
    int h = dill_prepare(fn, NULL, 0, __FILE__, __LINE__);\
    if(h >= 0) dill_trampoline(__VA_ARGS__);\
    h;\
})
New macro:
#define go2(fn) dill_prepare(fn, NULL, 0, __FILE__, __LINE__); dill_trampoline
//.....
int h = go2(fn)(args);
The macro would expand to:
int h = dill_prepare(fn, NULL, 0, __FILE__, __LINE__);
dill_trampoline(args);
h gets its value from dill_prepare; dill_prepare stores the handle in an internal structure so that the check for a valid handle is performed within libdill.
Note: dill_prepare is just a wrapper around dill_prologue that stores the return values into a structure.
One more question: would trampolines help us get rid of the coroutine specifier?
Yep - the compiler can't inline the coroutine when it is passed as a function pointer, so the coroutine specifier is not needed.
In reply to 1. earlier - forcing the frame could, in theory, be done via a new compiler option.
More specifically, evaluating N arguments and pushing them onto the new stack, minus some constant C. A copy incurs at most n extra reads and n writes to the stack, where n = N - min(N, C). (For example, on x86-64 with C = 6, an 8-argument coroutine would incur n = 2 extra reads and writes.)
Platform       | C
---------------|---
i386           | 0
i386 fast-call | 2
x86-64         | 6
ppc32/64       | 8
ARM            | 4
ARM64          | 8
MIPS           | 8
So coroutines will be slower on i386 architectures, with no noticeable difference on ARM/PowerPC, where arguments are also passed mainly in registers. On i386 there is __attribute__((fastcall)), which could be put on a coroutine (passing the first two arguments in registers), but the library would have to be told about it in order to be able to use such functions as coroutines.
Also, there's one restriction with dill_trampoline - the first argument cannot be a struct or floating-point value passed by value, i.e. it must be an integral type (anything that can be cast to void * or size_t).
Floating-point values can be supported; custom structs (by-value) are not possible, unfortunately. Note that this only ever affects the first argument passed; the rest are free of the restriction.
The reason for this is that ISO C's variadic functions must have at least one fixed argument. Unfortunately, that means we can't let the compiler figure out what the first argument is, nor predict what type the user might use for it.
EDIT: in my initial version, I reused the first parameter to pass information about the coroutine to the coroutine itself.
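To illustrate the constraint, a hedged sketch (not libdill's actual trampoline signature):

#include <stdarg.h>

/* ISO C requires at least one named parameter before the ellipsis, so the
   trampoline has to commit to one concrete type for the first argument. */
int trampoline_sketch(void *first, ...) {
    va_list ap;
    va_start(ap, first); /* va_start must be anchored on a named parameter */
    /* ... switch to the coroutine stack, then forward 'first' and the
       remaining arguments to the coroutine ... */
    va_end(ap);
    return 0;
}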
Foreword
I've been working on this for 3-4 days, so this text might be slightly scrambled, as I've been reworking/researching/implementing different parts of libdill. I initially looked into implementing segmented stacks, but have come up with a faster, more refined approach: simply switching stacks.
Intro
What started with some minor tweaks to the libdill code base ended up being a significant optimisation effort within libdill's core coroutine code. My initial non-intrusive changes are available here: https://github.com/raedwulf/libdill/tree/optimized-limit
Martin, this can be merged if you want, although it might be best to wait until I've actually completed the rest of my changes.
Some micro-bench results
On an i7 4770k:
More noticeable on an old Core 2 Duo:
This is probably the limit of optimisation through inlining (I shaved off one more nanosecond with an intrusive change on my i7 4770k, but it isn't worth the effort/loss of code clarity to keep).
As a comparison, I fired up Go and scraped together some direct translations of our perf programs: https://gist.github.com/raedwulf/e6bbb76a3becb5fa14eb27e670260b1b https://gist.github.com/raedwulf/e11f7bf6a25eabad3ed3e439f14b262d https://gist.github.com/raedwulf/4a2c6c5b6209c0dd839fb95ec46378aa
Libdill fares far better than Go (on the i7 4770k), where context switches take 331 ns and creation/termination 237 ns. gccgo is even worse: 825 ns context switching and 3652 ns creation/termination... OUCH.
As I have very little experience with Go, check those gists in case I've actually missed something that causes such poor performance!
The Problem(s)
Signal Handling
There is no clear way to deal with signals within libdill. The signal mask belongs to the original context of the process, and if it is changed within a coroutine, the change applies everywhere. Furthermore, the signal handler for a trapped signal will be executed in the context of a coroutine. This may be an issue for future, more complex applications.
This may also be the reason why libmill/dill performs so well vs Go! To preserve the signal masks across different coroutines, the SYS_rt_sigprocmask syscall needs to be made. This kills performance - to try it yourself, comment out libdill's assembly routines and use sigsetjmp(ctx, 1).
To resolve this problem, I initially thought of optimising the context-switching mechanism so that the signal mask is only set when it differs between coroutines (using cached values). In an implementation, that would involve wrapping each of the sigproc* calls with dill equivalents (like proc for fork). The pathological case would be thrashing between two or more coroutines that have different signal masks. Since signal handling, if anywhere, should probably happen in the same context as the scheduler, and control always returns there, we still have a problem of thrashing.
The best tradeoff between semantics and performance I've come up with is to handle signals in two different ways. Signals that need to be handled immediately (e.g. SIGSEGV, SIGKILL, etc.) are handled in all contexts (thus the signal mask is the same in the scheduler and the coroutines), and signals that are used to communicate behaviour (SIGINT, SIGUSR1, SIGUSR2, etc.) are handled in dill_wait using epoll_pwait or kqueue's EVFILT_SIGNAL.
From there, it's a design choice whether to simulate a read/write-based interface for signals, implement a signal-handler callback within the coroutine, and/or emulate sigpending, sigwait and sigtimedwait.
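A minimal sketch of the dill_wait idea (illustrative, not libdill code): the communication signals stay blocked while coroutines run and are unblocked atomically only for the duration of the wait, so they are always delivered in the scheduler's context.

#include <signal.h>
#include <sys/epoll.h>

int wait_for_events(int epfd, struct epoll_event *evs, int maxevs, int timeout) {
    sigset_t waitmask;
    sigprocmask(SIG_SETMASK, NULL, &waitmask); /* query the current (blocking) mask */
    sigdelset(&waitmask, SIGINT);  /* deliverable only while we are waiting */
    sigdelset(&waitmask, SIGUSR1);
    sigdelset(&waitmask, SIGUSR2);
    return epoll_pwait(epfd, evs, maxevs, timeout, &waitmask);
}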
Coroutine Limits In Performance
What happens right now
However, in digging, one of the limitations of libmill/dill is that the system runs out of mmaps at 32k coroutines - whispers is a perfect example. For networking servers with many connections this can be an issue, although you can adjust the number of mmaps using:
sysctl -w vm.max_map_count=65530 # default
However, the performance of libdill then degrades to below that of Go, and worsens with each additional coroutine. In my tests libdill was averaging around 4000-4500 ns per "whisper" on the whisper benchmark, while Go was speeding up until it averaged around 2200 ns per whisper at 100000 coroutines.
Now, disabling the guard page (and thus no mmap through mprotect), using DILL_NOGUARD, brings the performance of libdill into the same order of magnitude as Go (actually slightly faster). However, we do lose the ability to check for stack overflows.
Memory overhead is not as bad as the allocated size suggests, as the majority of the memory in a stack is not resident until it is touched. This translates to a single page of memory (either 4096 or 8192 bytes) per coroutine. On my system with 32 GB of RAM, I can handle around 6 million resident coroutines in the whisper benchmark before it gives up and starts paging like crazy. In a real application, I suspect this would be closer to 4-5 million in memory.
What can happen?
I've currently updated my unpushed code base with an extended go() interface that allows optional parameters (through some standards-compliant C macro magic). In particular, you can now specify the allocation mechanism, guarding and stack size, which means it'll choose the best memory allocation mechanism between malloc & posix_memalign, use page guards, and allocate an 8192-byte stack.
Omitting parameters gives defaults (currently set to dstkbest and a 256 kB stack) so that the original libdill interface still works.
The amount of additional code required is minimal, as most of this is already implemented. The main code added just exposes the choice at the API level.
The features I currently want to add:
Memory Pool Support
Given that millions of coroutines are possible, it would be nice to allocate smaller stacks if it is known that the stack won't exceed a certain size - for example, if the call stack within the coroutine is reasonably flat, i.e. it only travels through libdill and dsock in its processing. Some analysis would need to be done here, but I would assume this is reasonable, and thus stacks could be as small as 1024 bytes.
This alone would be useful, and my implementation is around 90% there for supporting this. Passing (dstkfixed | dstkhuge | dstkmmap) would give you a fixed-size allocated coroutine using a

Segmented stack support (controversial)
The Problem
The ideal situation is:
What we currently have is:
Segmented stacks have problems. There's no denying that. But segmented stacks also have solutions.
Quick Primer
Stacks are usually big contiguous blocks of memory pre-allocated up front. This means they're fixed in size, and an overflow can easily be triggered. Segmented stacks are non-contiguous: a small stack is initially allocated, and when an overflow is detected, another block is allocated.
The original gcc -fsplit-stack design was presented here.
The Problems Go and Rust have found
In Rust and Go, it's assumed that any function can be a coroutine and can call any other function and can be interrupted at any time. This makes segmented stacks problematic.
However, in libdill we can make a very important assumption: when calling a function that does not use stack-splitting support, there is no context switch and no callback (unless some code that links with libdill calls the scheduler) whose lifetime extends past a libdill context switch. This fact means coroutines can split their stack at any time onto a single/global scratch buffer/stack. This scratch buffer/stack could actually be the application stack, as it would not be in use at that time.
Implementation Issues
Using -fsplit-stack
Both GCC > 4.6 and LLVM > 3.4 have segmented stacks implemented with -fsplit-stack. However, to get them working, the interface provided by libgcc is woefully painful to use. One could override __morestack and __morestack_allocate_stack_space to support libdill, and it would very likely work pretty well. However, -fsplit-stack is a global flag which applies to all functions compiled in a *.c file. This makes it cumbersome, in the sense that you have to explicitly mark all other functions as __attribute__((no_split_stack)) and, ironically, leave the coroutines unmarked.
There are ways around it, like submitting a patch or creating a plugin for GCC/LLVM to add a flag like -fsplit-stack-explicit, but the cost/benefit of developing this approach is currently on the fence.
Recommended solution (stack switching and not stack segmenting!)
However, given that this is an optimisation for coroutines, a neater solution would be to set a small stack size for coroutines and then wrap all functions that are likely to trigger a stack overflow in a "stack switcher". Is it safe? As safe as our coroutines, because it does a perfect tiny subset of what our context switching does: it keeps all the registers but switches the stack to a scratch buffer. Thus it is super cheap - much, much cheaper than context switching.
In fact, I've just realised that I've already written a generic macro that switches the stack pointer... so I actually already support this in my current code base. I'll wrap this up with some of my memory pool optimisations and show off this technique in a test.
The TLDR example:
So for a constrained-memory coroutine, the scratch macro gives the application writer the control to optimise stack usage without running into any problems (while still allowing them to shoot themselves in the foot). The one advantage of -fsplit-stack is that it does this automatically, at a runtime cost, making it a lot harder for the application writer to shoot themselves in the foot - but it moves the fight to the compilers instead.
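To make the idea concrete, a hedged usage sketch of the scratch macro described above (build_header and parse_with_big_locals are hypothetical functions):

coroutine void handler(void) {
    char header[128]; /* small locals stay on the tiny coroutine stack */
    build_header(header, sizeof(header));
    scratch(parse_with_big_locals(header)); /* the deep call tree runs on the scratch stack */
}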