Open Alcaro opened 4 years ago
Adding an ifdef to libco.h
would definitely make it less pleasantly simple. However, if we're leaving a lot of performance on the table by not doing that, maybe we should. On the other hand, if it provides no speed-up or only a little, it might not be worth it.
I'd love to see any benchmarks you can run, and maybe we can get people on the byuu.org discord to help run benchmarks on platforms you don't have access to.
If we go ahead with this (adding #ifdef
in libco.h
), it might be worth it to convert entire libco
to a single header file (like the STB libraries). This way we won't have to deal with both an ugly header AND build system integration.
If it's only a header file, don't you get linker-time problems when the same symbol is defined in different translation units? I know C++ has a system to silently deduplicate symbols, but I don't think that works for ANSI C.
Here are the details straight from the horse's mouth: https://github.com/nothings/stb/blob/master/docs/stb_howto.txt
In short, you can either define the functions as static inline
(but it results in duplication, might not be a problem for libco
since the functions are so small), or you can wrap the actual functions in a #ifdef LIBCO_IMPLEMENTATION ... #endif
block and ask the user to define LIBCO_IMPLEMENTATION
in only one translation unit.
EDIT: The static inline
might not work for C89, I only have experience with C99 and C11.
With a GCC-powered co_switch, being duplicated and inlined is a feature, not a bug. If it's not inlined, it must preserve all callee save regs, just like the current one.
Either way, got it working. Seems to have boosted framerate on bsnes v115 on Super Mario World title screen on Linux from 167 to 168. Maybe things aren't inlined enough to have the impact I was hoping for (I didn't check the resulting assembly), maybe the libco overhead is near-exclusively because it screws up branch prediction, maybe GCC isn't very smart about asm (especially ones with huge clobber lists), maybe bsnes spends more time emulating than switching (maybe it'd have more of an impact in higan, or with the slow PPU), maybe it gives better results on Clang, newer GCC, or different hardware.
Here's the proof of concept patch (feel free to revert the sourceview hackery). I'll finish it if someone finds a platform where it helps (and if I can figure out how to handle that conditional thread local - should've told caller to pass the previous cothread); if it's insignificant everywhere, well, some experiments fail.
Seems like the BizHawk folks have independently implemented the same idea. Just thought I'd mention it here.
If we implement coswitch as a #define to a GCC __asm_\, instead of the current machine code trickery, we can let GCC deal with spilling the registers - and, more importantly, not spilling registers that are currently unused.
I believe this would yield a performance boost (though most likely small, sometimes zero depending on platform and how the callers look). The drawback is an ifdef in the header (can't put it in the backend, it'd just make gcc dump every callee preserve reg to the stack, just like the current backends).
Does this sound like an interesting optimization, or would ifdefs in headers be unappealing? If yes, I'll implement it, check how bsnes performance reacts, and submit a PR later today or tomorrow.
(GCC ASM obviously won't work on MSVC, but we can keep the current backends for that. Will work on Clang, though probably not clang-cl.)