Open grantek opened 1 year ago
Templates always seem to be slower. They might be neat in terms of code, but to this day compilers seem to generate significantly worse code without fail.
In many jobs I've worked, one of the first things we do is remove templates if we're struggling for performance on a particular platform; it's considered one of the easiest wins and can be the difference between something shipping and not shipping.
That said, the devs seem to love them, so I doubt you'll get very far trying to remove them here.
> Templates always seem to be slower. They might be neat in terms of code, but to this day compilers seem to generate significantly worse code without fail.
We've profiled Hyperstone and DSP16 extensively, and using templates to avoid conditional branches definitely makes hot functions faster. I already explained on the previous (broken) PR why the template doesn't actually avoid any conditional branches in this case. The performance improvement in that case had nothing to do with the template anyway; it was because the code wasn't dealing with all the cases correctly.
@cracyc
I noticed this while poking around the X68000 driver trying to learn about the system and discussed it in #10719, which was reverted due to an unrelated bug. Writing this up here mainly as a place to put my benchmarks, but I have one or two commits ready to resolve it.
In a recent refactor of screen_update, the function get_gfx_pixel was reworked into a template based on the value of a bool, in a pattern used elsewhere to optimise the logic in each generated version.
I was mainly trying to untangle the variable names to make them more descriptive of the aspect of blending/translucency that was happening, but I found that passing the variable as a function argument performed better. The discussion in #10719 noted that the template was only saving a couple of branches.
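To make the two patterns concrete, here is a minimal sketch of a bool template parameter versus a plain function argument. It is not the actual `get_gfx_pixel` code: every name below (`blend_half`, `pixel_templated`, `pixel_runtime`, `pixel_dispatch`) is hypothetical, and the blend is a simple average chosen only for illustration.

```cpp
#include <cstdint>

// Hypothetical blend: average two pixel values. Not the x68k blending logic.
constexpr std::uint16_t blend_half(std::uint16_t a, std::uint16_t b)
{
    return static_cast<std::uint16_t>((a + b) / 2);
}

// Pattern 1: the flag is a template parameter, so each instantiation
// contains only one of the two paths after constant folding.
template <bool Blend>
constexpr std::uint16_t pixel_templated(std::uint16_t top, std::uint16_t bottom)
{
    return Blend ? blend_half(top, bottom) : top;
}

// Pattern 2: the flag is an ordinary argument, tested inside the function.
constexpr std::uint16_t pixel_runtime(std::uint16_t top, std::uint16_t bottom, bool blend)
{
    return blend ? blend_half(top, bottom) : top;
}

// If the flag is only known at run time, the caller of the template still
// has to branch on it: the test moves to the call site rather than vanishing.
constexpr std::uint16_t pixel_dispatch(std::uint16_t top, std::uint16_t bottom, bool blend)
{
    return blend ? pixel_templated<true>(top, bottom)
                 : pixel_templated<false>(top, bottom);
}
```

Both forms compute the same result. Whether the template actually removes work depends on where the flag's value becomes known, so this sketch only shows the shape of the trade-off, not which form is faster.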
I ran some benchmarks on a Core i5-7600; `mame` was compiled with the default settings in the makefile, and `mamed` with `DEBUG=1`. (edit: removed x68k_v branch benchmarks)
Reverting the template to a function argument (x68k_v-revert-template) noticeably and consistently improves performance, but as an experiment I also tried rewriting the function manually into a version for blending vs. not (x68k_v-rewritten_gfx_pix):
The rewritten version also seems to perform better, which is interesting. The less optimised `mamed` builds do have higher numbers in the non-templated benchmarks, but are too variable on my system to make a clear distinction.