mamedev / mame

MAME
https://www.mamedev.org/

x68k_v.cpp: Slight performance regression in templated get_gfx_pixel #10769

Open grantek opened 1 year ago

grantek commented 1 year ago

I noticed this while poking around the X68000 driver trying to learn about the system and discussed it in #10719, which was reverted due to an unrelated bug. Writing this up here mainly as a place to put my benchmarks, but I have one or two commits ready to resolve it.

In a recent refactor of screen_update, the function get_gfx_pixel was reworked into a template based on the value of a bool, in a pattern used elsewhere to optimise the logic in each generated version.

I was mainly trying to untangle the variable names to make them more descriptive of the aspect of blending/translucency that was happening, but I found that passing the variable as a function argument performed better. The discussion in #10719 noted that the template was only saving a couple of branches.
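For illustration, the two patterns being compared look roughly like the sketch below. This is not the actual get_gfx_pixel code: the names pixel_tmpl, pixel_arg, mix and check_variants are hypothetical, and the "blend" is just an average of two values, assumed only to give the flag something to select.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the blending step; not the MAME implementation.
inline std::uint16_t mix(std::uint16_t a, std::uint16_t b)
{
    return static_cast<std::uint16_t>((a + b) / 2);
}

// Pattern 1: the flag is a template parameter, so the compiler generates
// one function per value and the unused path is discarded at compile time.
template <bool Blend>
std::uint16_t pixel_tmpl(std::uint16_t top, std::uint16_t bottom)
{
    if constexpr (Blend)
        return mix(top, bottom);
    else
        return top;
}

// Pattern 2: the flag is an ordinary run-time argument, so there is a
// single function that tests the flag on every call.
inline std::uint16_t pixel_arg(std::uint16_t top, std::uint16_t bottom, bool blend)
{
    return blend ? mix(top, bottom) : top;
}

// Sanity check: both patterns must return the same pixel for either flag value.
inline void check_variants(std::uint16_t top, std::uint16_t bottom)
{
    assert(pixel_arg(top, bottom, true) == pixel_tmpl<true>(top, bottom));
    assert(pixel_arg(top, bottom, false) == pixel_tmpl<false>(top, bottom));
}
```

Both patterns compute the same result; which one is faster depends on what the compiler does with the surrounding code, which is what the benchmarks below try to measure.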

I ran some benchmarks on a Core i5-7600. mame was compiled with the default settings in the makefile, and mamed with DEBUG=1. (edit: removed x68k_v branch benchmarks)

mame@debian-bullseye:~/mame$ for BRANCH in master; do for MAME in mame mamed; do for I in {1..5}; do echo ${MAME}-${BRANCH} $I; SDL_VIDEODRIVER=dummy ./${MAME}-${BRANCH} -bench 600 x68000 shangon 2>/dev/null; done; done; done
mame-master 1
Average speed: 478.26% (599 seconds)
mame-master 2
Average speed: 477.15% (599 seconds)
mame-master 3
Average speed: 482.34% (599 seconds)
mame-master 4
Average speed: 476.23% (599 seconds)
mame-master 5
Average speed: 441.77% (599 seconds)

mamed-master 1
Average speed: 310.10% (599 seconds)
mamed-master 2
Average speed: 311.12% (599 seconds)
mamed-master 3
Average speed: 306.51% (599 seconds)
mamed-master 4
Average speed: 308.88% (599 seconds)
mamed-master 5
Average speed: 310.33% (599 seconds)

Reverting the template to a function argument (x68k_v-revert-template) noticeably and consistently improves performance. As an experiment, I also tried manually rewriting the function as separate blending and non-blending versions (x68k_v-rewritten_gfx_pix):

mame@debian-bullseye:~/mame$ for BRANCH in x68k_v-revert-template x68k_v-rewritten_gfx_pix; do for MAME in mame mamed; do for I in {1..5}; do echo ${MAME}-${BRANCH} $I; SDL_VIDEODRIVER=dummy ./${MAME}-${BRANCH} -bench 600 x68000 shangon 2>/dev/null; done; done; done
mame-x68k_v-revert-template 1
Average speed: 497.35% (599 seconds)
mame-x68k_v-revert-template 2
Average speed: 492.78% (599 seconds)
mame-x68k_v-revert-template 3
Average speed: 496.82% (599 seconds)
mame-x68k_v-revert-template 4
Average speed: 497.03% (599 seconds)
mame-x68k_v-revert-template 5
Average speed: 499.17% (599 seconds)

mamed-x68k_v-revert-template 1
Average speed: 321.25% (599 seconds)
mamed-x68k_v-revert-template 2
Average speed: 313.51% (599 seconds)
mamed-x68k_v-revert-template 3
Average speed: 319.75% (599 seconds)
mamed-x68k_v-revert-template 4
Average speed: 316.38% (599 seconds)
mamed-x68k_v-revert-template 5

mame-x68k_v-rewritten_gfx_pix 1
Average speed: 495.79% (599 seconds)
mame-x68k_v-rewritten_gfx_pix 2
Average speed: 494.36% (599 seconds)
mame-x68k_v-rewritten_gfx_pix 3
Average speed: 500.45% (599 seconds)
mame-x68k_v-rewritten_gfx_pix 4
Average speed: 503.51% (599 seconds)
mame-x68k_v-rewritten_gfx_pix 5
Average speed: 498.89% (599 seconds)

mamed-x68k_v-rewritten_gfx_pix 1
Average speed: 317.63% (599 seconds)
mamed-x68k_v-rewritten_gfx_pix 2
Average speed: 306.05% (599 seconds)
mamed-x68k_v-rewritten_gfx_pix 3
Average speed: 321.09% (599 seconds)
mamed-x68k_v-rewritten_gfx_pix 4
Average speed: 316.00% (599 seconds)
mamed-x68k_v-rewritten_gfx_pix 5

The rewritten version also seems to perform better, which is interesting. The less optimised mamed builds do have higher numbers in the non-templated benchmarks, but are too variable on my system to make a clear distinction.
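To put a single number on each release (mame) build, the five runs per branch can be averaged. A minimal sketch follows; the helper name mean and the vector names are hypothetical, and the values are copied from the output above. The mamed runs are left out because two of them have no result listed.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a set of "Average speed" percentages.
inline double mean(const std::vector<double>& runs)
{
    assert(!runs.empty());
    return std::accumulate(runs.begin(), runs.end(), 0.0)
         / static_cast<double>(runs.size());
}

// Release-build results copied from the benchmark output above.
const std::vector<double> master   {478.26, 477.15, 482.34, 476.23, 441.77};
const std::vector<double> reverted {497.35, 492.78, 496.82, 497.03, 499.17};
const std::vector<double> rewritten{495.79, 494.36, 500.45, 503.51, 498.89};
```

With these inputs the means come out at about 471.15% for master, 496.63% for x68k_v-revert-template and 498.60% for x68k_v-rewritten_gfx_pix. The fifth master run (441.77%) is a low outlier; without it the master mean is roughly 478.5%, which is still below both other branches.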

ghost commented 1 year ago

Templates always seem to be slower. They might be neat in terms of code, but to this day compilers seem to generate significantly worse code without fail.

In many jobs I've worked, one of the first things we do is remove templates if we're struggling for performance on a particular platform; it's considered one of the easiest wins and can be the difference between something shipping and not shipping.

That said, the devs seem to love them, so I doubt you'll get very far trying to remove them here.

cuavas commented 1 year ago

> Templates always seem to be slower. They might be neat in terms of code, but to this day compilers seem to generate significantly worse code without fail.

We've profiled Hyperstone and DSP16 extensively, and using templates to avoid conditional branches definitely makes hot functions faster. I already explained on the previous (broken) PR why the template doesn't actually avoid any conditional branches in this case. The performance improvement in that PR had nothing to do with the template anyway; it was because the code wasn't dealing with all the cases correctly.
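As a generic illustration of where a template can remove a branch from a hot path, consider the sketch below. It is not taken from the Hyperstone, DSP16 or X68000 code; fill_line, fill_line_dispatch and the Invert flag are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hot loop: the flag is fixed for the whole call, so each
// instantiation has a loop body that contains no test on it.
template <bool Invert>
void fill_line(std::vector<std::uint8_t>& dst, const std::vector<std::uint8_t>& src)
{
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
    {
        if constexpr (Invert)
            dst[i] = static_cast<std::uint8_t>(~src[i]);
        else
            dst[i] = src[i];
    }
}

// The run-time flag is tested once per line, outside the loop.
inline void fill_line_dispatch(std::vector<std::uint8_t>& dst,
                               const std::vector<std::uint8_t>& src,
                               bool invert)
{
    if (invert)
        fill_line<true>(dst, src);
    else
        fill_line<false>(dst, src);
}
```

The benefit only exists when the template parameter replaces a test that would otherwise run inside the loop. If the caller still has to branch on the same condition for every pixel to pick an instantiation, nothing has been saved.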

simzy39 commented 1 year ago

@cracyc