stella-emu / stella

A multi-platform Atari 2600 Emulator
https://stella-emu.github.io
GNU General Public License v2.0

Profiling and inlining #7

Closed · DirtyHairy closed this issue 7 years ago

DirtyHairy commented 7 years ago

The 6502.ts core is written with JavaScript JIT inlining in mind. The C++ implementation should be profiled and checked for possible performance gains through inlining.

sa666666 commented 7 years ago

@DirtyHairy, are you familiar with any profiling tools for C++, or the process to follow to optimize in such cases? I've tried compiling Stella with profiling enabled, and it does show what percentage of the total time is spent in each method. Unsurprisingly, the bulk of the time is spent in TIA::renderpixel() and TIA::tickHframe(), then the various Player and Missile tick() and render() methods. I've inlined as much as possible in these methods, except for Player/Missile tick()/render(). I can try those too, but at this point I don't think that function call overhead is the issue.

I've done a very rudimentary analysis of how much overall CPU time is being taken (using the System Activity monitor in KDE, basically a frontend to 'top'). In Stella 4.7.3, a typical run shows 2% CPU usage on my system, dropping to 1% when I turn off TV effects. The same test with the current git code shows between 5-7% CPU usage, with TV effects not having much of an impact.

I know that Stella now has much more accurate TIA emulation, and this may be the price we pay for that (and if so, it's fine by me). But this indicates that the emulation has slowed down by a factor of at least three, and if possible I'd like to get some of that speed back. Any suggestion on how to proceed?

DirtyHairy commented 7 years ago

I have done a lot of number crunching in the past and have some experience with profiling and optimization. It was FORTRAN back then, but I don't think that makes a big difference :smirk: The tooling I was going to use is gprof, gcov and whatever else Google might turn up. As for the consequences, I don't have a plan laid out.

I have already done some profiling on the JavaScript version in the past, but the computational costs are different there, dominated by the overhead of function dispatch and the related type assertions. Without looking at the profiles, I agree with you that, in its current state, function dispatch is likely no longer the dominating issue in the C++ version.

That said, I have some ideas on how some things could be optimized. In particular, the various render methods could be changed to be evaluated only when relevant quantities change (instead of every cycle), turning the corresponding parts of TIA.cxx essentially into pure lookups (at the expense of slightly increased costs elsewhere). Similar optimization potential might exist in other places, and I hope that a profile with per-line resolution will be able to identify this.

However, I am not surprised that the new core is slower by an integer factor and, while there still might be room for maybe 10%-20% improvement, my gut feeling is there won't be any groundbreaking progress here. After all, the old core was doing an amazing job at being fast while retaining impressive accuracy, while the new core is essentially a microsimulation of the TIA (even though not at the gate level).

DirtyHairy commented 7 years ago

I spent some time throwing Stella at oprofile and didn't get any conclusive results yet. I am not happy with oprofile's output, so I'll give it another try :smirk:

DirtyHairy commented 7 years ago

I've had more profiling success with valgrind/callgrind. The results suggest that, even at -O2, about 20% of the time is spent in the STL vector implementation in DelayQueue. In other words, there should be room for improvement by rewriting DelayQueue to use plain arrays instead of STL vectors.

For reference, here are the callgrind profiles (one for -O0, one for -O2): callgrind.zip. You can view them using kcachegrind.

I will proceed to rewrite the DelayQueue. Other than that, I can't spot any low-hanging fruit in those profiles.

DirtyHairy commented 7 years ago

I have rewritten DelayQueue & friends in 744571b. While profiling results still suggest considerable time spent in DelayQueue::execute, typical CPU usage on my laptop has gone down from ~12% to ~9%, so I consider this a success (Stella 4 is ~6%). From looking at the profile I don't see anything else that sticks out, so I am closing this ticket.

DirtyHairy commented 7 years ago

Hm. A bit of dirty testing shows that there might be some micro-optimizations in DelayQueue (replacing the modulus with bit masks) that could squeeze out another 5% or so. However, this is architecture dependent and would reduce readability, so I'll skip it for now.