Optimizing TIA rendering

thrust26 commented 7 years ago

The code seems to be optimizable. Simply moving the vblank() check to the beginning improves maximum framerate by 20..25% in my tests.

And maybe it would be better to retrieve to colors in priority order and stop retrieving when an object is enabled?

EDIT: Looks like my tests are not reliable. The difference seems much smaller than I first thought (<5%).

BTW: I noticed that at maximum frame rate, the sound gets way behind.

thrust26 commented 7 years ago

Worked on the method and profiled my results with VS with Boulder Dash's attraction mode for 90 seconds. The attached code improves the CPU usage of renderPixel from 7.4% to 5.4% (=~35% faster).

So overall Stella improved by ~2%. 😄

tia.zip

sa666666 commented 7 years ago

Implemented in https://github.com/stella-emu/stella/commit/18568cfbeb2b1d235bfdabeee99e15a406d4ab2a. I only made one change; I moved the test for vblank() to the ELSE part instead of the IF part. My reasoning is that the true part actual rendering) happens much more often than the false part (vblank). If this regresses your improvements, I can change it back.

thrust26 commented 7 years ago

I doubt it makes any (even measurable) difference.

DirtyHairy commented 7 years ago

:+1: Nice that you managed to squeeze something out this way --- renderPixel didn't stand out as a hotspot when I was profiling the code. At which optimization level is VS profiling the code? I would have assumed that, after inlining, the gain of short-circuiting the logic should be barely noticable. The reason why all individual render calls happen in every evaluation is mainly historical, I think --- more logic used to be hidden there.

thrust26 commented 7 years ago

Full Optimization /Ox Maybe the VS optimization is less good than your compiler?

The main gain (~80%) is coming from skipping the color calculation during VBlank. Maybe this can be even improved by using YStart and Height?

Here is the result after my optimization: stella_profiling

sa666666 commented 7 years ago

I'm surprised that TIASurface::render() is happening so quickly (not complaining though). That's where the image is scaled, TV effects performed, etc. Did you try testing on a ROM that uses phosphor mode, and also have TV effects turned on? I'll bet in such a case, that method will take more time.

EDIT: I'm not sure how to interpret the data; what is the difference between "Total CPU" and "Self CPU"? Also, I don't know if I even use full optimization when making Stella builds for WIndows. Perhaps I should start doing that??

thrust26 commented 7 years ago

With TV effects and phosphor enabled TIASurface::render() and AtariNTSC::render() really come into play. stella_profiling_rgb

sa666666 commented 7 years ago

OK, can you now use TV effects but not phosphor? I suspect that AtariNTSC::render will be the same, but TIASurface::render will be much faster.

thrust26 commented 7 years ago

You suspected right. 😃 So phosphor rendering is a spot for optimizing.

Could we optimize height? So that only the really displayed area is rendered? stella_profiling_rgb_only

sa666666 commented 7 years ago

It's fairly easy to see why this is the case. Blargg TV effect calls AtariNTSC::render, and Blargg + phosphor calls AtariNTSC::render + a loop to copy the old buffer to the new. But since we do it after Blargg, the framebuffer is 560 pixels wide instead of 160. You may recall we all discussed this in some detail in another thread. Doing the phosphor before Blargg would result in it happening on a smaller image, but then Blargg would have to modified to work with phosphor mode (and we determined that to be too difficult).

The code is at lines 379-397 of src/emucore/TIASurface.cxx. I don't know if we can optimize this further.

BTW 1: The code currently only renders based on the height that is seen. So no savings there. And it only renders the smallest possible image, and leaves stretching to the hardware. So no savings there either (it will process an image of ~160x200 for non-Blargg, and 560x200 for Blargg).

BTW 2: If we're going to continue this discussion here, perhaps it's best to reopen this issue and name it something else :smile:

thrust26 commented 7 years ago

The code currently only renders based on the height that is seen.

Yes and no. myTIA->height() returns the height as defined in Game Properties or 210 if nothing is defined. Depending on the game, I am sure this can be reduced. Either by allowing values below 210 and/or by detecting when the last object is displayed (e.g. VBLank is enabled). I remember we discussed this before and it turned out to be a bit more complex.

As for accelerating the code, I have two ideas for now (note: 18.2% of the 19.7% are spend in getRGBPhosphor):

The values returned by getPhosphor should be precalculated (256x256 bytes + the same for grayscale).
Maybe cache the getRGBPhosphor values? Since 2^24x2^24 is way too big, maybe we could use a hashmap instead? It depends on how many different values can occur. Probably too many.

So first we should try 1.

sa666666 commented 7 years ago

I will try (1) in the next day or so. We can tackle the height later, if we can deal with the complexities you mentioned.

thrust26 commented 7 years ago

Already implemented (1) with nice results (overall: 19.7% -> 11.2%, getRGBPhosphor(): 18.2% -> 8.1%): TIASurface.zip stella_profiling_rgb_phosphor_new Grayscale still has potiential, but first I need an idea.

thrust26 commented 7 years ago

EDIT: I'm not sure how to interpret the data; what is the difference between "Total CPU" and "Self CPU"? Also, I don't know if I even use full optimization when making Stella builds for WIndows. Perhaps I should start doing that??

AFAIK Total CPU returns the aggregated time of the routine and all subroutines, while Self CPU only shows the time spend in the routine itself (and inlined code). So the sum for column Self CPU should be 100%.

Not sure if full optimization really makes a difference.

thrust26 commented 7 years ago

The parameter shift in getRGBPhosphor() seems to be only used for the debugger. Isn't phosphor disabled for the debugger?

sa666666 commented 7 years ago

Yes, the shift only seems to be used in the debugger. Phosphor for the debugger isn't disabled, but since you can only see one frame at a time, it effectively is disabled. So I think we can remove the shift parameter from getRGBPhosphor completely.

I added your code in https://github.com/stella-emu/stella/commit/1dc78a9b5b8bb809e59f8a24e630d812609be373. I made the phosphor LUT be calculated just once, in the c'tor. And I removed the shift parameter as discussed above. Please retest to confirm the speedups.

thrust26 commented 7 years ago

I think you have to move the phosphor calculation back, because blend value changes have no effect now until you reload.

sa666666 commented 7 years ago

Oops, fixed in https://github.com/stella-emu/stella/commit/8759f340e874df2f4cdc9b05192ce3eaa366ca00. I looked at it and thought "we're going through the same loop, so why do it every time", but I forgot that getPhosphor() is dynamic, and depends on the current blend level.

thrust26 commented 7 years ago

Profiling with VS doesn't work at the moment, it's only outputting addresses, not methods. I had to stash my local changes, probably that changed some settings. May take a while until I got that fixed.

EDIT: Instructions here: https://msdn.microsoft.com/en-us/library/fsk896zz.aspx

sa666666 commented 7 years ago

OK, there are two more possibilities to consider:

as @thrust26 mentioned, using the actual height of the image (ie, figuring out exactly where, height-wise, the display starts and stops, and only rendering there (and blanking the rest)
for phosphor mode, doing the blending before Blargg

The first will require help from the TIA, as only it can know where it starts and stops. So we need to speak to @DirtyHairy about that one.

The second could potentially be faster than now, since pre-Blargg the image is 160 pixels wide, and post-Blargg it's 560 pixels wide. That's a decrease of 3.5x if we do blending before Blargg. What remains to be seen is (a) is this semantically correct (ie, will it still lead to correct output) and (b) whether that new approach/code will be fast enough to not negate any of the savings.

sa666666 commented 7 years ago

I don't know if you want to try this, but consider having a look at https://tortoisegit.org/docs/tortoisegit/tgit-dug-reset.html. It details how to do a hard reset on your local repo, which basically resets everything back to the way it was at initial checkout. Maybe it's better than using stash??

thrust26 commented 7 years ago

Regarding the 2nd point, I still think phosphor has to be done after Blargg. That's because Blargg mostly emulates the effects of the electronics behind the screen, while phosphor is happening directly on the screen. Thus it happens later in the picture generation.

However, if we are really brave, we could try to integrate phosphor into Blargg and maybe save some time this way.

sa666666 commented 7 years ago

I am content to not have to touch Blargg if we can get our speedups in another way. We already discussed that one in some detail, and as you say, it seems that it should be done in the current order anyway.

EDIT: I guess the easiest thing to do is change the phosphor loop from using AtariNTSC::outWidth(kTIAW) to simply kTIAW and see what effect it has on the CPU usage. Obviously the output will be incorrect, but at least it would give us an idea of the maximum possible speedup to expect. And if it's not much, then it's not worth pursuing.

thrust26 commented 7 years ago

I started working at a new project this week. There they are using GIT, so I have to learn it now anyway. 😃

sa666666 commented 7 years ago

You can't escape GIT :smiling_imp:

thrust26 commented 7 years ago

🏃

thrust26 commented 7 years ago

OK, profiling works again. And the results are as expected.

sa666666 commented 7 years ago

OK, I'm logging off for the evening. Maybe you can try my suggestion from above (on line 388 in TIASurface.cxx, change AtariNTSC::outWidth(kTIAW) to kTIAW and re-test). The image will still be displayed, but phosphor will only be applied for 160 pixels on each line. But it will tell use the maximum possible speedup we could get.

thrust26 commented 7 years ago

Maximum speed should be about 8.1%/3.5. So we would gain 5-6%.

DirtyHairy commented 7 years ago

Git will find you, wherever you hide :smile:

DirtyHairy commented 7 years ago

The profiles look pretty similar to my own findings with GCC and cachegrind in #7. I'll redo these and make screenshots later this week so we can compare the two compilers.

The main gain (~80%) is coming from skipping the color calculation during VBlank. Maybe this can be even improved by using YStart and Height?

renderPixel is only called if the FrameManager is in rendering state, so this information is already being used :smirk: In fact, I am surprised that this change is that effective --- I'll have to port it to 6502.ts I guess.

thrust26 commented 7 years ago

renderPixel is only called if the FrameManager is in rendering state, so this information is already being used 😏 In fact, I am surprised that this change is that effective --- I'll have to port it to 6502.ts I guess.

Hm, then something in the code seems not quite right, doesn't it?

DirtyHairy commented 7 years ago

Hm, then something in the code seems not quite right, doesn't it?

That's possible, but if FrameManager::isRendering were broken, the code should have segfaulted long ago :smirk: There are also legitimate reasons for vblank to occur during rendering (the frame height is fixed and not determined dynamically), so there will definitely be an effect.

I can't imagine this amounts to 80%, but I am not very confident in the precision and interpretation of numbers gained through this kind of microprofiling anyway. You are profiling a very small, hot region of code, and any change will have an impact. Moving the vblank check might have other side effects on the generated code that affect things caching, register spilling, branch prediction etc.

Looking at the generated assembly might help to clarify what's going on, but I don't think it is worth the effort. Checking how often which branch (vblank on/off) is taken with a simple print statement should be worth the effort, will do so this week :smiley:

thrust26 commented 7 years ago

I meant (if I got you right), then renderPixel should not be called while VBlank is active, but yet it is. So there might be room for improvement?

This is checked before: bool TIA::isRendering() const { return myState == State::frame && myHasStabilized; }

DirtyHairy commented 7 years ago

I'll definitely have a closer look, it still may be a smoking gun of some other defect.

thrust26 commented 7 years ago

~~I debugged renderPixel() again. vblank() never returns true, so the else part is never executed. 😕 ❓~~

~~Whatever caused my improvements, its not the vblank() check. Maybe the compiler can optimize this better?~~

EDIT: After some more debugging, I found (again), that you cannot trust VS. Even with all optimizations disabled, it still optimizes! And then some breakpoints don't work anymore (without any notice of course). 😠

Anyway, finally I was able to confirm that renderPixel() is definitely called while VBlank is enabled.

Could this be due to the different heights and positions of the various kernels (copyright notice, AtariAge logo, title screen and game screen)? I noticed that one of my own ROMs (I cannot share in public), which got a title added, is displayed incorrectly centered later. The top is cut off.

sa666666 commented 7 years ago

Maybe it's not the vblank() check at all. You also changed methods, and introduced IF statements. Previously, all 7 methods were always called, but now there's the possibility to short-circuit and not do all of them. Perhaps this is where the improvement comes from.

DirtyHairy commented 7 years ago

Could this be due to the different heights and positions of the various kernels (copyright notice, AtariAge logo, title screen and game screen)?

Definitely, that's what I meant before.

Have you checked how stable the numbers for your performance improvements (and their attribution to the various changes you made) are between different profiler runs?

thrust26 commented 7 years ago

Definitely, that's what I meant before.

If the code could detect screen changes, it could adapt to that, e.g. slowly reducing the rendered area, but increasing it instantly when required.

The number are pretty stable, though by far not as precise as indicated by the two digits after the comma. The timing varies by less than 1% (up to ~0.5%).

But it would be cool if you could verify some results.

thrust26 commented 7 years ago

More ideas for improvements:

@DirtyHairy z26 used to identify areas, which have been changed during the last frame and update only those. Maybe we could do something similar for Stella?
@sa666666 We could use the GPU for TV effects and phosphor. Does SDL2 (or a different library) provide direct hardware rendering access?
Use multiple threads for rendering. Even slow hardware often has multiple cores.

sa666666 commented 7 years ago

Responding to each point in turn:

What you're talking about is called 'dirty rectangle' updates. Stella is using SDL2 behind the scenes, and in most cases it is using hardware acceleration (Direct3D for Windows, OpenGL for Linux, OSX). Conventional wisdom is that dirty rects don't work for hardware acceleration; it's often faster to send the entire image (160x200 or 560x200) than to track where the changes are and only send that data.
This is definitely an option, but greatly increases complexity. SDL2 definitely has the ability to drop back to lower levels; in fact, up to release 3.9.3 (last in the 3.x series), it was using OpenGL entirely. I wrote hand-crafted code for rendering, but still used software for doing TV effects and phosphor. The main problem is that once we go to a specific toolkit, we're stuck with it. So that means OpenGL, since it's the most cross-platform. And then we create shader programs (or similar) that are OpenGL-only. Problem is that OpenGL isn't everywhere, and Direct3D works much better in Windows. The current code uses SDL2 abstraction to interface with the best possible hardware libraries on each system. We break the abstraction by going lower-level, and all the problems that entails.
I would only want to go to threading as an absolute last resort; IMO the added complexity isn't worth it for what we'd gain.

Only updating the image from start scanline to end scanline is a good compromise. The entire data itself would still be sent to the video card, but we'd save on calls to getPhoshor and friends for lines that don't need it. After all, sending the data to the card only accounted for 2% runtime; it's the phosphor calculations that kill it.

I think we may have to accept that, barring any further optimizations like those from the past week, the new core is just slower than the old one. Not a big surprise really, as it's emulating at a much lower and more accurate level.

And this will continue when we get to the sound code. It looks like it will be slower too, but again, much more accurate.

thrust26 commented 7 years ago

Conventional wisdom is that dirty rects don't work for hardware acceleration

True, but we are doing software rendering for TV effects and phosphor, aren't we? There dirty rectangles might work, no?

About GPU rendering: Does SDL2 have any high level support for this?

I would only want to go to threading as an absolute last resort; IMO the added complexity isn't worth it for what we'd gain.

I only wanted to use it for the loops in TIASurface::render(). My threading knowledge is very limited, is it really that complex?

thrust26 commented 7 years ago

Minimal changes to getRGBPhosphor. Profiling indicates a small speedup.

// - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
inline uInt32 TIASurface::getRGBPhosphor(const uInt32 c, const uInt32 p) const
{
  #define TO_RGB(color, red, green, blue) \
    const uInt8 red = color >> 16; const uInt8 green = color >> 8; const uInt8 blue = color;

  TO_RGB(c, rc, gc, bc);
  TO_RGB(p, rp, gp, bp);

  // Mix current calculated frame with previous displayed frame
  const uInt8 rn = myPhosphorPalette[rc][rp];
  const uInt8 gn = myPhosphorPalette[gc][gp];
  const uInt8 bn = myPhosphorPalette[bc][bp];

  return (rn << 16) | (gn << 8) | bn;
}

sa666666 commented 7 years ago

Yes, I guess dirty rects might work for tracking changes in the software rendering part of the pipeline. I guess only drawing from start scanline to end scanline is a form of this. The (potential) issue is that tracking what has changed is just as slow as just doing the normal work. But you're welcome to give it a try :smiling_imp:

IME, threading itself conceptually isn't that difficult. Using threading and actually getting a speedup is much harder. Sometimes you do just as much work with thread synchronization as you would by not using threading at all. Sort of like what I describe in the previous paragraph, I guess. And then there are times that threading won't help because the process is inherently non-parallel.

thrust26 commented 7 years ago

I suppose only @DirtyHairy can evaluate the overhead for dirty rectangles without implementing it. Let's wait what he says.

I remember those threading problems, but the task here seems simple enough. And the execution time seems to be constant for each line, so it should be rather easy to get them running in parallel. Or am I missing something here? We could give it a try for phosphor only.

sa666666 commented 7 years ago

For the loops in TIASurface::render(), it might also be beneficial to use a pointer directly, instead of accessing it with [] and addition. But we're really getting into micro-optimizations at this point.

thrust26 commented 7 years ago

One more idea:

We could save some shifting if we use three different myPhosphorPalettes. But the memory overhead is probably not worth it.

sa666666 commented 7 years ago

Considering that each palette is 64KB, yes, probably not worth the overhead.

sa666666 commented 7 years ago

Threading wouldn't work on Blargg anyway, I think, since each pixel is formed by looking at its nearest neighbours (both above and below the current scanline), so it necessarily can't do one line until the previous one is done. But there is no such restriction on the loops for blending, so I guess it's possible there.

sa666666 commented 7 years ago

Implemented the change mentioned above in https://github.com/stella-emu/stella/commit/1e7e4cbe207f1e10dc851f07204735897eb2eab4.

stella-emu / stella

Optimizing TIA rendering #185