Closed thrust26 closed 7 years ago
My opinions on OMP vs. C++11 threading:
I'm not saying outright to just close the book on this approach, just that we shouldn't be overly consumed by getting it to work. IMO, improvements to the code itself are more important than threading improvements.
@thrust26, what is your current code? There have been several iterations now, but I haven't committed anything else yet. Also, maybe you can try your 'google code' with 4 threads instead of 2, to see what kind of performance it gives.
You mean the non-blargg with phosphor code?
I agree about the first point of OMP. Christian and I discussed this on gitter already. The 2nd point (and why we should try to use threading) is there to help slow, multi core devices.
@thrust26, what you called 'hand-written' code, or 'google code'. It's getting hard to keep track of things now, since this issue contains almost 100 messages and several different code examples.
What I was referring to is your original code that uses 2 C++11 threads. It was operating on the Blargg+phosphor code path, but obviously only applied to the phosphor loop (since we haven't even looked at optimizing Blargg yet).
I suggest using 4 threads instead of 2, to see (a) if it makes things any faster, and more importantly (b) that it doesn't drastically kill performance like OMP did.
I feel we're all ping-ponging a bit on this stuff, basically going all over the place. For example, in the process of optimizing the phosphor loop for use without threading, we may be making it more difficult to do with threading. So we need to decide which approach to take, and work from there.
Also, it's 10AM here now. At 1PM (3 hours from now) I won't be available for at least 3 hours (invigilating my final exam), so if there's anything you need from me, now is the time.
There you go!
See line 40 for enabling passive OMP wait policy.
case Filter::BlarggPhosphor:
contains multiple threading variations (and one without). I have no clue how to measure the effects; probably a better profiler could tell.
Applied threading to Blargg now too. This and the cleaned-up TIASurface are attached.
Profiling looks good:
Question: Should we really split the work across all available cores? Or should we reserve some cores, e.g. for the OS? Also, should we limit the number of threads? The threads are not running very long, so the overhead may eventually become larger than the gain.
I have the vague feeling that 2..4 threads are enough.
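Something like the following could express that heuristic. This is only a sketch of the idea from this discussion (the cap of 4 is our gut feeling here, not a measured optimum), and pickNumThreads is an illustrative name, not anything committed:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>

// Possible heuristic for the number of rendering threads: take what
// the hardware reports, but never fewer than 1 (hardware_concurrency()
// may return 0 when it cannot detect the core count) and cap at 4,
// since each frame's work is short and thread overhead eventually
// outweighs the gain.
uint32_t pickNumThreads()
{
  const uint32_t hw = std::thread::hardware_concurrency();  // may be 0
  return std::clamp(hw, 1u, 4u);
}
```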
Sorry, still having a brain fart and still not totally understanding the results of those graphs. In any event, as long as nothing slowed down, I'd be happy :smile:
As for limiting threads and reserving for the OS, I think 4 is probably enough too. But I'd like to hear what @DirtyHairy has to say, and also some testing from him in Linux.
EDIT: I also want to test on my 16-thread machine to see what happens.
@thrust26, the following patch solves your issue about passing the phosphor palette to AtariNTSC. I tried cleaner ways of doing it, but basically resorted to just copying the data. It's only 64KB, and the copy only happens when the phosphor mode changes. pass_phosphor.diff.zip
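A minimal sketch of that approach, with illustrative names (setPhosphorPalette, myPhosphorPalette, phosphorLUT are not necessarily what the actual patch uses): instead of sharing state with the owner of the palette, AtariNTSC keeps its own copy of the 256x256 blend table, which is exactly the 64KB mentioned above.

```cpp
#include <cstdint>
#include <cstring>

class AtariNTSC
{
  public:
    // Copy the caller's 256x256 phosphor blend table (64KB); after
    // this call AtariNTSC no longer depends on the palette's owner.
    // Only invoked when the phosphor mode changes, so the cost of the
    // memcpy is negligible.
    void setPhosphorPalette(const uint8_t palette[256][256])
    {
      std::memcpy(myPhosphorPalette, palette, sizeof(myPhosphorPalette));
    }

    // Look up the blended value for current colour 'c' and previous
    // colour 'p'.
    uint8_t phosphorLUT(uint8_t c, uint8_t p) const
    {
      return myPhosphorPalette[c][p];
    }

  private:
    uint8_t myPhosphorPalette[256][256];  // the 64KB local copy
};
```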
Attached the latest source code.
Rendering_Blargg_with_Phosphor.zip
And here is the profiling result, run twice. This should give an indication of the (rather low) precision of the values.
The attached git diff patch now compiles in Linux with gcc and clang. @DirtyHairy, you can apply it to your threading branch when you like. threading.diff.zip
A small update to the existing threading (if we want to continue using it). This now keeps the main thread busy too and spawns one fewer extra thread:
// - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
void AtariNTSC::render(const uInt8* atari_in, const uInt32 in_width, const uInt32 in_height,
                       void* rgb_out, const uInt32 out_pitch, uInt32* rgb_in)
{
  // Spawn myNumThreads - 1 worker threads...
  // (myThreads holds myNumThreads - 1 entries, hence the i-1 index)
  for (uInt32 i = 1; i < myNumThreads; i++)
    myThreads[i-1] = std::thread([=] {
      rgb_in == nullptr ?
        renderThread(atari_in, in_width, in_height, myNumThreads, i, rgb_out, out_pitch) :
        renderWithPhosphorThread(atari_in, in_width, in_height, myNumThreads, i, rgb_in, rgb_out, out_pitch);
    });

  // ...make the main thread do its share of the work too (part 0)...
  rgb_in == nullptr ?
    renderThread(atari_in, in_width, in_height, myNumThreads, 0, rgb_out, out_pitch) :
    renderWithPhosphorThread(atari_in, in_width, in_height, myNumThreads, 0, rgb_in, rgb_out, out_pitch);

  // ...and join the workers again
  for (uInt32 i = 1; i < myNumThreads; i++)
    myThreads[i-1].join();

  // Copy phosphor values into the output buffer
  if (rgb_in != nullptr)
    memcpy(rgb_out, rgb_in, in_height * out_pitch);
}
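The renderThread/renderWithPhosphorThread bodies aren't shown above, but the partitioning they presumably perform can be sketched like this (sliceFor and RowSlice are illustrative names): thread threadNum of numThreads takes a contiguous slice of scanlines, with integer arithmetic spreading any remainder so the slices tile the frame exactly.

```cpp
#include <cstdint>

// Half-open range of scanlines [first, last) handled by one thread.
struct RowSlice { uint32_t first, last; };

// Sketch of even row partitioning: multiplying before dividing spreads
// the remainder rows across the threads, and adjacent slices share a
// boundary, so every row is rendered exactly once.
RowSlice sliceFor(uint32_t height, uint32_t numThreads, uint32_t threadNum)
{
  return { height * threadNum       / numThreads,
           height * (threadNum + 1) / numThreads };
}
```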
Forgot this minor one in AtariNTSC::initialize(), line 40:
myThreads = new std::thread[myNumThreads - 1];
Now that the threading code has been committed, shall we close this one or keep it open for more rendering thread alternatives?
Also, should we allow controlling threading with a parameter? E.g. -threads 1..n would define the maximum number of threads to use, while -threads alone would ask Stella to automatically pick a suitable value.
I think since we moved discussion of this to gitter, this one can be closed. As for allowing a choice of threads, is there really much point? Maybe the choice should simply be on or off??
Agreed, on and off would do for a start.
OK, I'll get it added later this evening.
The code seems to be optimizable. Simply moving the vblank() check to the beginning improves maximum framerate by 20..25% in my tests.
And maybe it would be better to retrieve the colors in priority order and stop retrieving once an object is enabled?
EDIT: Looks like my tests are not reliable. The difference seems much smaller than I first thought (<5%).
BTW: I noticed that at maximum frame rate, the sound gets way behind.