Closed thrust26 closed 7 years ago
My opinions on OMP vs. C++11 threading:
I'm not saying outright to just close the book on this approach, just that we shouldn't be overly consumed by getting it to work. IMO, improvements to the code itself are more important than threading improvements.
@thrust26, what is your current code? There have been several iterations now, but I haven't committed anything else yet. Also, maybe you can try your 'google code' with 4 threads instead of 2, to see what kind of performance it gives.
You mean the non-blargg with phosphor code?
I agree about the first point of OMP. Christian and I discussed this on gitter already. The 2nd point (and why we should try to use threading) is there to help slow, multi core devices.
@thrust26, what you called 'hand-written' code, or 'google code'. It's getting hard to keep track of things now, since this issue contains almost 100 messages and several different code examples.
What I was referring to is your original code that uses 2 C++11 threads. It was operating on the Blargg+phosphor code path, but obviously only applied to the phosphor loop (since we haven't even looked at optimizing Blargg yet).
I suggest using 4 threads instead of 2, to see (a) if it makes things any faster, and more importantly (b) that it doesn't drastically kill performance like OMP did.
I feel we're all ping-ponging a bit on this stuff, basically going all over the place. For example, in the process of optimizing the phosphor loop for use without threading, we may be making it more difficult to do with threading. So we need to decide which approach to take, and work from there.
Also, it's 10AM here now. At 1PM (3 hours from now) I won't be available for at least 3 hours (invigilating my final exam), so if there's anything you need from me, now is the time.
There you go!
See line 40 for enabling passive OMP wait policy.
case Filter::BlarggPhosphor:
contains multiple threading variations (and one without). I have no clue how to measure the effects; probably a better profiler could tell.
Applied threading to Blargg now too. This and the cleaned-up TIASurface are attached.
Profiling looks good:
Question: Should we really split the work across all available cores? Or should we reserve some cores, e.g. for the OS? Also, should we limit the number of threads? The threads are not running very long, so the overhead may eventually become larger than the gain.
I have the vague feeling that 2..4 threads are enough.
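Something like the following could express that heuristic. This is only a sketch of the idea from this discussion (the cap of 4 is our gut feeling here, not a measured optimum), and pickNumThreads is an illustrative name, not anything committed:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>

// Possible heuristic for the number of rendering threads: take what
// the hardware reports, but never fewer than 1 (hardware_concurrency()
// may return 0 when it cannot detect the core count) and cap at 4,
// since each frame's work is short and thread overhead eventually
// outweighs the gain.
uint32_t pickNumThreads()
{
  const uint32_t hw = std::thread::hardware_concurrency();  // may be 0
  return std::clamp(hw, 1u, 4u);
}
```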
Sorry, still having a brain fart and still not totally understanding the results of those graphs. In any event, as long as nothing slowed down, I'd be happy :smile:
As for limiting threads and reserving for the OS, I think 4 is probably enough too. But I'd like to hear what @DirtyHairy has to say, and also some testing from him in Linux.
EDIT: I also want to test on my 16-thread machine to see what happens.
@thrust26, the following patch solves your issue about passing the phosphor palette to AtariNTSC. I tried cleaner ways of doing it, but basically resorted to just copying the data. It's only 64KB, and the copy only happens when the phosphor mode changes. pass_phosphor.diff.zip
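A minimal sketch of that approach, with illustrative names (setPhosphorPalette, myPhosphorPalette, phosphorLUT are not necessarily what the actual patch uses): instead of sharing state with the owner of the palette, AtariNTSC keeps its own copy of the 256x256 blend table, which is exactly the 64KB mentioned above.

```cpp
#include <cstdint>
#include <cstring>

class AtariNTSC
{
  public:
    // Copy the caller's 256x256 phosphor blend table (64KB); after
    // this call AtariNTSC no longer depends on the palette's owner.
    // Only invoked when the phosphor mode changes, so the cost of the
    // memcpy is negligible.
    void setPhosphorPalette(const uint8_t palette[256][256])
    {
      std::memcpy(myPhosphorPalette, palette, sizeof(myPhosphorPalette));
    }

    // Look up the blended value for current colour 'c' and previous
    // colour 'p'.
    uint8_t phosphorLUT(uint8_t c, uint8_t p) const
    {
      return myPhosphorPalette[c][p];
    }

  private:
    uint8_t myPhosphorPalette[256][256];  // the 64KB local copy
};
```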
Attached the latest source code.
Rendering_Blargg_with_Phosphor.zip
And here is the profiling result, run twice. This should give an indication of the (rather low) precision of the values.
The attached git diff patch now compiles in Linux with gcc and clang. @DirtyHairy, you can apply it to your threading branch when you like. threading.diff.zip
A small update to the existing threading (if we want to continue using it). This now keeps the main thread busy too and spawns one fewer extra thread:
// - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
void AtariNTSC::render(const uInt8* atari_in, const uInt32 in_width, const uInt32 in_height,
                       void* rgb_out, const uInt32 out_pitch, uInt32* rgb_in)
{
  // Spawn myNumThreads - 1 worker threads...
  // (myThreads holds myNumThreads - 1 entries, hence the i-1 index)
  for (uInt32 i = 1; i < myNumThreads; i++)
    myThreads[i-1] = std::thread([=] {
      rgb_in == nullptr ?
        renderThread(atari_in, in_width, in_height, myNumThreads, i, rgb_out, out_pitch) :
        renderWithPhosphorThread(atari_in, in_width, in_height, myNumThreads, i, rgb_in, rgb_out, out_pitch);
    });

  // ...make the main thread do its share of the work too (part 0)...
  rgb_in == nullptr ?
    renderThread(atari_in, in_width, in_height, myNumThreads, 0, rgb_out, out_pitch) :
    renderWithPhosphorThread(atari_in, in_width, in_height, myNumThreads, 0, rgb_in, rgb_out, out_pitch);

  // ...and join the workers again
  for (uInt32 i = 1; i < myNumThreads; i++)
    myThreads[i-1].join();

  // Copy phosphor values into the output buffer
  if (rgb_in != nullptr)
    memcpy(rgb_out, rgb_in, in_height * out_pitch);
}
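The renderThread/renderWithPhosphorThread bodies aren't shown above, but the partitioning they presumably perform can be sketched like this (sliceFor and RowSlice are illustrative names): thread threadNum of numThreads takes a contiguous slice of scanlines, with integer arithmetic spreading any remainder so the slices tile the frame exactly.

```cpp
#include <cstdint>

// Half-open range of scanlines [first, last) handled by one thread.
struct RowSlice { uint32_t first, last; };

// Sketch of even row partitioning: multiplying before dividing spreads
// the remainder rows across the threads, and adjacent slices share a
// boundary, so every row is rendered exactly once.
RowSlice sliceFor(uint32_t height, uint32_t numThreads, uint32_t threadNum)
{
  return { height * threadNum       / numThreads,
           height * (threadNum + 1) / numThreads };
}
```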
Forgot this minor one in AtariNTSC::initialize(), line 40:
myThreads = new std::thread[myNumThreads - 1];
Now that the threading code has been committed, shall we close this one or keep it open for more rendering thread alternatives?
Also, should we allow controlling threading with a parameter? E.g. -threads 1..n would define the maximum number of threads to use, while -threads alone would ask Stella to automatically pick a suitable value.
I think since we moved discussion of this to gitter, this one can be closed. As for allowing a choice of threads, is there really much point? Maybe the choice should simply be on or off??
Agreed, on and off would do for a start.
OK, I'll get it added later this evening.
The code seems to be optimizable. Simply moving the vblank() check to the beginning improves maximum framerate by 20..25% in my tests.
And maybe it would be better to retrieve the colors in priority order and stop retrieving once an object is enabled?
EDIT: Looks like my tests are not reliable. The difference seems much smaller than I first thought (<5%).
BTW: I noticed that at maximum frame rate, the sound gets way behind.