logicomacorp / WaveSabre

Official WaveSabre repository
MIT License

Multithreaded rendering in WaveSabrePlayerLib #5

Closed · yupferris closed this 5 years ago

yupferris commented 5 years ago

Since track rendering is inherently parallelizable, let's take advantage of this by dispatching track rendering to one or more worker threads instead of always rendering them in serial.

The result is typically a ~40% improvement in rendering speed when going from 1 to 2 cores, with an additional few percent when going to 3 or 4 cores (sometimes up to nearly 60% total!).

Some concrete measurements for rendering entire songs:

[image: improvements — per-song rendering-time measurements]

And a more visual representation of the rendering being handled by a single core vs. several cores:

[image: trashpanda-single — CPU usage while rendering on a single core]

[image: trashpanda-multi3 — CPU usage while rendering on three cores]

As expected, we see a slight drop in performance for the single-core case due to the additional overhead, but it's not significant. I don't expect any user would prefer that configuration anyway, so we shouldn't have to maintain an alternative implementation just for that case; users can always fork the player code and restore the serial path themselves if it's absolutely necessary (the old implementation is in the git history, and this PR can also serve as a guide for removing the threading).

Implementation-wise, this only really involves managing a user-defined number of worker threads (3 is used in the examples here) plus some small administration code, which has been designed to be as lightweight as possible and in particular not to allocate any memory dynamically. All player types (RealtimePlayer, PreRenderPlayer, and WavWriter) take advantage of this, so even if a user has to pre-render a song (which we offer an option for in our intros), it should still be faster to do so (and it's possible to allocate a different thread count for that if desired). Note that I would certainly not mind getting more eyes on this code, as I had some issues getting it to work initially; I think I've isolated and fixed all of the problems (both here and in isolated minimal reproductions), and it appears to be quite robust now.

To compare size with the new approach, I've packed trashpanda both with the single-core approach we had before (effectively matching the public release) and with 3 audio rendering cores. The multi-core executable is 188 bytes larger (60756 bytes before padding vs. 60568). Admittedly that's somewhat larger than I expected, but it's still not significant in the context of a 64k, and it more than pays for itself given such a large performance increase. If you'd like to test this build yourself, I've made it available here.

This PR also introduces some additional small cleanups/fixes (see commits for details). One such cleanup is the introduction of the CriticalSection class which should be a bit easier to use than the win32 API directly, as well as making it easier to implement platform-specific versions of this in the future.
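For illustration, a minimal sketch of what such a RAII wrapper could look like (this is an assumption inferred from the `auto criticalSectionGuard = criticalSection.Enter();` usage in the diff below, not the PR's actual code):

#include <windows.h>

class CriticalSection
{
public:
	// RAII guard; leaves the section when it goes out of scope
	class Guard
	{
	public:
		explicit Guard(CRITICAL_SECTION *cs)
			: cs(cs)
		{
			EnterCriticalSection(cs);
		}

		Guard(Guard &&other)
			: cs(other.cs)
		{
			other.cs = nullptr;
		}

		Guard(const Guard &) = delete;
		Guard &operator=(const Guard &) = delete;

		~Guard()
		{
			if (cs)
				LeaveCriticalSection(cs);
		}

	private:
		CRITICAL_SECTION *cs;
	};

	CriticalSection() { InitializeCriticalSection(&cs); }
	~CriticalSection() { DeleteCriticalSection(&cs); }

	Guard Enter() { return Guard(&cs); }

private:
	CRITICAL_SECTION cs;
};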

yupferris commented 5 years ago

I was thinking that if we still want to support the single-core case and not spawn extra rendering threads when we don't have to, it's actually not that hard. These should be the necessary changes (they look a lot worse than they really are; it's mostly just a couple of conditionals):

diff --git a/WaveSabrePlayerLib/src/RealtimePlayer.cpp b/WaveSabrePlayerLib/src/RealtimePlayer.cpp
index 74d2eb7..896c2e3 100644
--- a/WaveSabrePlayerLib/src/RealtimePlayer.cpp
+++ b/WaveSabrePlayerLib/src/RealtimePlayer.cpp
@@ -6,7 +6,8 @@ namespace WaveSabrePlayerLib
                : song(song)
                , numRenderThreads(numRenderThreads)
                , bufferSizeMs(bufferSizeMs)
-               , songRenderer(new SongRenderer(song, numRenderThreads))
+               // This SongRenderer instance is just used for getting tempo/sample rate etc, so let's not spawn extra render threads for it
+               , songRenderer(new SongRenderer(song, 1))
                , renderThread(nullptr)
        {
        }
diff --git a/WaveSabrePlayerLib/src/SongRenderer.cpp b/WaveSabrePlayerLib/src/SongRenderer.cpp
index bdf1808..b4ee89e 100644
--- a/WaveSabrePlayerLib/src/SongRenderer.cpp
+++ b/WaveSabrePlayerLib/src/SongRenderer.cpp
@@ -63,24 +63,30 @@ namespace WaveSabrePlayerLib
                }

                this->numRenderThreads = numRenderThreads;
-               renderThreads = new HANDLE[numRenderThreads];
-               renderThreadShutdown = false;
-
-               for (int i = 0; i < numRenderThreads; i++)
+               if (numRenderThreads > 1)
                {
-                       renderThreads[i] = CreateThread(0, 0, renderThreadProc, (LPVOID)this, 0, 0);
-                       SetThreadPriority(renderThreads[i], THREAD_PRIORITY_HIGHEST);
+                       renderThreads = new HANDLE[numRenderThreads];
+                       renderThreadShutdown = false;
+
+                       for (int i = 0; i < numRenderThreads; i++)
+                       {
+                               renderThreads[i] = CreateThread(0, 0, renderThreadProc, (LPVOID)this, 0, 0);
+                               SetThreadPriority(renderThreads[i], THREAD_PRIORITY_HIGHEST);
+                       }
                }
        }

        SongRenderer::~SongRenderer()
        {
-               // We don't need to enter/leave a critical section here since we're the only writer at this point.
-               renderThreadShutdown = true;
+               if (numRenderThreads > 1)
+               {
+                       // We don't need to enter/leave a critical section here since we're the only writer at this point.
+                       renderThreadShutdown = true;

-               WaitForMultipleObjects(numRenderThreads, renderThreads, TRUE, INFINITE);
+                       WaitForMultipleObjects(numRenderThreads, renderThreads, TRUE, INFINITE);

-               delete [] renderThreads;
+                       delete [] renderThreads;
+               }

                for (int i = 0; i < numDevices; i++) delete devices[i];
                delete [] devices;
@@ -97,22 +103,32 @@ namespace WaveSabrePlayerLib
        {
                MxcsrFlagGuard mxcsrFlagGuard;

-               // Dispatch work for render threads
+               if (numRenderThreads > 1)
                {
-                       auto criticalSectionGuard = criticalSection.Enter();
+                       // Dispatch work for render threads
+                       {
+                               auto criticalSectionGuard = criticalSection.Enter();

-                       for (int i = 0; i < numTracks; i++)
-                               trackRenderStates[i] = TrackRenderState::Idle;
+                               for (int i = 0; i < numTracks; i++)
+                                       trackRenderStates[i] = TrackRenderState::Idle;
+
+                               renderThreadNumFloatSamples = numSamples / 2;
+                       }

-                       renderThreadNumFloatSamples = numSamples / 2;
+                       // Wait for render threads to complete their work
+                       //  Note that we don't need to enter/leave a critical section here since we're the only reader at this point.
+                       //  However, if we don't want to sleep, entering/leaving a critical section here also works to not starve the
+                       //  worker threads it seems. Sleeping feels like the better approach to reduce contention though.
+                       while (trackRenderStates[numTracks - 1] != TrackRenderState::Finished)
+                               Sleep(0);
                }
+               else
+               {
+                       int numFloatSamples = numSamples / 2;

-               // Wait for render threads to complete their work
-               //  Note that we don't need to enter/leave a critical section here since we're the only reader at this point.
-               //  However, if we don't want to sleep, entering/leaving a critical section here also works to not starve the
-               //  worker threads it seems. Sleeping feels like the better approach to reduce contention though.
-               while (trackRenderStates[numTracks - 1] != TrackRenderState::Finished)
-                       Sleep(0);
+                       for (int i = 0; i < numTracks; i++)
+                               tracks[i]->Run(numFloatSamples);
+               }

                // Copy final output
                float **masterTrackBuffers = tracks[numTracks - 1]->Buffers;
gsuberland commented 5 years ago

Continuing the conversation from YouTube:

> I'd like to discuss there if possible, but in response to what you're saying: is CreateEvent/SetEvent also possibly a good way to dispatch the work to multiple threads in the first place? I.e., can multiple threads wait for the same event at the same time? I'm also thinking that if that's a good solution, it would mean there's no extra spinning being done by the worker threads unless they're doing work. That leads me to perhaps another way to update the track render states: perhaps an interlocked CAS on a given state is a better way to pull work off of that list. As you say, when a thread finishes the last one it can signal that work is finished, and the other threads can detect that the last track is finished and wait for the dispatch event again.

Yes, this is very much the approach I would take. To answer your questions on events: yes, multiple threads can wait on a single event using WaitForSingleObject / WaitForMultipleObjects. Signalling the event unblocks all threads that are waiting on it.

What I'd probably look into doing first is getting rid of the "was the final track rendered yet" wait loop and replacing it with an event waiting pattern. This'll save you a few CPU cycles and context switches, but more importantly it'll stall the thread properly so that you can better debug and profile things during runtime. Rather than all threads always executing and having to breakpoint the thread to see where it stopped, you'll be able to see that the thread is in a wait state.

In the constructor I'd use CreateEvent to set up a "final track rendered" event. Then, in renderThreadProc, just after you run the renderer, you can signal the event if nextTrackIndex is equal to songRenderer->numTracks - 1. You can then replace the loop above with a WaitForSingleObject call.

Next I'd consider taking the same approach to avoid spinning in renderThreadProc when there's no work to do. Effectively all you need to do there is have an event for "a track was just rendered", the idea being that you only check for new work when that event is signaled. The WaitForSingleObject call should slot right into your existing code, just before you check whether new work is available. Just make sure to exit the critical section around the WaitForSingleObject call, otherwise you'll deadlock. Then you can call SetEvent right after the run call completes.
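A hedged sketch of both suggestions (a sketch only; finalTrackRenderedEvent and trackRenderedEvent are hypothetical names, and the surrounding member access is assumed):

// In the SongRenderer constructor: auto-reset events, initially non-signaled
finalTrackRenderedEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);
trackRenderedEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);

// In renderThreadProc, right after a track's Run call completes:
SetEvent(songRenderer->trackRenderedEvent); // "a track was just rendered"
if (nextTrackIndex == songRenderer->numTracks - 1)
	SetEvent(songRenderer->finalTrackRenderedEvent);

// In renderThreadProc, when no work is currently available
// (make sure the critical section has been exited first, or this deadlocks):
WaitForSingleObject(songRenderer->trackRenderedEvent, INFINITE);

// In RenderSamples, replacing the Sleep(0) polling loop:
WaitForSingleObject(finalTrackRenderedEvent, INFINITE);

// In the destructor:
CloseHandle(trackRenderedEvent);
CloseHandle(finalTrackRenderedEvent);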

gsuberland commented 5 years ago

Slight correction w.r.t. events, as I was wrong above.

Events can be created as manual-reset or auto-reset. A manual-reset event will unblock all waiting threads until one of them calls ResetEvent. An auto-reset event will unblock a single waiting thread. The reset mode is set as a parameter in CreateEvent.
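Concretely, the reset mode is the second argument to CreateEvent:

HANDLE autoReset = CreateEvent(nullptr, FALSE, FALSE, nullptr);  // auto-reset: wakes one waiter, then resets itself
HANDLE manualReset = CreateEvent(nullptr, TRUE, FALSE, nullptr); // manual-reset: wakes all waiters until ResetEvent is called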

Auto-reset is the most convenient, but one thing to think about is the following race condition:

  1. Worker thread A completes a piece of work and marks it as Finished.
  2. Worker thread B completes a piece of work and marks it as Finished.
  3. Worker thread A calls SetEvent on the "I just finished some work" event.
  4. Worker thread B calls SetEvent on the "I just finished some work" event.
  5. Worker thread C is awoken, clearing the event. It completes one piece of work and stops.
  6. One piece of work is left stuck in the queue and is never completed.

There are two approaches to fix this.

The first is to have each worker thread continue grabbing work from the queue until there is none left, and only then call WaitForSingleObject to block again. This introduces a very slight performance loss whenever the race above would normally occur, as it leads to one worker thread performing two pieces of work sequentially instead of two threads handling them in parallel, but this will be rare because it requires some fairly precise execution phase differences in order to happen.

The other approach is to use manually resetting events and have all worker threads try to grab work at once every time you signal the event. This has slightly lower general-case performance in total. I would still suggest using the "loop until there's no work left, then wait" approach just to avoid any sneaky race conditions.

I was also wrong about where SetEvent should be called in my comment above. It should be called right after you mark the work as finished. Otherwise the next thread will come in and see no available work because the current bit of work is still marked as Rendering rather than Finished, and you'll deadlock again.

yupferris commented 5 years ago

Cheers @gsuberland, thanks for the very informative tips and explanations :)

I'm going to start by using an event to detect when the worker threads have finished their work. After that, I think I'll implement an event for when a worker thread has completed rendering the last track. Those two things alone should reduce contention a bit and, as you say, stall the threads properly, so we're not relying on some kind of roulette in the OS scheduler plus sleeping to make sure nothing locks up.

I'm a bit concerned with firing an event whenever a track completes, though. I can understand the motivation (we don't want the worker threads to loop over the queue repeatedly with nothing to do when not all tracks have been rendered but the only ones without outstanding dependencies are already being processed by other workers), but the deadlock issue you describe in your second comment seems particularly hairy to get around. At least with the current solution (and, I believe, also the first fix you describe, if I understand you correctly) that doesn't happen; we just spend a little extra time checking whether work is available. However, I think we'd spend significantly less time on contention if we use an interlocked CAS to update the work states. We'd still have lots of iterations over the work list until all work has finished, but at least we wouldn't have deadlocks anywhere, and we should still see a nice perf gain from dispatching work and waiting for completion with events, plus significantly less contention when pulling work off the queue. I think. :)
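For illustration, a minimal sketch of what claiming work with a CAS might look like (assuming the render states are stored as LONGs; the dependency checks the real renderer needs are omitted here):

// Each worker scans the track list and tries to atomically claim an idle track.
for (int i = 0; i < numTracks; i++)
{
	// Atomically flip Idle -> Rendering; only one thread can win this race
	if (InterlockedCompareExchange(&trackRenderStates[i],
		(LONG)TrackRenderState::Rendering,
		(LONG)TrackRenderState::Idle) == (LONG)TrackRenderState::Idle)
	{
		tracks[i]->Run(renderThreadNumFloatSamples);

		// Publish completion so other threads can observe it
		InterlockedExchange(&trackRenderStates[i], (LONG)TrackRenderState::Finished);
		break;
	}
}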

With that in mind, we should end up with a solution with just two (manual-reset) events: a "dispatch" event and an "all work done" event. The main thread (in RenderSamples) signals the "dispatch" event (which all worker threads are waiting on), then waits until the "all work done" event is signalled. The worker threads, woken by the "dispatch" event, churn through whatever work they can (sleeping briefly if no work can currently be done) until one of them renders the last track, at which point it both resets the "dispatch" event and signals the "all work done" event. The other worker threads should see that the final track was rendered on their next iteration looking for work; upon seeing that, they go back to waiting on the "dispatch" event. The main thread, woken by the "all work done" event, pulls out the final mix, resets the "all work done" event, and we're all set.

I think this implementation would make sense, be relatively simple, avoid deadlocks, and yield better and more predictable behavior/performance than what's on this branch currently. I might be a bit paranoid, but perhaps we also need events so that all worker threads can properly signal that they've seen there's no work left before we try to dispatch more? Perhaps that's not actually a problem, since any worker thread that sees no remaining work will just wait until the "dispatch" event fires again anyway. My concern is that a worker thread that had somehow slept for a long time might wake up and start processing work at the same time the main thread is clearing the render states before dispatch; perhaps that can never happen, though.

Additionally, I think we can reuse this same "dispatch" event for a clean exit: we'd set renderThreadShutdown to true as we do currently, then fire the dispatch event before waiting for the worker threads to terminate. The worker threads would then just check this condition before checking the work queue.
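A minimal sketch of that two-event protocol (a sketch only; the event names are hypothetical, not from the branch):

// Setup: both events manual-reset, initially non-signaled
HANDLE dispatchEvent = CreateEvent(nullptr, TRUE, FALSE, nullptr);
HANDLE allWorkDoneEvent = CreateEvent(nullptr, TRUE, FALSE, nullptr);

// Main thread (RenderSamples):
//   ... reset track render states ...
SetEvent(dispatchEvent);                         // wake all worker threads
WaitForSingleObject(allWorkDoneEvent, INFINITE); // block until the last track is rendered
ResetEvent(allWorkDoneEvent);                    // rearm for the next buffer
//   ... copy out the final mix ...

// Worker thread, upon rendering the final track:
ResetEvent(dispatchEvent);  // other workers go back to waiting on their next iteration
SetEvent(allWorkDoneEvent); // wake the main thread

// Destructor: reuse the dispatch event for a clean exit
renderThreadShutdown = true;
SetEvent(dispatchEvent);
WaitForMultipleObjects(numRenderThreads, renderThreads, TRUE, INFINITE);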

Please feel free to enlighten me further if it seems like I'm not quite getting it; this is not exactly my forte, so I'm very pleased you're willing to go over this and provide advice already. :)

gsuberland commented 5 years ago

Sounds like you've got a good handle on it, and I agree that that's probably a better approach if you're not worried too much about some wasted cycles. It's probably worth it in the end just to save time debugging any really rare/obscure race conditions.

Regarding the concerns about the threads waking up and grabbing work, that should be mitigated by the critical section - it should try to acquire the section, get blocked by the main thread, then unblock and find there's no work to do.

No probs answering questions on Windows threading / synchronisation stuff. It's one area I've had a lot of time to play around with! :P

kebby commented 5 years ago

Might be redundant, but I found the easiest way to check whether all worker threads have finished is to just have one integer variable and one auto-reset event. During work assignment, set the variable to the number of render threads; at the end of each render thread's work, InterlockedDecrement the variable, and if it hits 0, signal the event, which the main thread waits for.

As for letting the render threads start, one auto-reset event per render thread should be fine. You can use it for both signalling new work and exit; the thread can just check a global bool somewhere right after the WaitForSingleObject.

So the main thread would: take the critsec, set the # of threads, assign work, set all start events, release the critsec... then wait for the finished event. Done.

Bonus: if I'm not mistaken, you can get rid of the special handling for the single-threaded case that way. Just let the main thread run one iteration of what the render threads are doing. It'll dutifully set all the events right before it waits for them. Slightly redundant, but it works :)

kebby commented 5 years ago

Basically, in pseudo-C++

volatile LONG threadsRunning = 0;
HANDLE startEvents[] = ...   // one auto-reset event per render thread
HANDLE doneEvent = ...       // auto-reset "all work done" event
volatile bool exitFlag = false;

Render threads [# of cores - 1 of them; render thread 0 is actually the main thread]:

while (!exitFlag)
{
    RunRenderThread(threadNo); // threadNo starts at 1
}

void RunRenderThread(int threadNo)
{
    WaitForSingleObject(startEvents[threadNo], INFINITE);
    if (exitFlag)
        return;

    for (;;)
    {
        critSec.Enter();
        find_work(...);
        critSec.Leave();
        if (work)
            do_work(work);
        else
            break;
    }

    // The last thread to run out of work signals the main thread
    // (note the &: InterlockedDecrement takes a pointer)
    if (!InterlockedDecrement(&threadsRunning))
        SetEvent(doneEvent);
}

Main thread:

schedule_work();
threadsRunning = numthreads;
for (int i = 0; i < numthreads; i++)
    SetEvent(startEvents[i]);

// and now:
RunRenderThread(0);

WaitForSingleObject(doneEvent, INFINITE);
// and yay!

No sleeps, no state variables, guaranteed deadlock-free, no special treatment for single-core CPUs, and you don't even need the critical section in the main thread anymore, because all render threads are guaranteed not to be running at that moment. :)

yupferris commented 5 years ago

Thanks for this @kebby , def going with that strategy. 👍

yupferris commented 5 years ago

Latest results:

[image: new-improvements — updated per-song rendering-time measurements]

This is consistently faster (except for Bare Faced Cheek, which seems anomalously fast in the first tests), by up to 13%, which is great. Also it's a bit cleaner; no Sleeps :)

The old results again, for comparison:

[image: old-improvements — the original measurements]

Thanks to @gsuberland and @kebby for their input :)

yupferris commented 5 years ago

Size-wise, the new impl adds another 188 bytes (a total of 368 bytes for the entire multithreaded rendering impl), but I think that's still more than acceptable given the performance improvements here. I'm just gonna merge this tomorrow if nobody else has any more input.

kebby commented 5 years ago

There's one optimization with a lot of work for questionable returns :D : find some heuristic that estimates the CPU usage / render time of each track, and sort the tracks so that the threads process them in order of descending complexity. That way the render threads are used optimally, and it should give you another couple of %. You could easily do that at export/convert time - but I would probably be too lazy to do it :)
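A quick sketch of what that export-time step might look like (purely illustrative; the names and the cost source are assumptions, and any reordering would still have to respect the dependency order described in the reply below):

#include <algorithm>
#include <vector>

struct TrackInfo
{
    int index;            // position in the song's track list
    double estimatedCost; // e.g. measured render time from a trial render pass
};

// Order the work list so the most expensive tracks are claimed first.
// In practice this could only reorder within dependency levels, since
// dependencies must still be rendered before their dependents.
void sortByDescendingCost(std::vector<TrackInfo> &tracks)
{
    std::stable_sort(tracks.begin(), tracks.end(),
        [](const TrackInfo &a, const TrackInfo &b)
        {
            return a.estimatedCost > b.estimatedCost;
        });
}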

yupferris commented 5 years ago

Ha, yeah, nice idea, but I surely won't do it :) The current track order is dependency order (so that we could just store the tracks as an ordered list and process them without walking the dependency chain, as originally intended), which should at least help ensure tracks' dependencies are processed as early as possible. Perhaps that could be improved by also ranking tracks by how many dependents they have, rather than just taking whatever we get from a depth-first search through the graph from the master, but anything beyond that is probably overkill.