sandialabs / qthreads

Lightweight locality-aware user-level threading runtime.
https://www.sandia.gov/qthreads/

Have idle workers use a hybrid spin/condwait scheme #38

Closed ronawho closed 3 years ago

ronawho commented 7 years ago

Currently --enable-condwait-queue controls whether idle workers spinwait or do a condwait while looking for more work. Neither of these extremes works particularly well for us. We originally configured qthreads with --enable-condwait-queue because pure spinwaiting killed performance for serial applications or applications that didn't fully utilize all cores. However, as we've been looking into task spawning times, we've discovered that condwait dramatically increases task creation time. Because of this I added a simple hybrid spin/condwait approach to nemesis with https://github.com/chapel-lang/chapel/pull/4902 and https://github.com/chapel-lang/chapel/pull/4905. This dramatically improved task spawn times for us and brought a raw/native qthreads task spawn variant on par with OpenMP without hurting our single-threaded performance.
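A minimal sketch of the hybrid idea (hypothetical names; the actual nemesis queue uses its own structures and atomics rather than this pthread-only version):

/* Hybrid spin/condwait idle loop: spin briefly looking for new work, then
 * park on a condvar. All names here (waitq_t, try_dequeue) are illustrative. */
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

typedef struct task task_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  notempty;
} waitq_t;

/* Hypothetical helper: returns a task, or NULL if the queue is empty. */
extern task_t *try_dequeue(waitq_t *q);

task_t *hybrid_dequeue(waitq_t *q, unsigned long spincount) {
    task_t *t;

    /* Phase 1: spin for a bounded number of attempts so freshly spawned
     * tasks are picked up without paying for a kernel wakeup. */
    for (unsigned long i = 0; i < spincount; i++) {
        if ((t = try_dequeue(q)) != NULL) return t;
        sched_yield(); /* or a pause/cpu_relax hint */
    }

    /* Phase 2: fall back to a condwait so idle workers stop burning cores
     * when the application has little or no parallel work. */
    pthread_mutex_lock(&q->lock);
    while ((t = try_dequeue(q)) == NULL) {
        pthread_cond_wait(&q->notempty, &q->lock);
    }
    pthread_mutex_unlock(&q->lock);
    return t;
}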

Beyond the raw task spawn test, this had a dramatic performance impact on many key applications for us that you can see here

It seems likely that this hybrid spin/condwait approach would benefit most qthreads users, since presumably other users have serial sections of their code that could be hurt by the default spinwait approach. Our default number of spins is based on what gomp does, and I think that's a good source of inspiration.

gomp has an env var to explicitly control the spin count before sleeping and another env var for the wait policy that changes the default number of spins if the spin count isn't explicitly set. In our code we just have a SPINCOUNT env var for now, but I think it'd be ideal to also have a more user-friendly WAIT_POLICY env var which sets good defaults. I'm hoping you can basically take the simple hybrid code I linked to, clean it up, add a WAIT_POLICY env, and apply this scheme to all schedulers (or at least nemesis and distrib for us.)
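As a sketch of how the two env vars could interact (plain getenv here rather than qthreads' internal env helpers; the 300000 default is the value we use, and the ACTIVE/PASSIVE policy names mirror OMP_WAIT_POLICY and are only an assumption):

/* Derive the spin count from QT_WAIT_POLICY when QT_SPINCOUNT is not set
 * explicitly. Sketch only; not the actual qthreads interface. */
#include <limits.h>
#include <stdlib.h>
#include <strings.h>

#define DEFAULT_SPINCOUNT 300000UL

static unsigned long choose_spincount(void) {
    const char *spins  = getenv("QT_SPINCOUNT");
    const char *policy = getenv("QT_WAIT_POLICY");

    if (spins != NULL) {                        /* explicit count wins */
        return strtoul(spins, NULL, 10);
    }
    if (policy != NULL) {
        if (strcasecmp(policy, "ACTIVE") == 0)  return ULONG_MAX; /* effectively pure spin */
        if (strcasecmp(policy, "PASSIVE") == 0) return 0;         /* condwait immediately */
    }
    return DEFAULT_SPINCOUNT;                   /* hybrid default */
}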

Oh, and at least in the nemesis scheduler, there is a frustration atomic that is used as a kind of spin wait counter, but it also guards against unnecessary calls to condwait signal. I think the spinwait role can be removed, but I would imagine it's still good for performance to avoid the condwait signal in the qt_threadqueue_enqueue code.
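Roughly, the enqueue-side guard looks like this (C11 atomics and illustrative names; the real code lives in qt_threadqueue_enqueue and uses the frustration counter and qthreads' own atomics):

/* Only pay for a pthread_cond_signal when some worker has advertised that
 * it is sleeping (or about to sleep). Sketch with illustrative names. */
#include <pthread.h>
#include <stdatomic.h>

typedef struct {
    atomic_int      sleepers;   /* workers parked (or parking) in condwait */
    pthread_mutex_t lock;
    pthread_cond_t  notempty;
} cond_waitq_t;

static void enqueue_signal(cond_waitq_t *q) {
    /* ... push the task onto the queue first ... */

    /* Common case: every worker is still spinning or busy, so skip the
     * lock/signal entirely. */
    if (atomic_load_explicit(&q->sleepers, memory_order_acquire) > 0) {
        pthread_mutex_lock(&q->lock);
        pthread_cond_signal(&q->notempty);
        pthread_mutex_unlock(&q->lock);
    }
}

In a scheme like this, workers would increment the counter before calling pthread_cond_wait and decrement it on wakeup, which is the signaling role the frustration atomic could keep even if its spinwait role goes away.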

This is really important for us since it has such a dramatic impact on task spawn times and a lot of benchmarks.

ronawho commented 7 years ago

I have a pretty good idea of what I think the code will look like at least for nemesis and distrib, so feel free to ping me for more info.

m3m0ryh0l3 commented 7 years ago

Since environment variables have a flat namespace, SPINCOUNT should have some kind of prefix on it (e.g. QT and/or QTHREAD), to avoid name collisions with programs built on top of Qthreads (and/or Chapel).

ronawho commented 7 years ago

I wasn't very clear. https://github.com/chapel-lang/chapel/pull/4902 uses qt_internal_get_env_num so the environment variable is actually {QT,QTHREAD}_SPINCOUNT, I just used SPINCOUNT in the description.

npe9 commented 7 years ago

I made a branch for these changes. I also pushed the environment variable change. Any reason not to merge this into the main tree? All the tests pass.

ronawho commented 7 years ago

I would imagine you want to clean up the branch a little and add QT_WAIT_POLICY as well before merging the hybrid spin/condwait stuff into nemesis, but I suppose it could be done in multiple commits.

Let me know if you want to chat about what I had in mind for this.

npe9 commented 7 years ago

[chart: nemesisbackoff]

I ran an experiment today varying the backoff times with Lulesh on the 2dmesh example with Nemesis as the scheduler. It looks like the big inflection points are at the 4096 -> 8192 boundary on spins and between 10^10 -> 20^20 backoffs (I'm isolating the exact point now, then I'm going to profile them). Does that look like what you expect?

ronawho commented 7 years ago

I'm not following. Isn't backoff a "distrib" only setting?

npe9 commented 7 years ago

Sorry, SPINCOUNT; I'm getting my terminology confused.

# Sweep QT_SPINCOUNT over a range of values, recording one line of Lulesh output per run.
for i in `awk 'BEGIN {for(i=1; i < 2**22; i += 2**16) print 1073741824+i}'`; do
    export QT_SPINCOUNT=$i
    ./lulesh --filename lmeshes/sedov2cube.lmesh | sed 's/ //g;s/:/ /g' | awk '{print '$i'" "$0}'
done > 2cube3.dat

This is one example of an experiment I'm running if that makes it clearer.

ronawho commented 7 years ago

Ah ok. FWIW, a value of 300000 works really well for us with nemesis, and is based on what GNU's OpenMP implementation (gomp) does. There were no benchmarks in our test suite that benefitted from increasing it. That said, that testing was only on one of our perf testing machines (dual-socket 12-core (24 total) Haswells), so it might not be perfect everywhere. I actually went with 100000 originally, but bumped to 300000 after reading what gomp does, and it didn't hurt any serial benchmarks so I stuck with it.

npe9 commented 7 years ago

Interesting. Here's what I'm getting with more iterations of Lulesh:

[chart: spintime]

It seems like there's a 10% speedup from around the 4 million spin mark up to 15 million or so, where it slows down again. I've got a workflow for generating the graphs now, so I'll try with the Chapel benchmarks directly from here on out and we'll see if I can figure out more about the numbers.

ronawho commented 7 years ago

cool 👍

npe9 commented 7 years ago

@ronawho BTW I haven't forgotten about a design chat, but I'm waiting until I'm fully up to speed on how everything works.

ronawho commented 7 years ago

Sounds good

npe9 commented 7 years ago

@ronawho Now that I understand distrib quite a bit better, it's worth having that chat. I believe distrib's backoff works in a very similar way to Nemesis's backoff scheme. I've also run quite a few Chapel benchmarks with both schedulers and I think I have a feel for them. @olivier-snl do you have any objections to integrating the Nemesis condwait stuff into the main tree? Are there any benchmarks on the Qthreads side we want to tune Nemesis with?

ronawho commented 7 years ago

Sounds good, just ping me whenever you want to chat.

npe9 commented 7 years ago

@ronawho any objections to closing this one as well?

ronawho commented 7 years ago

https://github.com/Qthreads/qthreads/commit/f5b80c1c581b851142e53a98ec02756fd4844345 and https://github.com/Qthreads/qthreads/commit/ecf0330787724f094db2e1eccff940967e0552ad really just pulled in the commits that I originally made in Chapel.

As indicated in the issue description, this still needs to be applied/tuned for distrib, we'd like something like WAIT_POLICY to control the number of spins, and there are some improvements in nemesis that should be made with respect to the frustration atomic.

From the description

I'm hoping you can basically take the simple hybrid code I linked to, clean it up, add a WAIT_POLICY env, and apply this scheme to all schedulers (or at least nemesis and distrib for us.)

Oh, and at least in the nemesis scheduler, there is a frustration atomic that is used as a kind of spin wait counter, but it also guards against unnecessary calls to condwait signal. I think the spinwait role can be removed, but I would imagine it's still good for performance to avoid the condwait signal in the qt_threadqueue_enqueue code.

This is really important for us since it has such a dramatic impact on task spawn times and a lot of benchmarks, and I have to imagine this is important for other qthreads users.

ronawho commented 3 years ago

I'm going to close this. Nemesis is doing what we want and I think applying it to distrib is captured well enough by https://github.com/Qthreads/qthreads/issues/39.

It would be nice to have something like QT_WAIT_POLICY and to clean up the atomic spinloop in nemesis a bit, but it's not a high priority.