sandialabs / qthreads

Lightweight locality-aware user-level threading runtime.
https://www.sandia.gov/qthreads/

Tune distrib scheduler performance #39

Open ronawho opened 7 years ago

ronawho commented 7 years ago

We use the distrib scheduler for our numa configuration, but default to nemesis everywhere else because it has slightly better performance. We used to default to sherwood for numa, but it had significant performance issues, so the distrib scheduler was created/tuned for us. While its performance is far better than sherwood, we'd like to see if we can close the remaining performance gap with nemesis and default to distrib everywhere.

A few weeks ago I ran distrib against nemesis for our nightly performance suite. You can see the results here. More recent results will be skewed since I added the hybrid spin/condwait scheme for nemesis in our copy of qthreads, so it probably makes sense to tune distrib performance after doing the hybrid spin/condwait work.
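For context, the general shape of a hybrid spin/condwait idle loop is sketched below: an idle worker spins briefly hoping work arrives, then falls back to a condition wait so it stops burning a core. This is only an illustration, not the actual qthreads or Chapel runtime code; all names and constants are made up.

/* Hybrid spin/condwait idle loop (illustrative only; not qthreads code). */
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_LIMIT 100000   /* hypothetical tunable: spins before blocking */

typedef struct {
    atomic_int      work_available;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} idle_state_t;

/* Spin briefly hoping work shows up; fall back to a condition wait so an
   idle worker stops burning a core under sustained inactivity. */
static void wait_for_work(idle_state_t *s) {
    for (long i = 0; i < SPIN_LIMIT; i++) {
        if (atomic_load_explicit(&s->work_available, memory_order_acquire))
            return;                                /* work arrived while spinning */
    }
    pthread_mutex_lock(&s->lock);
    while (!atomic_load_explicit(&s->work_available, memory_order_acquire))
        pthread_cond_wait(&s->cond, &s->lock);     /* sleep until signaled */
    pthread_mutex_unlock(&s->lock);
}

/* Producer side: publish work, then wake any blocked waiters. */
static void post_work(idle_state_t *s) {
    atomic_store_explicit(&s->work_available, 1, memory_order_release);
    pthread_mutex_lock(&s->lock);
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->lock);
}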

I'd imagine you'll want/need more info from us and that this will be more iterative than some of the other feature requests, but I wanted to get an issue up as a placeholder.

Also note that we currently disable work stealing for distrib and it'd be nice to tune the work stealing so that we could enable it by default without noticeably hurting performance for well balanced workloads.
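One way to think about "enable stealing by default without hurting well-balanced workloads" is to throttle steal attempts, e.g. only try to steal after the local queue has been empty for several consecutive polls. The sketch below only illustrates that idea under made-up names (the threshold and helper stubs are not qthreads API), and is not the distrib scheduler's actual stealing path.

/* Throttled work stealing sketch (illustrative only; names made up). */
#include <stddef.h>

#define EMPTY_POLLS_BEFORE_STEAL 64   /* hypothetical tunable */

typedef struct task task_t;

static task_t *pop_local(void)                { return NULL; }  /* stub: this worker's queue */
static task_t *steal_from_random_victim(void) { return NULL; }  /* stub: pick a victim, steal */

/* Only pay the cost of a steal attempt after repeatedly finding the local
   queue empty, so balanced workloads rarely see failed steals. */
static task_t *next_task(void) {
    static _Thread_local unsigned empty_polls = 0;
    task_t *t = pop_local();
    if (t) {
        empty_polls = 0;
        return t;
    }
    if (++empty_polls < EMPTY_POLLS_BEFORE_STEAL)
        return NULL;                           /* keep polling locally for now */
    empty_polls = 0;
    return steal_from_random_victim();
}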

This is a relatively high priority item for us, but nemesis is still serving us well for most of our configurations so it's not blocking us on anything yet.

npe9 commented 7 years ago

@ronawho So I've started profiling the Chapel benchmarks to see what qthreads is up to, starting with Lulesh. Can you point me to the preferred command-line invocations and Chapel compile options for each benchmark?

ronawho commented 7 years ago

Just wanted to mention that I think performance comparisons between distrib and nemesis will be difficult until #38 is resolved.

As far as figuring out what commands to use:

All of our nightly performance testing is run with our testing infrastructure using start_test --performance and that will tell you exactly what commands were used to get the nightly graphs.

https://github.com/chapel-lang/chapel/blob/master/doc/developer/bestPractices/TestSystem.rst has more info on our testing system, but there's a lot there so I'll try to summarize relevant parts.

# from a clean chapel repo
source util/setchplenv.bash
make -j
make test-venv

start_test --performance examples/benchmarks/lulesh/lulesh.chpl

and then you'll see output like:

[Executing compiler $CHPL_HOME/bin/linux64/chpl -o lulesh --cc-warnings --fast --static --fast lulesh.chpl < /dev/null]
...
[Executing program ./lulesh --filename=lmeshes/sedov15oct.lmesh < /dev/null]

from that you can pick out the manual commands:

chpl -o lulesh --cc-warnings --fast --static --fast lulesh.chpl
./lulesh --filename=lmeshes/sedov15oct.lmesh 

Also note that you can figure out the command lines manually by looking at the .perfcompopts/PERFCOMPOPTS and .perfexecopts/PERFEXECOPTS files. start_test looks at those files to determine which flags to throw. In addition, start_test automatically throws --fast --static when --performance is thrown, so you'll want to make sure you throw those too.
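For example, you could peek at a test's option files directly. The exact paths below are a guess based on the lulesh test above, so adjust as needed:

# hypothetical paths; the real test directory layout may differ
cat examples/benchmarks/lulesh/lulesh.perfcompopts   # extra chpl flags used for performance runs
cat examples/benchmarks/lulesh/lulesh.perfexecopts   # extra runtime args used for performance runs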

npe9 commented 7 years ago

@ronawho Great, thanks! Regarding #38, I've incorporated your spin/condwait changes into my development branch. From my profiling of toy problems using lulesh it seems that >80% of the time is spent in the spin/condwait code in nemesis/distrib. My next step, now that I have a representative example, is to make the number of spins tunable in nemesis (I believe they already are in distrib) and to see if I can find a sweet spot for both schedulers.
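To make experimenting with that sweet spot concrete, the spin count could be read from the environment at init time, along the lines of the sketch below. The variable name and default here are hypothetical; qthreads' real tunables may be named differently.

/* Runtime-tunable spin count (hypothetical name and default). */
#include <stdlib.h>

static long spin_limit = 300000;                 /* made-up default */

static void init_spin_limit(void) {
    const char *env = getenv("QT_SPINCOUNT");    /* hypothetical env var */
    if (env) {
        long v = strtol(env, NULL, 10);
        if (v > 0)
            spin_limit = v;                      /* override default when set */
    }
}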

ronawho commented 7 years ago

Sounds good. When you have something you're happy with, we can test it in one of our perf playgrounds and see how it does across all our benchmarks.

ronawho commented 3 years ago

Just wanted to note we'd still love a NUMA aware work-stealing scheduler that has comparable performance to nemesis for well-balanced cases.

For us, we probably also only want to steal queued, but not yet started, tasks. We use thread-local storage in some cases, so we either need callbacks to be notified when a started task switches threads, or we need to avoid stealing started tasks altogether.
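As an illustration of the "queued but not started" policy, a steal path could simply skip any task that has already begun executing, so thread-local state set up by a running task never has to migrate. This is only a sketch under made-up names, not qthreads' actual data structures, and it elides all locking.

/* Steal only tasks that have never run (illustrative only; names made up). */
#include <stdbool.h>
#include <stddef.h>

typedef struct task {
    struct task *next;
    bool         started;   /* set once the task first runs on a worker */
} task_t;

typedef struct {
    task_t *head;           /* victim's ready queue (locking elided) */
} queue_t;

/* Return a stealable task from the victim's queue, or NULL. Tasks that have
   already started stay put so any TLS they rely on remains valid. */
static task_t *try_steal(queue_t *victim) {
    task_t *prev = NULL;
    for (task_t *t = victim->head; t != NULL; prev = t, t = t->next) {
        if (t->started)
            continue;                        /* never migrate started tasks */
        if (prev) prev->next = t->next;      /* unlink the stolen task */
        else      victim->head = t->next;
        return t;
    }
    return NULL;
}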