Open ronawho opened 7 years ago
I tested the benchmark with nemesis and distrib. I got:
nemesis real 2m13.800s user 140m33.444s sys 1m52.144s
distrib real 2m17.940s user 144m54.988s sys 1m53.671s
I started looking at the profiling data as well, but I don't quite understand it yet.
Those times look off to me, especially for nemesis where I would expect the real and user times to be pretty similar since the blocked qthreads shouldn't be using up much CPU.
Are you using a chapel-like build? (--enable-condwait-queue, --disable-spawn-cache, etc.)
Not yet. What's your exact configure invocation? (Or where can I find it in the Chapel source?)
On some systems, "user" time accumulates across all the cores you're using. So, on the latest-and-greatest 72-core Skylake CPU, you could easily get 144 minutes of user time out of 2 minutes of wall-clock time.
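To make the arithmetic concrete, here's a quick sanity check using the 72-core figure from above:

```python
# "user" time sums CPU time across cores, so with every core fully busy:
wall_clock_minutes = 2
busy_cores = 72
user_minutes = wall_clock_minutes * busy_cores
print(user_minutes)  # 144
```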
@npe9 https://github.com/chapel-lang/chapel/blob/master/third-party/qthread/Makefile has everything that we set to build qthreads. I think Dylan and/or George also added something to https://github.com/Qthreads/qthreads/blob/master/scripts/build.pl to mimic our config.
And by default we run w/o hyperthreads, doing something like:
nemesis:
QT_GUARD_PAGES=false
QT_WORKER_UNIT=core
QT_NUM_SHEPHERDS=<num physical (not logical) cpus>
QT_NUM_WORKERS_PER_SHEPHERD=1
QT_STACK_SIZE=8388608 (8MB)
distrib:
QT_STEAL_RATIO=0
QT_GUARD_PAGES=false
QT_WORKER_UNIT=core
QT_HWPAR=<num physical (not logical) cpus>
QT_STACK_SIZE=8388608 (8MB)
You can look at our shim layer to see exactly what we set: https://github.com/chapel-lang/chapel/blob/master/runtime/src/tasks/qthreads/tasks-qthreads.c
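For reference, here's a small sketch of collecting the nemesis settings above into an environment for launching a benchmark binary. The helper name and the idea of parameterizing on physical core count are illustrative, not from the Chapel runtime:

```python
import os

def nemesis_env(num_physical_cores):
    """Build the nemesis env settings listed above.

    num_physical_cores should be the physical (not logical) core count;
    QT_STACK_SIZE is 8388608 bytes (8 MB).
    """
    return dict(os.environ,
                QT_GUARD_PAGES="false",
                QT_WORKER_UNIT="core",
                QT_NUM_SHEPHERDS=str(num_physical_cores),
                QT_NUM_WORKERS_PER_SHEPHERD="1",
                QT_STACK_SIZE=str(8 * 1024 * 1024))
```

You could pass the result as `env=` to `subprocess.run` when invoking the benchmark.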
@m3m0ryh0l3 yeah that's true for the machines I'm running on too, but my point was that while thread-ring creates a lot of tasks, only one is active at a time, and the others are blocked on a sync var.
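To illustrate that structure, here's a minimal thread-ring model in Python. Size-1 queue slots stand in for the sync vars (get blocks until "full", like a readFE; put fills the slot, like a writeEF), so only the task holding the token runs while everyone else is blocked. The real benchmark uses 503 tasks and millions of passes; the shape, not the numbers, is the point:

```python
import threading
from queue import Queue

def thread_ring(n_tasks, passes):
    # One size-1 queue per task plays the role of its sync var.
    slots = [Queue(maxsize=1) for _ in range(n_tasks)]
    winner = []

    def worker(i):
        while True:
            token = slots[i].get()          # block until we hold the token
            if token == 0:                  # countdown exhausted: we "win"
                winner.append(i + 1)        # report 1-based task id
                for q in slots:             # unblock everyone else
                    q.put(-1)
                return
            if token < 0:                   # shutdown signal
                return
            slots[(i + 1) % n_tasks].put(token - 1)  # pass token along

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_tasks)]
    for t in threads:
        t.start()
    slots[0].put(passes)                    # inject the token at task 1
    for t in threads:
        t.join()
    return winner[0]
```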
Here's my timings on a dual-socket machine (2× 12-core) for nemesis:
real 14.56
user 14.74
sys 0.04
@ronawho Using your environment variable settings (and 1 shepherd to make sure numa isn't screwing anything up) I get the following on a 2 socket 32 core haswell:
sherwood real 0m22.206s user 0m21.843s sys 0m0.326s
nemesis real 0m24.182s user 0m23.821s sys 0m0.322s
distrib real 0m23.903s user 0m23.551s sys 0m0.324s
Which is closer to what we'd expect. I'm going to dig deeper into what's going on using perf to profile the behavior and get back to you. Once I get this infrastructure up we can work on tuning distrib as well.
Got it to the right timing. Perf was interfering.
With distrib I'm getting 15.56s user 0.37s system 99% cpu 15.952 total
And what if you use <num physical cores> shepherds? For us, nemesis still gets good performance and the real/user times are pretty close, but for distrib we get worse perf and time -p shows most cpus are busy.
Here's a diff of nemesis and distrib running thread ring on 32 cores:
[nevans@shepard-lsm1 benchmarks]$ perf diff perf.data.nemesis perf.data.distrib
# Event 'cycles'
#
# Baseline    Delta  Shared Object         Symbol
# ........  .......  ....................  ..............................................
#
    20.44%   -0.18%  libqthread.so.19.0.0  [.] qthread_gotlock_fill_inner
    10.68%   +0.04%  libqthread.so.19.0.0  [.] qt_threadqueue_enqueue
     9.24%   +0.08%  libqthread.so.19.0.0  [.] qt_hash64
     7.93%   -0.06%  libqthread.so.19.0.0  [.] qthread_master
     7.07%   +0.26%  libqthread.so.19.0.0  [.] qthread_writeEF
     7.16%   -0.47%  libqthread.so.19.0.0  [.] qt_hash_lock
     5.73%   +0.01%  libqthread.so.19.0.0  [.] qt_swapctxt
     5.37%   +0.12%  libqthread.so.19.0.0  [.] qthread_readFE
     5.26%   -0.16%  libqthread.so.19.0.0  [.] qt_scheduler_get_thread
     4.58%   +0.03%  libqthread.so.19.0.0  [.] qt_hash_get_locked
     2.13%   -0.02%  libpthread-2.12.so    [.] pthread_getspecific
     1.67%   +0.20%  libqthread.so.19.0.0  [.] qt_mpool_alloc
     1.67%   +0.09%  [kernel.kallsyms]     [k] clear_page_c_e
     1.52%   +0.07%  libqthread.so.19.0.0  [.] qt_mpool_free
     1.50%   -0.04%  libqthread.so.19.0.0  [.] qt_mpool_internal_getcache
     1.07%   +0.08%  libqthread.so.19.0.0  [.] qt_setmctxt
     1.03%           libqthread.so.19.0.0  [.] qt_getmctxt
     0.95%   -0.04%  libqthread.so.19.0.0  [.] qthread_internal_self
     0.90%   -0.02%  ld-2.12.so            [.] __tls_get_addr
     0.67%   +0.01%  libqthread.so.19.0.0  [.] qthread_exec
     0.59%   +0.04%  libqthread.so.19.0.0  [.] qt_hash_unlock
     0.44%   +0.02%  libqthread.so.19.0.0  [.] qthread_writeEF_const
It looks like most of the overheads are in synchronization rather than the thread queue. Were you noticing this on thread ring or other performance benchmarks in general?
Let me try this in the chapel thread ring proper rather than qthreads benchmark and get back to you.
Ok, we don't have many benchmarks that use a lot of highly contended sync vars, so thread-ring performance isn't really indicative of performance for many other benchmarks.
The performance of thread-ring isn't actually that important to us, but you shouldn't let that stop you from working on it if this is fun/helping you get ramped up with Chapel.
@ronawho Exactly. I'm just getting a representative environment up. Looking at the performance results in #39, is there a particular benchmark I can spin on optimizing that models a particular job type you'd like to optimize? It looks like distrib and nemesis are within the noise of each other except for MiniMD and Fasta, where distrib wins, and thread ring and DGEMM, where nemesis is better.
In the meantime I'll work on profiling each of them.
Ah ok.
I think it's going to be tricky to compare nemesis and distrib until after #38 is resolved. Prior to using a hybrid spin/condwait approach for nemesis, distrib had faster task spawning, so I think that's why cases like minimd are faster for distrib.
I think it'd be easier to get the hybrid spin/condwait stuff done for nemesis and distrib such that task spawn test performs roughly the same for nemesis and distrib. Then I can restart nemesis vs. distrib performance testing and we can compare the results again.
This is still of mild interest to us, just because it's a somewhat interesting proxy for sync var and scheduler performance, but it's low priority.
thread-ring is one of the Computer Language Benchmarks Game (CLBG) codes.
Having Chapel sync vars map down to qthreads aligned_t sync vars (https://github.com/Qthreads/qthreads/issues/22, https://github.com/chapel-lang/chapel/pull/4489) dramatically improved performance of thread-ring for us. It resulted in a 6x speedup and put us in 4th place in the actual CLBG competition.
There are Haskell, Go, and F# versions that are beating us, and I'm curious what it would take to get closer to them. Thread-ring is a "parallel" application in the sense that it creates a lot of threads/tasks, but only one task actually executes at a time, so some of the versions that are beating us do things like limit the runtime to only use one real thread. I wonder if there's a better way to implement thread-ring over qthreads, or maybe scheduler/FEB optimizations that would get us better performance.
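For anyone unfamiliar with FEB semantics, here's a rough Python model of the readFE/writeEF pair that thread-ring leans on. The class is a sketch of mine; qthreads implements these operations on aligned_t words backed by hashed lock tables, which is why qt_hash_lock and friends dominate the profile above:

```python
import threading

class SyncVar:
    """Toy full/empty-bit cell modeling qthreads FEB semantics:
    readFE waits until full, reads, and marks empty;
    writeEF waits until empty, writes, and marks full."""

    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._val = None

    def writeEF(self, v):
        with self._cv:
            while self._full:               # wait for the cell to drain
                self._cv.wait()
            self._val, self._full = v, True
            self._cv.notify_all()           # wake blocked readers

    def readFE(self):
        with self._cv:
            while not self._full:           # wait for a value to arrive
                self._cv.wait()
            self._full = False
            self._cv.notify_all()           # wake blocked writers
            return self._val
```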
Here's a native qthreads version that's based off the Chapel version that we submitted. It runs a second or two faster than the Chapel version, but it's pretty similar otherwise.
We're really happy with the performance we're getting now, so this is a really low priority and it wouldn't bother me if this never got worked on. When I first started looking at mapping down to native qthreads sync vars I had a lot of fun playing with different implementations, so I figured this could be an interesting/fun side project or something to work on in the background that would still benefit us. Plus, I'd imagine improvements to it would improve the performance of FEB operations in general.