sandialabs / qthreads

Lightweight locality-aware user-level threading runtime.
https://www.sandia.gov/qthreads/

Improve performance of "thread-ring" benchmark #48

Open ronawho opened 7 years ago

ronawho commented 7 years ago

thread-ring is one of the Computer Language Benchmarks Game (CLBG) codes.

Having Chapel sync vars map down to qthreads aligned_t sync vars (https://github.com/Qthreads/qthreads/issues/22, https://github.com/chapel-lang/chapel/pull/4489) dramatically improved performance of thread-ring for us. It resulted in a 6x speedup and put us in 4th place for the actual CLBG competition.
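
For anyone not familiar with FEBs, the mapping is conceptually just: each sync var becomes an aligned_t word, writes wait for the word to be empty and then mark it full, and reads wait for it to be full and then mark it empty. A minimal sketch of that (illustrative only, not code from the Chapel runtime):

#include <assert.h>
#include <qthread/qthread.h>

// Minimal FEB sketch; assumes qthread_initialize() has already run and
// that this is called from task context.
static aligned_t syncvar;

static void feb_sketch(void) {
  qthread_empty(&syncvar);              // start out empty, like an unwritten sync var
  qthread_writeEF_const(&syncvar, 42);  // write: wait for empty, store, mark full
  aligned_t v;
  qthread_readFE(&v, &syncvar);         // read: wait for full, load, mark empty
  assert(v == 42);
}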

There are Haskell, Go, and F# versions that are beating us, and I'm curious what it would take to get closer to them. Thread-ring is a "parallel" application in the sense that it creates a lot of threads/tasks, but only one task actually executes at a time, so some of the versions that are beating us do things like limit the runtime to use only one real thread. I wonder if there's a better way to implement thread-ring over qthreads, or maybe scheduler/FEB optimizations that would get us better performance.

Here's a native qthreads version that's based on the Chapel version that we submitted:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <assert.h>
#include <qthread/qthread.h>
#include <qthread/qloop.h>
// native qthreads version of thread-ring, based on Chapel's release version
//   https://github.com/chapel-lang/chapel/blob/master/test/release/examples/benchmarks/shootout/threadring.chpl

//static int n = 1000, ntasks = 503;
static int n = 50000000, ntasks = 503;
static aligned_t *mailbox;

// Each task repeatedly takes the token out of its own mailbox (readFE marks
// it empty), passes it to the next task's mailbox (writeEF marks that one
// full), and stops once the token has been passed n times.
static void passTokens(size_t start, size_t stop, void* arg) {
  uint64_t id = start;

  uint64_t numPasses = 0;
  do {
    qthread_readFE(&numPasses, &mailbox[id]);
    qthread_writeEF_const(&mailbox[(id+1)%ntasks], numPasses+1);
    if (numPasses == n) {
      printf("%ld\n", (long)id+1);
    }
  } while (numPasses < n);
}

static void init() {
  assert(qthread_initialize() == 0);
  printf("%i threads...\n", qthread_num_workers());

  // init array of syncvars (code grabbed from stress/feb_stream.c)
  mailbox = malloc(ntasks * sizeof(aligned_t));
  for (int i=0; i<ntasks; i++) {
    mailbox[i] = 0;
    qthread_empty(&mailbox[i]);
  }
}

int main(int argc, char **argv) {
  init();

  qthread_writeEF_const(&mailbox[0], 0);
  qt_loop_simple(0, ntasks, passTokens, NULL);

  return 0;
}

It runs a second or two faster than the Chapel version, but it's pretty similar otherwise.

We're really happy with the performance we're getting now, so this is a really low priority and it wouldn't bother me if this never got worked on. When I first started looking at mapping down to native qthreads sync vars I had a lot of fun playing with different implementations, so I figured this could be an interesting/fun side project or something to work on in the background that would still benefit us. Plus, I'd imagine improvements to it would improve the performance of FEB operations in general.

npe9 commented 7 years ago

I tested the benchmark with nemesis and distrib. I got:

nemesis real 2m13.800s user 140m33.444s sys 1m52.144s

distrib real 2m17.940s user 144m54.988s sys 1m53.671s

I started looking at the profiling data as well, but I don't quite understand it yet.

ronawho commented 7 years ago

Those times look off to me, especially for nemesis where I would expect the real and user times to be pretty similar since the blocked qthreads shouldn't be using up much CPU.

Are you using a chapel-like build? (--enable-condwait-queue, --disable-spawn-cache, etc.)

npe9 commented 7 years ago

Not yet. What's your exact configure invocation? (Or where can I find it in the Chapel source?)

m3m0ryh0l3 commented 7 years ago

On some systems, "user" time is multiplied by the number of cores you're using. So, on the latest-and-greatest 72-core Skylake CPU, you could easily get 144 minutes of user-time out of a 2-minute wall-clock time.

ronawho commented 7 years ago

@npe9 https://github.com/chapel-lang/chapel/blob/master/third-party/qthread/Makefile has everything that we set to build qthreads. I think Dylan and/or George also added something to https://github.com/Qthreads/qthreads/blob/master/scripts/build.pl to mimic our config.

And by default we run without hyperthreads, doing something like:

nemesis:

QT_GUARD_PAGES=false
QT_WORKER_UNIT=core
QT_NUM_SHEPHERDS=<num physical (not logical) cpus>
QT_NUM_WORKERS_PER_SHEPHERD=1
QT_STACK_SIZE=8388608 (8MB)

distrib:

QT_STEAL_RATIO=0
QT_GUARD_PAGES=false
QT_WORKER_UNIT=core
QT_HWPAR=<num physical (not logical) cpus>
QT_STACK_SIZE=8388608 (8MB)

You can look at our shim layer to see exactly what we set: https://github.com/chapel-lang/chapel/blob/master/runtime/src/tasks/qthreads/tasks-qthreads.c
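
If it helps, here's roughly what exporting the nemesis settings above looks like when done programmatically (an illustrative sketch only, not lifted from the shim; the shepherd count is a placeholder):

#include <assert.h>
#include <stdlib.h>
#include <qthread/qthread.h>

// Illustrative sketch: export the Chapel-like nemesis settings before
// initializing qthreads. Replace "24" with the number of physical
// (not logical) cpus on the machine.
static void configure_like_chapel_nemesis(void) {
  setenv("QT_GUARD_PAGES", "false", 1);
  setenv("QT_WORKER_UNIT", "core", 1);
  setenv("QT_NUM_SHEPHERDS", "24", 1);            // placeholder core count
  setenv("QT_NUM_WORKERS_PER_SHEPHERD", "1", 1);
  setenv("QT_STACK_SIZE", "8388608", 1);          // 8MB
  assert(qthread_initialize() == 0);
}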

@m3m0ryh0l3 yeah that's true for the machines I'm running on too, but my point was that while thread-ring creates a lot of tasks, only one is active at a time, and the others are blocked on a sync var.

Here are my timings for nemesis on a dual-socket machine (2 x 12-core):

real 14.56
user 14.74
sys 0.04
npe9 commented 7 years ago

@ronawho Using your environment variable settings (and 1 shepherd to make sure NUMA isn't screwing anything up) I get the following on a 2-socket, 32-core Haswell:

sherwood real 0m22.206s user 0m21.843s sys 0m0.326s

nemesis real 0m24.182s user 0m23.821s sys 0m0.322s

distrib real 0m23.903s user 0m23.551s sys 0m0.324s

Which is closer to what we'd expect. I'm going to dig deeper into what's going on using perf to profile the behavior and get back to you. When I get this infrastructure up we can work on tuning distrib as well.

npe9 commented 7 years ago

Got it to the right timing. Perf was interfering.

With distrib I'm getting 15.56s user 0.37s system 99% cpu 15.952 total

ronawho commented 7 years ago

And what if you use <num physical cores> shepherds? For us, nemesis still gets good performance and the real/user times are pretty close, but for distrib we get worse perf and time -p seems to show that most CPUs are busy.

npe9 commented 7 years ago

Here's a diff of nemesis and distrib running thread ring on 32 cores:

[nevans@shepard-lsm1 benchmarks]$ perf diff perf.data.nemesis perf.data.distrib
# Event cycles
#
# Baseline    Delta  Shared Object         Symbol
# ........  .......  ....................  ..............................................
#
    20.44%   -0.18%  libqthread.so.19.0.0  [.] qthread_gotlock_fill_inner
    10.68%   +0.04%  libqthread.so.19.0.0  [.] qt_threadqueue_enqueue
     9.24%   +0.08%  libqthread.so.19.0.0  [.] qt_hash64
     7.93%   -0.06%  libqthread.so.19.0.0  [.] qthread_master
     7.07%   +0.26%  libqthread.so.19.0.0  [.] qthread_writeEF
     7.16%   -0.47%  libqthread.so.19.0.0  [.] qt_hash_lock
     5.73%   +0.01%  libqthread.so.19.0.0  [.] qt_swapctxt
     5.37%   +0.12%  libqthread.so.19.0.0  [.] qthread_readFE
     5.26%   -0.16%  libqthread.so.19.0.0  [.] qt_scheduler_get_thread
     4.58%   +0.03%  libqthread.so.19.0.0  [.] qt_hash_get_locked
     2.13%   -0.02%  libpthread-2.12.so    [.] pthread_getspecific
     1.67%   +0.20%  libqthread.so.19.0.0  [.] qt_mpool_alloc
     1.67%   +0.09%  [kernel.kallsyms]     [k] clear_page_c_e
     1.52%   +0.07%  libqthread.so.19.0.0  [.] qt_mpool_free
     1.50%   -0.04%  libqthread.so.19.0.0  [.] qt_mpool_internal_getcache
     1.07%   +0.08%  libqthread.so.19.0.0  [.] qt_setmctxt
     1.03%            libqthread.so.19.0.0  [.] qt_getmctxt
     0.95%   -0.04%  libqthread.so.19.0.0  [.] qthread_internal_self
     0.90%   -0.02%  ld-2.12.so            [.] __tls_get_addr
     0.67%   +0.01%  libqthread.so.19.0.0  [.] qthread_exec
     0.59%   +0.04%  libqthread.so.19.0.0  [.] qt_hash_unlock
     0.44%   +0.02%  libqthread.so.19.0.0  [.] qthread_writeEF_const

It looks like most of the overhead is in synchronization rather than the thread queue. Were you noticing this on thread-ring specifically, or on other performance benchmarks in general?
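
For reference, here's a sketch of a two-task FEB ping-pong micro-benchmark (hypothetical, not code from this issue) that exercises readFE/writeEF with almost no thread-queue traffic; timing it under time(1) would separate the synchronization cost from scheduling:

#include <assert.h>
#include <qthread/qthread.h>

// Hypothetical micro-benchmark: two tasks ping-pong a token through a
// pair of FEB-protected words, so nearly all of the time goes to
// qthread_readFE/qthread_writeEF rather than the thread queue.

#define ITERS 1000000

static aligned_t ping, pong;

static aligned_t pinger(void *arg) {
  (void)arg;
  aligned_t v = 0;
  for (int i = 0; i < ITERS; i++) {
    qthread_writeEF_const(&ping, v);      // fill ping
    qthread_readFE(&v, &pong);            // wait for pong, empty it
  }
  return 0;
}

static aligned_t ponger(void *arg) {
  (void)arg;
  aligned_t v;
  for (int i = 0; i < ITERS; i++) {
    qthread_readFE(&v, &ping);            // wait for ping, empty it
    qthread_writeEF_const(&pong, v + 1);  // fill pong
  }
  return 0;
}

int main(void) {
  assert(qthread_initialize() == 0);
  qthread_empty(&ping);
  qthread_empty(&pong);

  aligned_t r1, r2, done;
  qthread_fork(pinger, NULL, &r1);
  qthread_fork(ponger, NULL, &r2);
  qthread_readFF(&done, &r1);             // block until each task returns
  qthread_readFF(&done, &r2);
  return 0;
}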

npe9 commented 7 years ago

Let me try this with the Chapel thread-ring proper rather than the qthreads benchmark and get back to you.

ronawho commented 7 years ago

Ok, we don't have many benchmarks that use a lot of highly contended sync vars, so thread-ring performance isn't really indicative of performance for many other benchmarks.

The performance of thread-ring isn't actually that important to us, but you shouldn't let that stop you from working on it if this is fun/helping you get ramped up with Chapel.

npe9 commented 7 years ago

@ronawho Exactly. I'm just getting a representative environment up. Looking at the performance results in #39, is there a particular benchmark I can spin on that models a job type you'd like to optimize? It looks like distrib and nemesis are within the noise of each other, except for MiniMD and Fasta where distrib wins, and thread ring and DGEMM where nemesis is better.

In the meantime I'll work on profiling each of them.

ronawho commented 7 years ago

Ah ok.

I think it's going to be tricky to compare nemesis and distrib until after #38 is resolved. Prior to using a hybrid spin/condwait approach for nemesis, distrib had faster task spawning, so I think that's why cases like minimd are faster for distrib.

I think it'd be easier to get the hybrid spin/condwait stuff done for nemesis and distrib such that the task-spawn test performs roughly the same for both. Then I can restart the nemesis vs. distrib performance testing and we can compare the results again.

ronawho commented 3 years ago

This is still of mild interest to us, just because it's a somewhat interesting proxy for sync var and scheduler performance, but it's low priority.