sandialabs / qthreads

Lightweight locality-aware user-level threading runtime.
https://www.sandia.gov/qthreads/

Improve "vector" task spawn performance #47

Open · ronawho opened this issue 7 years ago

ronawho commented 7 years ago

It seems like there should be room to improve the performance of spawning a bunch of tasks at once. Currently, for coforalls, Chapel basically calls qthread_fork in a for loop, i.e. we turn something like:

coforall i in 1..10 { noop(i); }

into something roughly like:

var endCount: atomic int;
endCount.add((1..10).size);              // one completion signal per task
for i in 1..10 {
  args.i = i;
  args.endCount = endCount;
  qthread_fork(taskWrapper, args, NULL);
}
endCount.waitFor(0);                     // block until every task has finished

proc taskWrapper(args) {
  var i = args.i;
  var endCount = args.endCount;
  noop(i);
  endCount.sub(1);                       // signal completion
}

This works really well for us (performance on par with gcc's OpenMP, though behind intel/cce), but it seems like there should be a way to improve performance by enqueuing multiple tasks at once.


There is a qt_loop, which does tree spawning, but it seems to have worse performance than the for+fork idiom.

For a simple noop test, I see the Chapel-like scheme being about 2x faster than qt_loop_dc (and all the other qt_loop variants).

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

#include <qthread/qthread.h>
#include <qthread/qloop.h>
#include <qthread/qtimer.h>

/* Task body: decrement the shared completion counter. */
static aligned_t decTask(void* arg) {
  aligned_t * doneCount = (aligned_t*)arg;
  qthread_incr(doneCount, -1);

  return 0;
}

/* Chapel-like idiom: fork tasks one at a time, then spin until the
 * completion counter drops back to zero. */
static void qtChplLikeTaskSpawn(int64_t trials, int64_t numTasks) {
  int i, j;

  for (i=0; i<trials; i++) {
    aligned_t doneCount = 0;
    qthread_incr(&doneCount, numTasks);
    for (j=0; j<numTasks; j++) {
      qthread_fork(decTask, &doneCount, NULL);
    }
    while (qthread_incr(&doneCount, 0)) {
      qthread_yield();
    }
  }
}

static void emptyFunction(size_t start, size_t stop, void* arg) { }

/* qt_loop idiom: let qthreads tree-spawn the iterations. */
static void qtLoopTaskSpawn(int64_t trials, int64_t numTasks) {
  int i;
  for (i=0; i<trials; i++) {
    qt_loop_dc(0, numTasks, emptyFunction, NULL);
  }
}

int main() {
  qthread_initialize();
  int64_t numTasks = qthread_num_workers();
  int64_t numTrials = 500000;

  qtimer_t t = qtimer_create();
  qtimer_start(t);
  {
    qtLoopTaskSpawn    (numTrials, numTasks);
    //qtChplLikeTaskSpawn(numTrials, numTasks);
  }
  qtimer_stop(t);
  printf("Elapsed time for %" PRId64 " workers: %f\n", numTasks, qtimer_secs(t));
  qtimer_destroy(t);
  return 0;
}

On a 24-core (dual 12-core) Haswell machine:

spawn scheme           time (s)
for+fork (chpl-like)   ~3.25
qt_loop                ~7.15

Similar results on an older 4-core (single-socket) Core 2 Quad machine:

spawn scheme           time (s)
for+fork (chpl-like)   ~1.15
qt_loop                ~2.45

Those numbers are with nemesis. With distrib, absolute times were slower, but the trend between for+fork and qt_loop is similar.

qt_loop probably isn't the exact interface we'd want anyway, but I figured it's a good test of "optimized" spawn performance. We'd probably want an interface that takes the number of tasks to create, the function to call, and an array of args, instead of qt_loop's approach of passing the same arg to every task.
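
For concreteness, here's a rough sketch of the kind of interface I mean. The name and signature are made up (nothing like this exists in the qthreads API today), and this naive version is just semantically equivalent to the for+fork loop; the point is that a real implementation could hand the whole batch to the scheduler in one operation:

#include <stddef.h>
#include <qthread/qthread.h>

/* Hypothetical "vector fork": spawn numTasks tasks, passing task i the
 * argument at args + i*argSize. This reference version just loops over
 * qthread_fork; an optimized implementation could enqueue the whole
 * batch at once. */
static void qthread_fork_vector(qthread_f f, size_t numTasks,
                                void *args, size_t argSize) {
  char *a = (char *)args;
  for (size_t i = 0; i < numTasks; i++) {
    qthread_fork(f, a + i * argSize, NULL);
  }
}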

It looks like qt_loop does a fair number of unpooled allocations, so that could be a cause of the overhead, but I haven't dug in enough to know for sure.

m3m0ryh0l3 commented 7 years ago

The qthread_fork() mechanism uses a "spawn cache". Basically, it accumulates tasks to spawn in a thread-local list, and they only get pushed into the main scheduler as part of the next scheduling event (e.g. a yield or a blocking wait or something like that). They then get distributed via work stealing.
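
Roughly, the idea looks like this (an illustrative sketch, not the actual qthreads implementation; scheduler_push_batch is a made-up stand-in for the real scheduler hand-off):

#include <stddef.h>

typedef struct cached_task {
  struct cached_task *next;
  /* function pointer, argument, return slot, ... */
} cached_task_t;

extern void scheduler_push_batch(cached_task_t *batch);  /* hypothetical */

static __thread cached_task_t *spawn_cache = NULL;

/* Forks append to a thread-local list: no locking, no queue traffic. */
static void cached_fork(cached_task_t *t) {
  t->next = spawn_cache;
  spawn_cache = t;
}

/* At the next scheduling event (yield, blocking wait, ...), the whole
 * batch is handed to the shared scheduler, where idle workers can
 * steal from it. */
static void flush_on_scheduling_event(void) {
  if (spawn_cache) {
    scheduler_push_batch(spawn_cache);
    spawn_cache = NULL;
  }
}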

The qt_loop() mechanism is trying to ensure that each task lands in the right place, so it uses qthread_fork_to() under the covers, which touches the scheduler every time. That might be an easy fix: have it use the spawn cache as well.
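
For context, these are the two entry points in question (both exist in the qthreads API; the comments just paraphrase the behavior described above):

#include <qthread/qthread.h>

static aligned_t work(void *arg) { return 0; }

static void spawn_examples(void) {
  /* placement left to the scheduler; can go through the spawn cache */
  qthread_fork(work, NULL, NULL);
  /* pinned to a specific shepherd; touches the scheduler every time */
  qthread_fork_to(work, NULL, NULL, (qthread_shepherd_id_t)0);
}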

I'd have to put some real thought into how to go much faster than that...

olivier-snl commented 7 years ago

@m3m0ryh0l3 I've had a lot of problems diagnosing performance anomalies with the spawn cache, so I'm not sure I'd recommend its use. Besides, Chapel doesn't use Sherwood now.

@npe9 : Looks like the Chapel folks could also benefit from the work on "team tasks"

@ronawho : We are working on something like "task clones" for our internal customer whose use case is to execute many similar tasks by a team of threads. I think this could help here too.

ronawho commented 7 years ago

@m3m0ryh0l3 we build with the spawn cache disabled. https://github.com/chapel-lang/chapel/commit/2df9f3006a0bea3fca477c71aaacb1b626855eed has some details on why it was originally disabled (as I understand it, we disable the spawn cache because it's incompatible with our asynchronous begin statements).

And like @olivier-snl mentioned, we use nemesis by default, and distrib for our NUMA configurations, with the hope of switching over to distrib everywhere once some performance differences are resolved (#39). Generally speaking, sherwood performance wasn't great for us, and distrib was a dramatic improvement.

I forgot to put this in the issue description -- this is a relatively high priority for us, and obviously performance is important, but I expect this to be difficult and invasive, so I have no expectation of a quick turnaround. I mostly wanted to get it out there as something to think about, and to see if there were any existing constructs I wasn't aware of that could help.

ronawho commented 7 years ago

@olivier-snl do you have a branch or something you could point me to for more info on the "task clones" idea?

ronawho commented 7 years ago

@m3m0ryh0l3 Ah, thanks for the info on the intent of qt_loop.

Do you have any recollection of how much spawn caching helped performance? Was it only useful for schedulers that support multiple dequeuers, or do you think it would help with nemesis as well?

olivier-snl commented 7 years ago

@ronawho We are in the process of reworking our implementation. The current one uses a distinct API function with Sherwood to specify the number of cloned tasks, which get inserted onto a priority queue to be picked up by the workers in the shepherd. In the new implementation, we use a feature-flag option for cloning, along with changes to work with the per-worker distrib queues, like delaying the actual replication.

npe9 commented 7 years ago

@ronawho The initial commit of the teaming work by @stelleg is in https://github.com/Qthreads/qthreads/tree/spawn_shepherd . We'll be working to integrate it as we go along. Sometime soon, after I get all my infrastructure up, we should talk about integrating this work with Chapel.

m3m0ryh0l3 commented 7 years ago

@ronawho - the original impact of the spawn cache was substantial, though I forget the exact numbers. What we were benchmarking, essentially, was tight loops of qthread_fork() of null tasks (i.e. tasks that do nothing), primarily with the Sherwood scheduler. The challenge was that at startup you have the circling hordes of all the other shepherds looking for work as fast as possible, which means they're all locking and unlocking the queues all the time, with a relatively high probability of collisions. The more often the spawner had to touch the queue, the more collisions it would create.

The spawn cache was there to reduce those scheduling touches, but it has some nasty side effects (such as never actually spawning anything) if programmers use spin-loops (then again, lots of things in Qthreads start to break if the programmer uses a manual spin-loop). Anyway, for our little micro-benchmark, the spawn cache changed task spawning from something that got slightly slower as the core count went up to something that had a (low) constant-time cost, regardless of core count.

Spawn cache didn't help with the Nemesis (FIFO) scheduler, because Nemesis doesn't use locking in the same way - its lock-free queue isn't designed to accept multiple items at a time from a single source. That said, I bet the concept could be modified to support Nemesis, at the expense of the same drawbacks it has in Sherwood: when does the cache get flushed into the queues if the user doesn't do something scheduler-related right away (like yield or wait)? That problem, too, could probably be resolved with some kind of periodic cache-flusher thread (just off the top of my head).
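
For illustration, here's one generic way a batch enqueue can work on a linked FIFO (roughly the usual MPSC-queue trick, sketched from scratch rather than taken from NEMESIS's actual queue): pre-link the cached tasks into a chain off to the side, then publish the whole chain with a single atomic exchange on the tail.

#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
  _Atomic(struct node *) next;
  /* task payload ... */
} node_t;

typedef struct {
  _Atomic(node_t *) tail;  /* producers swap this; consumers follow next links */
  node_t stub;             /* queue is initialized pointing at a stub node */
} queue_t;

/* Publish a pre-linked chain [head..last] with one atomic exchange
 * instead of one exchange per task. Sketch only, not qthreads code. */
static void enqueue_chain(queue_t *q, node_t *head, node_t *last) {
  atomic_store(&last->next, NULL);
  node_t *prev = atomic_exchange(&q->tail, last);
  atomic_store(&prev->next, head);  /* chain becomes visible to consumers */
}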

However, it's probably worth acknowledging @olivier-snl's point that solutions like that up the complexity and could (especially if there are bugs) create the potential for hard-to-debug performance anomalies in certain cases.