sandialabs / qthreads

Lightweight locality-aware user-level threading runtime.
https://www.sandia.gov/qthreads/

Bad performance for client-server workload #13

Closed eschnett closed 9 years ago

eschnett commented 9 years ago

I wrote an application where one thread starts many simultaneous worker threads and then waits for their results. These worker threads run for a long time (milliseconds to seconds). Although many more threads are started than there are cores available, I don't see the expected speedup. It seems as if the idle cores are too reluctant to pick up any work.

I can reproduce this in a simple benchmark. These are the results (explanation below):

make run NTHREADS=12 EXE=benchmark

Single-Node Benchmark

   serial, 1 workitems:               0.00874165 usec/iter, 8.74165 nsec/iter/item   (120371268 iters, 1.05224 sec)
   serial, 1000 workitems:            5.79649 usec/iter, 5.79649 nsec/iter/item   (200000 iters, 1.1593 sec)
   serial, 1000000 workitems:         5780.55 usec/iter, 5.78055 nsec/iter/item   (200 iters, 1.15611 sec)

   daisychained, 1 workitems:         35.6302 usec/iter, 35630.2 nsec/iter/item   (57276 iters, 2.04076 sec)
   daisychained, 1000 workitems:      33.5546 usec/iter, 33.5546 nsec/iter/item   (92872 iters, 3.11628 sec)
   daisychained, 1000000 workitems:   5849.84 usec/iter, 5.84984 nsec/iter/item   (200 iters, 1.16997 sec)

   parallel, 1 workitems:             4.05608 usec/iter, 4056.08 nsec/iter/item   (271145 iters, 1.09979 sec)
   parallel, 1000 workitems:          9.6849 usec/iter, 9.6849 nsec/iter/item   (200000 iters, 1.93698 sec)
   parallel, 1000000 workitems:       5864.31 usec/iter, 5.86431 nsec/iter/item   (200 iters, 1.17286 sec)

   tree, 1 workitems:                 1.01038 usec/iter, 1010.38 nsec/iter/item   (2000000 iters, 2.02076 sec)
   tree, 1000 workitems:              1.81506 usec/iter, 1.81506 nsec/iter/item   (700923 iters, 1.27222 sec)
   tree, 1000000 workitems:           582.596 usec/iter, 0.582596 nsec/iter/item   (2000 iters, 1.16519 sec)

There are four benchmarks:

  • "serial" is the base case without any multithreading
  • "daisychained" has each thread start one follow-on thread and should thus ideally have the same performance as "serial"
  • "parallel" has one thread starting many other threads (as in my application)
  • "tree" has each thread start two more threads, and only the leaves perform any work.

I run each benchmark three times with different amounts of work:

  • 1 workitem: basically no work, only overhead
  • 1000 workitems: small amount of work (about 1 us)
  • 1000000 workitems: large amount of work (about 1 ms)
The interesting numbers are those with the unit nsec/iter/item, which is the amortized time spent on each work item, including overhead. The most telling case is the one with 1000000 work items, where the thread-management overhead is negligible in comparison.
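(For example, in the serial run with 1000000 work items, 5780.55 usec/iter divided by 1000000 items gives the reported 5.78055 nsec/iter/item.)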

The two benchmarks "serial" and "daisychained" perform as expected. The last benchmark "tree" also fares well; I see a speed-up of about a factor of 10 when running on 12 cores -- that's acceptable.

I am puzzled by the behaviour of the "parallel" benchmark. There is no speedup, as if all the work were executed on a single core. I tried both with 2 shepherds (this machine has 2 NUMA nodes) and with 12 shepherds; there is not much difference. I am using 12 threads, and I am using hwloc to set and check hardware affinity. The results are reproducible.

Further notes:

  • I create threads with qthread_fork_syncvar, followed by qthread_yield_near (see the sketch at the end of this post)
  • For the "parallel" benchmark I first create 1M threads, then wait for each of them via readFF, in two sequential for loops
Do you have pointers to what I am doing wrong, or what I should try? Are there e.g. environment variables or configuration settings that I could try to influence the scheduler?
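
For reference, the spawn/wait pattern looks roughly like this. This is a simplified C sketch, not the actual benchmark code (which goes through a C++ wrapper and uses the syncvar variants of fork/readFF); work_item, run_parallel, and ntasks are placeholder names:

    #include <stdlib.h>
    #include <qthread/qthread.h>

    /* Placeholder for the real work function (about 1 us to 1 ms of work). */
    static aligned_t work_item(void *arg)
    {
        (void)arg;
        /* ... perform one work item ... */
        return 0;
    }

    /* Simplified "parallel" pattern: fork all tasks, then wait for all of them.
       ntasks counts spawned qthreads (tasks), not hardware threads. */
    static void run_parallel(size_t ntasks)
    {
        aligned_t *rets = malloc(ntasks * sizeof(aligned_t));

        for (size_t i = 0; i < ntasks; i++) {
            qthread_fork(work_item, NULL, &rets[i]);
            qthread_yield_near();              /* the call in question */
        }
        for (size_t i = 0; i < ntasks; i++) {
            aligned_t tmp;
            qthread_readFF(&tmp, &rets[i]);    /* block until task i has written its result */
        }
        free(rets);
    }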

dylan-stark commented 9 years ago

Hi, Erik,

Any chance you can provide the code for your parallel benchmark? That would help with reproducing the behavior on my side. I'll also note that there is test code that does something similar: test/benchmarks/mt/time_task_spawn.c. That one uses a yield-with-atomic-increment idiom, which is different from what you described, but it might be worth a look.
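
The idiom there is roughly a spawn-then-spin pattern like the following (a from-memory sketch rather than the actual contents of that file; null_task, donecount, and spawn_and_wait are placeholder names):

    #include <stddef.h>
    #include <qthread/qthread.h>

    static aligned_t donecount = 0;

    static aligned_t null_task(void *arg)
    {
        (void)arg;
        qthread_incr(&donecount, 1);   /* atomically count this task as done */
        return 0;
    }

    static void spawn_and_wait(size_t ntasks)
    {
        donecount = 0;
        for (size_t i = 0; i < ntasks; i++) {
            qthread_fork(null_task, NULL, NULL);   /* no return FEB needed */
        }
        while (donecount < ntasks) {
            qthread_yield();   /* let the scheduler run the pending tasks */
        }
    }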

Dylan

m3m0ryh0l3 commented 9 years ago

The answer is actually pretty easy: don't use qthread_yield_near(). That function does something very specific, and there's a reason why it's not well-documented. The underlying feature that you're abusing is known as the spawn-cache.

Under normal conditions, spawns do not get handed to the scheduler immediately, but queue up in a thread-local queue (which is quite fast, for obvious reasons). This queue is then handed to the scheduler at the next scheduling operation (e.g. a yield, a task termination, or a synchronization that causes the task to be blocked). This queue may be flushed manually with qthread_flushsc(), but most people shouldn't need to bother with this function (it's mostly useful for doing performance analyses).

As part of some performance analyses, qthread_yield_near() was invented to do something very specific with the spawn-cache. In particular, it yields directly to the next task in the spawn cache, and DOES NOT do a full scheduler operation (which means the spawn cache queue is not handed to the general scheduler). This can be quite useful if you know exactly what you're doing and why (e.g. you're swapping back and forth between two tasks that, for whatever good reason, should never migrate - this lets you avoid the performance impact of touching a shared scheduler object and its associated synchronization).

But if you don't know or understand what qthread_yield_near() does, you should NOT be using it, because you WILL experience behavior you don't expect (I suppose that's a bit of a tautology). If you're just doing general processing, qthread_yield_near() shouldn't be used for exactly the reason you've discovered: your tasks will not be handed to the general scheduler, and so your program will not execute in parallel. In the program you've described, your initial thread will spawn to the spawn cache, and then (with the qthread_yield_near()) will switch to executing that spawned task, until it either terminates or you call qthread_yield_near() again. That's a recipe for serial performance. Instead, you should either use qthread_yield() or (probably more appropriately) some form of synchronization (e.g. a sinc). The spawn-cache, for example, combined with a sinc, makes qthreads particularly good at spawning a million (or more) tasks in a tight loop and then waiting for them all to finish.
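
For the spawn-a-million-and-wait case, the shape is roughly the following. This is a sketch from memory using the plain qthread_fork/aligned_t interface; check qthread/sinc.h for the exact sinc signatures, and treat do_work and spawn_and_wait_all as placeholder names:

    #include <stddef.h>
    #include <qthread/qthread.h>
    #include <qthread/sinc.h>

    typedef struct { qt_sinc_t *sinc; } task_arg_t;

    static aligned_t do_work(void *arg)
    {
        task_arg_t *a = arg;
        /* ... do the work item ... */
        qt_sinc_submit(a->sinc, NULL);   /* count this task as finished; no value collected */
        return 0;
    }

    static void spawn_and_wait_all(size_t ntasks)
    {
        /* value-less sinc that expects ntasks submissions */
        qt_sinc_t *sinc = qt_sinc_create(0, NULL, NULL, ntasks);
        task_arg_t arg = { sinc };

        for (size_t i = 0; i < ntasks; i++) {
            qthread_fork(do_work, &arg, NULL);   /* spawns queue up in the spawn cache */
        }
        /* Blocking here is a scheduling operation, so the cached spawns are handed
           to the scheduler; this returns once all ntasks have submitted. */
        qt_sinc_wait(sinc, NULL);
        qt_sinc_destroy(sinc);
    }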

eschnett commented 9 years ago

@m3m0ryh0l3: Thanks for the explanation. I spoke with Dylan one or two years ago and dimly remembered qthread_yield_near, so I used it. I'll try both calling qthread_yield after spawning a thread and not calling anything at all.

eschnett commented 9 years ago

@dylan-stark: The code is available here https://bitbucket.org/eschnett/funhpc.cxx/src/4dea76238e39ac89e323b60c4f3f209a5a4ffb4d/examples/benchmark.cc?at=rewrite. This is the "rewrite" branch of a larger project. The benchmark source doesn't call Qthreads directly, but via a C++11-like API that is implemented in the qthreads subdirectory of the project, in particular in the future.hpp file there.

eschnett commented 9 years ago

The solution is to not yield after creating a thread, i.e. to trust Qthreads to do the right thing by itself.
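
In terms of the sketch in my first post (same placeholder names and includes), the change is simply to drop the yield from the spawn loop:

    /* Revised spawn loop: no qthread_yield_near() after the fork; the spawn
       cache is handed to the scheduler at the next scheduling operation. */
    static void run_parallel(size_t ntasks)
    {
        aligned_t *rets = malloc(ntasks * sizeof(aligned_t));

        for (size_t i = 0; i < ntasks; i++) {
            qthread_fork(work_item, NULL, &rets[i]);
        }
        for (size_t i = 0; i < ntasks; i++) {
            aligned_t tmp;
            qthread_readFF(&tmp, &rets[i]);
        }
        free(rets);
    }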

For reference, here are the new and improved numbers:

make run NTHREADS=12 EXE=benchmark

Single-Node Benchmark

   serial, 1 workitems:               0.00271034 usec/iter, 2.71034 nsec/iter/item   (439150660 iters, 1.19025 sec)
   serial, 1000 workitems:            2.40942 usec/iter, 2.40942 nsec/iter/item   (447771 iters, 1.07887 sec)
   serial, 1000000 workitems:         2419.98 usec/iter, 2.41998 nsec/iter/item   (457 iters, 1.10593 sec)

   daisychained, 1 workitems:         22.6158 usec/iter, 22615.8 nsec/iter/item   (60350 iters, 1.36486 sec)
   daisychained, 1000 workitems:      20.4329 usec/iter, 20.4329 nsec/iter/item   (100000 iters, 2.04329 sec)
   daisychained, 1000000 workitems:   2419.59 usec/iter, 2.41959 nsec/iter/item   (455 iters, 1.10091 sec)

   parallel, 1 workitems:             1.52354 usec/iter, 1523.54 nsec/iter/item   (1000000 iters, 1.52354 sec)
   parallel, 1000 workitems:          1.3248 usec/iter, 1.3248 nsec/iter/item   (1665660 iters, 2.20666 sec)
   parallel, 1000000 workitems:       201.691 usec/iter, 0.201691 nsec/iter/item   (5412 iters, 1.09155 sec)

   tree, 1 workitems:                 0.623142 usec/iter, 623.142 nsec/iter/item   (1606869 iters, 1.00131 sec)
   tree, 1000 workitems:              0.7808 usec/iter, 0.7808 nsec/iter/item   (2000000 iters, 1.5616 sec)
   tree, 1000000 workitems:           201.729 usec/iter, 0.201729 nsec/iter/item   (5423 iters, 1.09398 sec)

The speed-up is 11.96 on 12 cores. I can't complain.

Thanks a lot!