rayon-rs / rayon

Rayon: A data parallelism library for Rust
Apache License 2.0

Busy yield-loops hurt performance #9

Closed · cuviper closed this issue 8 years ago

cuviper commented 8 years ago

I noticed that the kernel scheduler was showing up a lot in my perf report, and I tracked this down to the sheer number of sched_yield syscalls from waiting threads. Such busy-waiting is wasted CPU time, not only for your own process but also for anything else trying to run on the system.

I tried a crude patch below to use Condvar for the blocking latch wait, and it already shows a clear benefit. The other yield loop is in steal_until, which is harder because that's looping for either latch completion or available work -- suggestions welcome.

diff --git a/src/latch.rs b/src/latch.rs
index a81831d1cc7a..fbeff95acbd4 100644
--- a/src/latch.rs
+++ b/src/latch.rs
@@ -1,9 +1,11 @@
+use std::sync::{Mutex, Condvar};
 use std::sync::atomic::{AtomicBool, Ordering};
-use std::thread;

 /// A Latch starts as false and eventually becomes true. You can block
 /// until it becomes true.
 pub struct Latch {
+    m: Mutex<()>,
+    c: Condvar,
     b: AtomicBool
 }

@@ -11,19 +13,24 @@ impl Latch {
     #[inline]
     pub fn new() -> Latch {
         Latch {
+            m: Mutex::new(()),
+            c: Condvar::new(),
             b: AtomicBool::new(false)
         }
     }

     /// Set the latch to true, releasing all threads who are waiting.
     pub fn set(&self) {
+        let _guard = self.m.lock().unwrap();
         self.b.store(true, Ordering::SeqCst);
+        self.c.notify_all();
     }

     /// Spin until latch is set. Use with caution.
     pub fn wait(&self) {
+        let mut guard = self.m.lock().unwrap();
         while !self.probe() {
-            thread::yield_now();
+            guard = self.c.wait(guard).unwrap();
         }
     }

Before, running on Fedora 23 x86_64 with an i7-2600:

     Running `target/release/quicksort 1000`
Array length    : 1000K 32-bit integers
Repetitions     : 10
Parallel time   : 281422159
Sequential time : 848279007
Parallel speedup: 3.014258045685734

After, with Condvar wait:

     Running `target/release/quicksort 1000`
Array length    : 1000K 32-bit integers
Repetitions     : 10
Parallel time   : 225591277
Sequential time : 873530292
Parallel speedup: 3.872181157075502

nikomatsakis commented 8 years ago

The other yield loop is in steal_until, which is harder because that's looping for either latch completion or available work -- suggestions welcome.

Yeah, I figured I'd have to remove the busy loops sooner or later. It'd be worth benchmarking on other systems too (not sure what system you are using there). In the past, I've found that Linux in particular has a more efficient condvar implementation than other systems (I think it is based on futexes, but I've not researched this stuff in detail in a while). Another option is to use a spin loop with an increasing backoff -- i.e., sleep for increasing amounts of time.
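
For illustration, an increasing-backoff wait might look roughly like this. This is only a sketch of the idea; the helper name and the spin/sleep constants are made up, not anything rayon uses, and probe stands in for checking the latch:

use std::thread;
use std::time::Duration;

// Spin briefly, then sleep for exponentially increasing intervals
// until `probe` reports that the latch is set.
fn wait_with_backoff<F: Fn() -> bool>(probe: F) {
    let mut spins = 0u32;
    let mut sleep_us = 1u64;
    while !probe() {
        if spins < 64 {
            // Stay cheap while the latch is likely to be set very soon.
            spins += 1;
            thread::yield_now();
        } else {
            // Back off: sleep for increasing amounts of time, capped at 1 ms,
            // so idle workers stop hammering the scheduler.
            thread::sleep(Duration::from_micros(sleep_us));
            sleep_us = (sleep_us * 2).min(1_000);
        }
    }
}

A caller would use it as wait_with_backoff(|| latch.probe()).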

nikomatsakis commented 8 years ago

Heh, I was wondering why wait was having such an impact on performance: I thought that I only used it to block until the worker threads started, but I see I also used it to wait for the main threads to start. That was... silly. Good catch, a condvar definitely seems to make sense there.

mneumann commented 8 years ago

I tried rayon with my embarrassingly parallel evolutionary algorithm. It was about 10-15% slower than the simple_parallel version IIRC. I will try again to see if performance is on par.

AlisdairO commented 8 years ago

Bit of a hanger-on to this issue, but is SeqCst necessary for this AtomicBool? My (hurried) reading of it is that Latch essentially waits until b is true, and then allows other code to run - that being the case, it should be possible to use Ordering::Acquire for the probe method, and Ordering::Release for the set method. This would potentially have performance benefits on at least some platforms.
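
Concretely, the suggestion applied to a flag-only latch would look roughly like this (a sketch that ignores the mutex/condvar additions from the patch above, with probe assumed to be the existing load-and-check method):

use std::sync::atomic::{AtomicBool, Ordering};

/// Flag-only latch, shown just to illustrate the suggested orderings.
pub struct Latch {
    b: AtomicBool,
}

impl Latch {
    pub fn new() -> Latch {
        Latch { b: AtomicBool::new(false) }
    }

    /// Release: writes made before set() happen-before a probe() that sees true.
    pub fn set(&self) {
        self.b.store(true, Ordering::Release);
    }

    /// Acquire: pairs with the Release store in set().
    pub fn probe(&self) -> bool {
        self.b.load(Ordering::Acquire)
    }
}

On x86 the main saving is on the store side (a SeqCst store typically compiles to an xchg or an extra fence there), and the gap tends to be larger on weakly ordered architectures.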

cuviper commented 8 years ago

@nikomatsakis The numbers I gave were on Linux, Fedora 23 x86_64 on an i7-2600. I just tried Win10 x64 on an i7-3770K, and it goes from around 3-3.5x parallel speedup (quite variable!) to a more stable 4x speedup. (FWIW both of those are quad-core, giving 8 logical CPUs with hyperthreading.) I don't have OSX to try.

I can send a PR for the easy latch-wait condvar -- anything you want tweaked from the patch above?

@mneumann I'm curious to hear how much this helps you too!

@AlisdairO I think you're right that ordering could be relaxed. However, if we can get the latch fully converted to condvars, the atomic bool might go away in favor of just mutex-guarded state.
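
For comparison, a latch with the flag kept entirely under the mutex might look like this; just a sketch of the "mutex-guarded state" idea, not a proposed change:

use std::sync::{Condvar, Mutex};

/// Latch with no atomic at all: the flag lives inside the mutex.
pub struct CondvarLatch {
    m: Mutex<bool>,
    c: Condvar,
}

impl CondvarLatch {
    pub fn new() -> CondvarLatch {
        CondvarLatch { m: Mutex::new(false), c: Condvar::new() }
    }

    /// Set the flag under the lock and wake all waiters.
    pub fn set(&self) {
        let mut set = self.m.lock().unwrap();
        *set = true;
        self.c.notify_all();
    }

    /// Block until the flag becomes true.
    pub fn wait(&self) {
        let mut set = self.m.lock().unwrap();
        while !*set {
            set = self.c.wait(set).unwrap();
        }
    }
}

The trade-off is that every set() and every check of the flag now takes the lock, which is exactly the overhead discussed in the next comments.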

nikomatsakis commented 8 years ago

Hmm, I think I'd take the patch, but I'd prefer to separate the case of stolen jobs from other jobs, since only the latter benefit from the mutex/condvar arrangement; for the rest it is just overhead.

Niko

cuviper commented 8 years ago

Well, supposedly stealing is rare anyway, so overhead doesn't matter. Plus I hope eventually to figure out how to make stealing sleep more precisely too. But until then, sure, I can try tweaking it so only fully-awaited jobs will notify the condvar.
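
One possible way to keep set() cheap for latches nobody blocks on is to track registered waiters and only take the lock and notify when that count is non-zero. A hypothetical sketch of that idea (not necessarily what rayon ended up doing):

use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

pub struct Latch {
    b: AtomicBool,
    waiters: AtomicUsize,
    m: Mutex<()>,
    c: Condvar,
}

impl Latch {
    pub fn new() -> Latch {
        Latch {
            b: AtomicBool::new(false),
            waiters: AtomicUsize::new(0),
            m: Mutex::new(()),
            c: Condvar::new(),
        }
    }

    pub fn probe(&self) -> bool {
        self.b.load(Ordering::SeqCst)
    }

    /// Latches that are only ever probed (e.g. for stolen jobs) pay just an
    /// atomic store plus an atomic load here; the mutex is untouched.
    pub fn set(&self) {
        self.b.store(true, Ordering::SeqCst);
        if self.waiters.load(Ordering::SeqCst) > 0 {
            let _guard = self.m.lock().unwrap();
            self.c.notify_all();
        }
    }

    /// Fully-awaited latches register as waiters before blocking.
    pub fn wait(&self) {
        let mut guard = self.m.lock().unwrap();
        self.waiters.fetch_add(1, Ordering::SeqCst);
        while !self.probe() {
            guard = self.c.wait(guard).unwrap();
        }
        self.waiters.fetch_sub(1, Ordering::SeqCst);
    }
}

With SeqCst on both sides, either set() observes the registered waiter and notifies under the lock, or the waiter observes the flag already set and never blocks, so no wakeup is lost.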

nikomatsakis commented 8 years ago

Stealing is rare, but creating a latch is very, very common -- it happens on every fork.

mneumann commented 8 years ago

@nikomatsakis @cuviper : I redid my benchmarks. My numbers are:

Simple parallel: ~50 sec
Rayon rev e15798dfe: ~55.8 sec
Rayon rev 97b04cd7: ~56 sec
Rayon rev e15798dfe with threshold 1.0: ~54.7 sec
Rayon rev 97b04cd7 with threshold 1.0: ~49.8 sec

Note that I am using par_iter, and in my case it's important to run every single iteration in parallel. Only changing the THRESHOLD constant did the trick. Good news: it's now on par with simple_parallel. So when setting the threshold I do see a 10% improvement with the new code!!! Now waiting for a way to set the threshold without patching the code :)

nikomatsakis commented 8 years ago

@mneumann interesting. I had planned to include the option in the par_iter API to add weights, so that you don't need to manually adjust the threshold. I also wanted to have an "autotune" operation that would try to find an ideal weight experimentally. (But I'm not really sure if that's the best overall approach.)

mneumann commented 8 years ago

@nikomatsakis To clarify: the 10% improvement I saw was in CPU time, not in wall-clock time. simple_parallel was still a bit faster, but it also kept the CPU slightly busier.

I experimented a bit with making ParallelLen a trait and making the functions generic over it. The decision of when to fall back to sequential mode would then be encapsulated there. One would use it like:

// this is equivalent to THRESHOLD = 1.0
pop.into_par_iter::<FullyParallel>().map(...) { }.collect_into()

struct WeightedParallel { cost: f64, ... }
impl ParallelLen for WeightedParallel {
    const THRESHOLD: f64 = 10_000.0;
    // ...
    fn is_parallel(&self) -> bool { self.cost > Self::THRESHOLD && self.max_length > 1 }
}

pop.into_par_iter::<WeightedParallel>().map(...) {}

But I am not sure whether that's overkill. One could of course also use associated constants to specify the THRESHOLD.

I think it all depends on the type of algorithms being used. I can imagine algorithms where the left and right costs aren't always equally distributed (left and right subtrees, for instance). So we'd need specialized ParallelLen implementations anyway.

nikomatsakis commented 8 years ago

@mneumann this is more germane to #7, so I'm going to post my reply there.