Closed cuviper closed 8 years ago
Yeah, I figured I'd have to remove the busy loops sooner or later. It'd be worth benchmarking on other systems too (not sure what system you are using there). In the past, I've found that linux in particular has a more efficient condvar impl than other systems (I think it is based on futexes, but I've not researched this stuff in detail in a while). Another option is to use a spin loop with an increasing backoff -- i.e., sleep for increasing amounts of time.
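For what it's worth, a minimal sketch of the "spin loop with an increasing backoff" idea: yield a few times first, then sleep for exponentially growing (capped) intervals. The function name, thresholds, and cap here are all illustrative, not anything rayon actually uses.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Wait for `flag` to become true: yield for the first few iterations,
/// then back off with exponentially increasing sleeps, capped at ~1ms.
/// All constants are illustrative.
fn wait_with_backoff(flag: &AtomicBool) {
    let mut spins = 0u32;
    while !flag.load(Ordering::Acquire) {
        if spins < 10 {
            thread::yield_now();
        } else {
            // exponential backoff: 1us, 2us, 4us, ... capped at 1000us
            let us = (1u64 << (spins - 10).min(10)).min(1000);
            thread::sleep(Duration::from_micros(us));
        }
        spins += 1;
    }
}

fn main() {
    let flag = Arc::new(AtomicBool::new(false));
    let f2 = Arc::clone(&flag);
    let setter = thread::spawn(move || {
        thread::sleep(Duration::from_millis(5));
        f2.store(true, Ordering::Release);
    });
    wait_with_backoff(&flag);
    setter.join().unwrap();
    println!("latch released");
}
```

The trade-off is latency versus wasted cycles: short waits still resolve quickly, while long waits stop hammering the scheduler with `sched_yield`.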
Heh, I was wondering why `wait` was having such an impact on performance: I thought that I only used it to block until the worker threads started, but I see I also used it to wait for the main threads to start. That was... silly. Good catch, a condvar definitely seems to make sense there.
I tried `rayon` with my embarrassingly parallel evolutionary algorithm. It was about 10-15% slower than the `simple_parallel` version IIRC. I will try again to see if performance is on par.
Bit of a hanger-on to this issue, but is SeqCst necessary for this AtomicBool? My (hurried) reading of it is that Latch essentially waits until b is true, and then allows other code to run - that being the case, it should be possible to use Ordering::Acquire for the probe method, and Ordering::Release for the set method. This would potentially have performance benefits on at least some platforms.
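To illustrate the suggested relaxation: `set` publishes with `Release` and `probe` observes with `Acquire`, so any writes made before `set` are guaranteed visible to a thread that sees `probe()` return true. This is a standalone sketch whose names merely mirror the latch being discussed, not rayon's actual implementation.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal latch with acquire/release ordering instead of SeqCst.
struct Latch {
    b: AtomicBool,
}

impl Latch {
    fn new() -> Latch {
        Latch { b: AtomicBool::new(false) }
    }

    /// Release store: pairs with the Acquire load in `probe`.
    fn set(&self) {
        self.b.store(true, Ordering::Release);
    }

    /// Acquire load: once this returns true, everything written
    /// before the matching `set` is visible to this thread.
    fn probe(&self) -> bool {
        self.b.load(Ordering::Acquire)
    }
}

fn main() {
    let latch = Latch::new();
    assert!(!latch.probe());
    latch.set();
    assert!(latch.probe());
    println!("latch set: {}", latch.probe());
}
```

On x86 this changes little (plain loads/stores already have acquire/release semantics), but on weakly ordered architectures like ARM it avoids the full fence that `SeqCst` implies.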
@nikomatsakis The numbers I gave were on Linux, Fedora 23 x86_64 on an i7-2600. I just tried Win10 x64 on an i7-3770K, and it goes from around 3-3.5x parallel speedup (quite variable!) to a more stable 4x speedup. (FWIW both of those are quad-core, giving 8 logical CPUs with hyperthreading.) I don't have OSX to try.
I can send a PR for the easy latch-wait condvar -- anything you want tweaked from the patch above?
@mneumann I'm curious to hear how much this helps you too!
@AlisdairO I think you're right that ordering could be relaxed. However, if we can get the latch fully converted to condvars, the atomic bool might go away in favor of just mutex-guarded state.
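The "mutex-guarded state" alternative mentioned here could look something like the following sketch. The type and method names are hypothetical, chosen only to echo the latch API under discussion; this is not the patch itself.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// A latch where the flag lives under the mutex, so no atomic is
/// needed: `wait` blocks in the kernel instead of spinning.
struct CondvarLatch {
    done: Mutex<bool>,
    cond: Condvar,
}

impl CondvarLatch {
    fn new() -> CondvarLatch {
        CondvarLatch { done: Mutex::new(false), cond: Condvar::new() }
    }

    fn set(&self) {
        let mut done = self.done.lock().unwrap();
        *done = true;
        self.cond.notify_all();
    }

    fn wait(&self) {
        let mut done = self.done.lock().unwrap();
        // loop to tolerate spurious wakeups
        while !*done {
            done = self.cond.wait(done).unwrap();
        }
    }
}

fn main() {
    let latch = Arc::new(CondvarLatch::new());
    let l2 = Arc::clone(&latch);
    let worker = thread::spawn(move || l2.set());
    latch.wait();
    worker.join().unwrap();
    println!("done");
}
```

The cost trade-off Niko raises below is visible here: `set` always takes the lock, which is overhead on every latch even when nobody is waiting.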
Hmm, I think I'd take the patch, but I'd prefer to separate the case of stolen jobs from other jobs, since only the latter benefit from the mutex condvar arrangement, and for the rest it is just overhead.
Niko
Well, supposedly stealing is rare anyway, so overhead doesn't matter. Plus I hope eventually to figure out how to make stealing sleep more precisely too. But until then, sure, I can try tweaking it so only fully-awaited jobs will notify the condvar.
Stealing is rare, but creating a latch is very, very common -- it happens on every fork.
On Thu, Dec 31, 2015, at 07:44 PM, Josh Stone wrote:
@nikomatsakis @cuviper : I redid my benchmarks. My numbers are:
- Simple parallel: ~50 sec
- Rayon rev e15798dfe: ~55.8 sec
- Rayon rev 97b04cd7: ~56 sec
- Rayon rev e15798dfe with threshold 1.0: ~54.7 sec
- Rayon rev 97b04cd7 with threshold 1.0: ~49.8 sec
Note that I am using `par_iter`, and in my case it's important to run every single iteration in parallel. Only changing the THRESHOLD constant did the trick. Good news: it's now on par with simple_parallel. So when setting the threshold I do see a 10% improvement with the new code!!! Now waiting for a way to set the threshold without patching the code :)
@mneumann interesting. I had planned to include the option in the `par_iter` API to add weights, so that you don't need to manually adjust the threshold. I also wanted to have an "autotune" operation that would try to find an ideal weight experimentally. (But I'm not really sure if that's the best overall approach.)
@nikomatsakis To clarify: the 10% improvement I saw was in cpu time, not in wall clock time. `simple_parallel` was still a bit faster, but it also kept the cpu slightly more busy.
I experimented a bit with making `ParallelLen` a trait and making the functions generic over it. The decision of when to fall back to sequential mode would then be encapsulated there. One would use it like:

```rust
// this is equivalent to THRESHOLD = 1.0
pop.into_par_iter::<FullyParallel>().map(...).collect_into(...);

struct WeightedParallel { cost: f64, /* ... */ }

impl ParallelLen for WeightedParallel {
    const THRESHOLD: f64 = 10_000.0;
    // ...
    fn is_parallel(&self) -> bool {
        self.cost > Self::THRESHOLD && self.max_length > 1
    }
}

pop.into_par_iter::<WeightedParallel>().map(...)
```
But I am not sure if it isn't overkill. One could also use associated constants to specify the THRESHOLD, of course.

I think it all depends on the type of algorithms being used. I can imagine algorithms where the left and right costs aren't always equally distributed (left and right subtrees, for instance). So we'd need specialized `ParallelLen` implementations anyway.
@mneumann this is more germane to #7, so I'm going to post my reply there.
I noticed that the kernel scheduler was showing up a lot in my perf report, and I tracked this down to having so many `sched_yield` syscalls in waiting threads. Such busy work is wasted cpu time, not only for your own process but also if anything else is trying to run on the system.

I tried a crude patch below to use `Condvar` for the blocking latch wait, and it already shows a clear benefit. The other yield loop is in `steal_until`, which is harder because that's looping for either latch completion or available work -- suggestions welcome.

Before, running on Fedora 23 x86_64 with an i7-2600:

After, with Condvar wait: