Open deadsoul44 opened 2 months ago
I found par_bridge:
if parallel {
    let feature_histograms = hist.0.data.chunk_by_mut(|a, b| a.num < b.num);
    feature_histograms
        .zip(col_index.iter())
        .par_bridge()
        .for_each(|(h, col)| {
            update_feature_histogram(
                h,
                data.get_col(*col),
                &sorted_grad,
                sorted_hess.as_deref(),
                &index[start..stop],
            );
        });
} else {
    col_index.iter().for_each(|col| {
        update_feature_histogram(
            hist.0.get_col_mut(*col),
            data.get_col(*col),
            &sorted_grad,
            sorted_hess.as_deref(),
            &index[start..stop],
        );
    });
}
But the performance is worse than the sequential counterpart. It seems like it keeps threads busy but uses only a single thread.
Parallel: average cpu time: 20.0, average wall time: 10.3
Sequential: average cpu time: 7.5, average wall time: 7.5
The slice is chunked before parallelization:
There's a direct parallel method for this:
https://docs.rs/rayon/latest/rayon/slice/trait.ParallelSliceMut.html#method.par_chunk_by_mut
It doesn't allow zip or enumerate.
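One possible workaround, sketched here with the standard library only: run chunk_by_mut sequentially, zip each chunk with its column index, collect the pairs into a Vec, and only then hand them to worker threads. The sketch swaps rayon for std::thread::scope so that it stands alone. The Bin fields, the simplified update_feature_histogram signature, and the build_histograms helper are all hypothetical stand-ins, not the code from the linked repository. With rayon, the collected Vec could presumably be consumed with into_par_iter instead, which avoids par_bridge pulling items one at a time from a sequential iterator.

```rust
use std::thread;

// Hypothetical stand-in for the Bin struct from the thread: `num` is the bin
// number within a feature, `count` is the value being accumulated.
#[derive(Debug)]
struct Bin {
    num: u16,
    count: u32,
}

// Hypothetical, simplified stand-in for update_feature_histogram: counts how
// often each bin number occurs in one feature column.
fn update_feature_histogram(hist: &mut [Bin], column: &[u16]) {
    for &b in column {
        hist[b as usize].count += 1;
    }
}

// Hypothetical helper: builds all feature histograms using scoped threads.
fn build_histograms(
    bins: &mut [Bin],
    columns: &[Vec<u16>],
    col_index: &[usize],
    n_threads: usize,
) {
    // Chunk sequentially: a new chunk starts whenever `num` stops increasing.
    // Each chunk is a disjoint `&mut [Bin]`, so it can be paired with its
    // column index before any thread is involved.
    let mut jobs: Vec<(&mut [Bin], usize)> = bins
        .chunk_by_mut(|a, b| a.num < b.num)
        .zip(col_index.iter().copied())
        .collect();

    // Split the jobs into contiguous batches, one batch per spawned thread.
    let per_thread = jobs.len().div_ceil(n_threads.max(1)).max(1);
    thread::scope(|s| {
        for batch in jobs.chunks_mut(per_thread) {
            s.spawn(move || {
                for (hist, col) in batch.iter_mut() {
                    update_feature_histogram(hist, &columns[*col]);
                }
            });
        }
    });
}
```

Collecting costs one small allocation per call (one entry per feature), which may or may not matter next to the histogram update itself. If each per-feature task is very short, the overhead of distributing the work could still outweigh the gain, so measuring remains necessary.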
I modified the Bin struct so that enumerate or zip is not needed, but the results are very similar to par_bridge:
Parallel: average cpu time: 19.6, average wall time: 10.5
You may need to run a profiler, like perf record on Linux, to find where you're spending time in the parallel version.
I am trying to mutate a slice in parallel. The slice is chunked before parallelization:
But I get the following error.
Is there any workaround to make this work?
I am trying to add parallelism to the following section: https://github.com/perpetual-ml/perpetual/blob/a5b1a69aa96999835cd909981f53eaa662884fad/src/histogram.rs#L264