the method `par_iter` exists for struct `ChunkByMut<'_, BinF32, {closure@histogram.rs:207:78}>`, but its trait bounds were not satisfied

deadsoul44 commented 2 months ago

I am trying to mutate a slice in parallel. The slice is chunked before parallelization:

        let mut feature_histograms = hist.0.data.as_mut_slice().chunk_by_mut(|a, b| a.num < b.num);

        feature_histograms.par_iter().for_each(|h| {
            update_feature_histogram(
                h,
                data.get_col(0),
                &sorted_grad,
                sorted_hess.as_deref(),
                &index[start..stop],
            );
        });

But I get the following error.

error[E0599]: the method `par_iter` exists for struct `ChunkByMut<'_, BinF32, {closure@histogram.rs:207:78}>`, but its trait bounds were not satisfied
    --> src\histogram.rs:209:28
     |
209  |         feature_histograms.par_iter().for_each(|h| {
     |                            ^^^^^^^^ method cannot be called due to unsatisfied trait bounds
     |
    ::: C:\Users\ASUS\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib/rustlib/src/rust\library\core\src\slice\iter.rs:3347:1
     |
3347 | pub struct ChunkByMut<'a, T: 'a, P> {
     | ----------------------------------- doesn't satisfy `_: IntoParallelRefIterator<'_>`
     |
     = note: the following trait bounds were not satisfied:
             `&std::slice::ChunkByMut<'_, BinF32, {closure@src\histogram.rs:207:78: 207:84}>: IntoParallelIterator`
             which is required by `std::slice::ChunkByMut<'_, BinF32, {closure@src\histogram.rs:207:78: 207:84}>: rayon::iter::IntoParallelRefIterator<'_>`

Is there any workaround to make this work?

I am trying to add parallelism to the following section: https://github.com/perpetual-ml/perpetual/blob/a5b1a69aa96999835cd909981f53eaa662884fad/src/histogram.rs#L264

deadsoul44 commented 2 months ago

I found par_bridge:

    if parallel {
        let feature_histograms = hist.0.data.chunk_by_mut(|a, b| a.num < b.num);

        feature_histograms
            .zip(col_index.iter())
            .par_bridge()
            .for_each(|(h, col)| {
                update_feature_histogram(
                    h,
                    data.get_col(*col),
                    &sorted_grad,
                    sorted_hess.as_deref(),
                    &index[start..stop],
                );
            });
    } else {
        col_index.iter().for_each(|col| {
            update_feature_histogram(
                hist.0.get_col_mut(*col),
                data.get_col(*col),
                &sorted_grad,
                sorted_hess.as_deref(),
                &index[start..stop],
            );
        });
    }

But the performance is worse than the sequential counterpart. It seems like it keeps threads busy but uses only a single thread.

Parallel: average cpu time: 20.0, average wall time: 10.3 Sequential: average cpu time: 7.5, average wall time: 7.5

cuviper commented 2 months ago

The slice is chunked before parallelization:

There's a direct parallel method for this:

https://docs.rs/rayon/latest/rayon/slice/trait.ParallelSliceMut.html#method.par_chunk_by_mut

deadsoul44 commented 2 months ago

It doesn't allow zip or enumerate.

deadsoul44 commented 2 months ago

I modified the Bin struct so that enumerate or zip not needed but the results are very similar to par_bridge:

Parallel: average cpu time: 19.6, average wall time: 10.5

cuviper commented 2 months ago

You may need to run a profiler, like perf record on Linux, to find where you're spending time in the parallel version.

rayon-rs / rayon

the method `par_iter` exists for struct `ChunkByMut<'_, BinF32, {closure@histogram.rs:207:78}>`, but its trait bounds were not satisfied #1184