rayon-rs / rayon

Rayon: A data parallelism library for Rust
Apache License 2.0
11.07k stars 501 forks source link

Feature request: `.tuples()` #971

Open WhyNotHugo opened 2 years ago

WhyNotHugo commented 2 years ago

I see tuples() listed in Probably applicable but did not yet open issues

I'd find this very useful. I'm re-encoding images from LE-XRBG to BE-RGB:

    let mmap: memmap2::Mmap = ...
    let mut image_data = vec![0u8; mmap.len() * 3 / 4];
    mmap.iter()
        .tuples()
        .zip(image_data.iter_mut().tuples())
        .for_each(|((b, g, r, _), (r2, g2, b2))| (*r2, *g2, *b2) = (*r, *g, *b));

This can clearly be parallelised quite a bit, but I don't see any obvious way of writing it without tuples().

cuviper commented 2 years ago

The ugly part of tuples() is trying to be variadic in length, which doesn't have any language support. We deal with this a little in the IntoParallelIterator for tuples into a MultiZip, which we manually (macro) implement up to length 12. Itertools depends on a similar manual implementation of its own HomogeneousTuple trait.

For your particular case with slice-able data, you could also use par_chunks_exact(4) and par_chunks_exact_mut(3). You can still pattern-match that if you treat it as a refutable pattern within your for_each, like if let ([b, g, r, _], [r2, g2, b2]) {...}, even though that if let will never fail in practice. Eventually it can be a truly irrefutable argument pattern with arrays, #789, but I've been hoping that the standard library would stabilize the regular Iterator API for that first.

cuviper commented 2 years ago

I'm re-encoding images from LE-XRBG to BE-RGB:

This can clearly be parallelised quite a bit,

FWIW, it's not necessarily true that this task will benefit from pixel-level parallelization, especially since there's little real CPU work, just data movement. If you have many images to convert, it may work better to parallelize that coarsely with each image working serially, but ultimately you're going to be limited by I/O and memory bandwidth.

WhyNotHugo commented 2 years ago

I tend to agree with that last statement; it seems to be mostly memory IO limited (rather than CPU), so it might be a pointless optimisation. FWIW, the original code can also be re-written as:

    let mut image_data = vec![0u8; mmap.len() * 3 / 4];
    mmap.iter()
        .tuples()
        .zip(image_data.iter_mut().tuples())
        .for_each(|((b, g, r, _), (r2, g2, b2))| (*r2, *g2, *b2) = (*r, *g, *b));

Clearly it's just copying data around. I'd still be curious to benchmark with parallelisation, but performance has improved to the point where this is no longer the slowest thing.

I couldn't get chunks to work for this example, regrettably.

Shall I still leave this issue open for further discussion on .tuple()?

cuviper commented 2 years ago

I couldn't get chunks to work for this example, regrettably.

Here's a playground with chunks_exact, par_chunks_exact, and also using unstable standard library features for array_chunks and as_chunks.

Shall I still leave this issue open for further discussion on .tuple()?

Yes, we can leave that request open.