ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License
384 stars 86 forks source link

Segmented ranges #1582

Open upsj opened 3 months ago

upsj commented 3 months ago

This PR adds a bunch of utility abstractions for simplifying common iteration patterns in Ginkgo.

  1. irange: Similar to the Python range([start, ] stop[, step]) function, the irange provides a range that can be used in range-for loops (and the corresponding iterators in standard library algorithms):

    for (int i = 0; i < size; i++)
    for (auto j = begin; j < end; j += step)
    // becomes
    for (auto i : irange<int>(size))
    for (auto j : irange<int>(begin, end, step))
    // or with C++17 CTAD
    for (auto i : irange(size))
    for (auto j : irange(begin, end, step))

    The range-for loop has the added advantage that the iteration variable can be made const, which prevents accidentally incrementing the wrong variable, e.g. in a nested loop. The compiler generates 100% identical code for this thanks to inlining, constant folding and common subexpression elimination.

  2. (enumerating_)indexed_iterator is pretty similar to permuting_iterator, but its intended usage is slightly different - it is mainly intended to provide strided access to an array, e.g. when stepping through a Csr matrix row with a whole warp, assigning nonzeros in a striped fashion. The enumerating_ variant returns a struct (index, value). The type is unlikely to be used directly, but used as additional functionality in the following segmented range abstraction.

  3. segmented_((enumerating_)value_)range provides a range-of-ranges abstraction that can cover most of our typical Csr row_ptrs access patterns. The final result is intended to look as follows:

    const auto begin = row_ptrs[row];
    const auto end = row_ptrs[row + 1];
    for (auto nz = begin; nz < end; nz++) {
    const auto col = cols[nz];
    const auto val = vals[nz];
    // ...
    }
    // with C++17 CTAD and structured bindings
    csr_view{row_ptrs, cols, vals, num_rows}; // the actual type will probably differ
    for (auto [nz, col, value] : csr_view.enumerate(row)) {
    // ...
    }

    Additionally, I am planning to make coalesced/striped work assignment more clear with some small helpers that turn regular ranges into strided ranges based on the warp/subgroup lane and size:

    const auto begin = row_ptrs[row];
    const auto end = row_ptrs[row + 1];
    for (auto nz = begin + subwarp.thread_rank(); nz < end; nz += subwarp.size()) {
    const auto col = cols[nz];
    const auto val = vals[nz];
    // ...
    }
    // with C++17 CTAD and structured bindings
    csr_view{row_ptrs, cols, vals, num_rows}; // the actual type will probably differ
    for (auto [nz, col, value] : csr_view.enumerate(row).striped(subwarp)) {
    // ...
    }

    Similarly, we can provide a blocked(subwarp) function, which assigns a consecutive block of indices to each thread, when that is useful. The inspiration for this approach comes from cub's striped and blocked work assignment strategies

TODO: