Segmented ranges - Githubissues

This PR adds a bunch of utility abstractions for simplifying common iteration patterns in Ginkgo.

irange: Similar to the Python range([start, ] stop[, step]) function, the irange provides a range that can be used in range-for loops (and the corresponding iterators in standard library algorithms):
```
for (int i = 0; i < size; i++)
for (auto j = begin; j < end; j += step)
// becomes
for (auto i : irange<int>(size))
for (auto j : irange<int>(begin, end, step))
// or with C++17 CTAD
for (auto i : irange(size))
for (auto j : irange(begin, end, step))
```
The range-for loop has the added advantage that the iteration variable can be made const, which prevents accidentally incrementing the wrong variable, e.g. in a nested loop. The compiler generates 100% identical code for this thanks to inlining, constant folding and common subexpression elimination.
(enumerating_)indexed_iterator is pretty similar to permuting_iterator, but its intended usage is slightly different - it is mainly intended to provide strided access to an array, e.g. when stepping through a Csr matrix row with a whole warp, assigning nonzeros in a striped fashion. The enumerating_ variant returns a struct (index, value). The type is unlikely to be used directly, but used as additional functionality in the following segmented range abstraction.

segmented_((enumerating_)value_)range provides a range-of-ranges abstraction that can cover most of our typical Csr row_ptrs access patterns. The final result is intended to look as follows:

const auto begin = row_ptrs[row];
const auto end = row_ptrs[row + 1];
for (auto nz = begin; nz < end; nz++) {
const auto col = cols[nz];
const auto val = vals[nz];
// ...
}
// with C++17 CTAD and structured bindings
csr_view{row_ptrs, cols, vals, num_rows}; // the actual type will probably differ
for (auto [nz, col, value] : csr_view.enumerate(row)) {
// ...
}

Additionally, I am planning to make coalesced/striped work assignment more clear with some small helpers that turn regular ranges into strided ranges based on the warp/subgroup lane and size:

const auto begin = row_ptrs[row];
const auto end = row_ptrs[row + 1];
for (auto nz = begin + subwarp.thread_rank(); nz < end; nz += subwarp.size()) {
const auto col = cols[nz];
const auto val = vals[nz];
// ...
}
// with C++17 CTAD and structured bindings
csr_view{row_ptrs, cols, vals, num_rows}; // the actual type will probably differ
for (auto [nz, col, value] : csr_view.enumerate(row).striped(subwarp)) {
// ...
}

Similarly, we can provide a blocked(subwarp) function, which assigns a consecutive block of indices to each thread, when that is useful. The inspiration for this approach comes from cub's striped and blocked work assignment strategies

TODO:

[ ] more comprehensive tests
[ ] device tests to check compilation
[ ] strided ranges with group support

ginkgo-project / ginkgo

Segmented ranges #1582