Generic iteration function for n dimension mdspan

SimeonEhrig commented 1 year ago

Hi, I'm working on vikunja, a platform-independent primitives library (e.g. vikunja::transform, vikunja::reduce) for different accelerators based on alpaka. I want to use mdspan to support N dimension memory. In the presentation of Bryce Aldelstein Lelbach (youtube link) there is the idea to build a recursive function to create n nested loops, which the compiler can optimize.

I implemented a prototype with submdspan for an iota function for a mdspan:

namespace stdex = std::experimental;

template<int TDim>
struct Iterate_mdspan_impl;

template<>
struct Iterate_mdspan_impl<1>
{
    template<typename TSpan, typename TFunc>
    void operator()(TSpan span, TFunc& functor)
    {
        for(auto i = 0; i < span.extent(0); ++i)
        {
            span(i) = functor(span(i));
        }
    }
};

template<>
struct Iterate_mdspan_impl<2>
{
    template<typename TSpan, typename TFunc>
    void operator()(TSpan span, TFunc& functor)
    {
        for(auto i = 0; i < span.extent(0); ++i)
        {
            auto submdspan = stdex::submdspan(span, i, stdex::full_extent);
            Iterate_mdspan_impl<TSpan::rank() - 1>{}(submdspan, functor);
        }
    }
};

template<>
struct Iterate_mdspan_impl<3>
{
    template<typename TSpan, typename TFunc>
    void operator()(TSpan span, TFunc& functor)
    {
        for(auto i = 0; i < span.extent(0); ++i)
        {
            auto submdspan = stdex::submdspan(span, i, stdex::full_extent, stdex::full_extent);
            Iterate_mdspan_impl<TSpan::rank() - 1>{}(submdspan, functor);
        }
    }
};

template<typename TSpan, typename TData>
void iota_span(TSpan span, TData index)
{
    static_assert(TSpan::rank() > 0);
    static_assert(TSpan::rank() <= 3);
    auto functor = [&index](TData input) { return index++; };
    Iterate_mdspan_impl<TSpan::rank()>{}(span, functor);
}

My problem is, that I need to specialize each dimension by hand. At the moment I stopped at dim 3. Does anybody have an idea to write a generic function to iterate over all elements of a mdspan?

Maybe it is possible to set the n stdex::full_extent arguments in the function stdex::submdspan(span, i, stdex::full_extent, ...) depending on the template parameter TDim. I'm not sure, if this is possible with some variadic or type trait.

mhoemmen commented 1 year ago

Hi @SimeonEhrig ! Bryce also has presentations that talk about the two C++23 ranges features cartesian_product and iota. You can use those to construct a range with the desired iteration order, and then iterate over the range with an algorithm like ranges::for_each.

You could also imagine specializing for the case of iterating over an mdspan's domain (which is a Cartesian product of the extents). It's an exercise (not an easy one!) in parameter pack algebra.

SimeonEhrig commented 1 year ago

I developed a solution:

#include <experimental/mdspan>
#include <iostream>

namespace stdex = std::experimental;

/**
 * @brief Construct submdspan of mdspan. The submdspan has one rank less than the mdspan. The left dimension is fixed
 * to a specific index. The rest of the dimension contains the full range.
 *
 * @tparam TRank Dimension of the new submdspan (needs to be mdspan::rank()-1).
 */
template<int TRank>
struct Construct_Submdspan;

template<>
struct Construct_Submdspan<1>
{
    template<typename TSpan, typename... Types>
    constexpr auto construct(TSpan span, std::size_t const fixed_index_pos, Types... args)
    {
        return stdex::submdspan(span, fixed_index_pos, args...);
    }
};

template<int TRank>
struct Construct_Submdspan
{
    /**
     * @brief Returns the submdspan of a mdspan, with one dimension less.
     *
     * @tparam TSpan Type of the span
     * @tparam Types needs to std::experimental::full_extent_t
     * @param span mdspan from which the submdspan is created
     * @param fixed_index_pos Index postion of the fixed dimension
     * @param args needs to std::experimental::full_extent
     * @return constexpr auto returns a stdex::submdspan
     */
    template<typename TSpan, typename... Types>
    constexpr auto construct(TSpan span, std::size_t const fixed_index_pos, Types... args)
    {
        return Construct_Submdspan<TRank - 1>{}.construct(span, fixed_index_pos, stdex::full_extent, args...);
    }
};

/**
 * @brief Returns a submdspan of mdspan. The submdspan has one rank less than the mdspan. The left dimension is fixed
 * to a specific index. The rest of the dimension contains the full range.
 *
 * @tparam TMDSpan
 * @param span mdspan from which the submdspan is created
 * @param fixed_index_pos Index postion of the fixed dimension
 * @return constexpr auto returns a stdex::submdspan
 */
template<typename TMDSpan>
constexpr auto submdspan_remove_dim(TMDSpan span, std::size_t const fixed_index_pos)
{
    constexpr auto rank = TMDSpan::rank();
    return Construct_Submdspan<rank - 1>{}.construct(span, fixed_index_pos, stdex::full_extent);
}

/**
 * @brief Iterates over all elements of an n dimension mdspan. The iteration order is from the right to the left
 * dimension.
 *
 * @tparam TDim Rank of the mdspan
 */
template<int TDim>
struct Iterate_mdspan;

template<>
struct Iterate_mdspan<1>
{
    template<typename TSpan, typename TFunc>
    void operator()(TSpan span, TFunc& functor)
    {
        for(auto i = 0; i < span.extent(0); ++i)
        {
            span(i) = functor(span(i));
        }
    }
};

template<int TDim>
struct Iterate_mdspan
{
    /**
     * @brief Iterate over all elements of an mdspan and apply the functor on it.
     *
     * @tparam TSpan type of the mdspan
     * @tparam TFunc type of the functor
     * @param span The mdspan
     * @param functor The functor
     */
    template<typename TSpan, typename TFunc>
    void operator()(TSpan span, TFunc& functor)
    {
        for(auto i = 0; i < span.extent(0); ++i)
        {
            auto submdspan = submdspan_remove_dim(span, i);
            Iterate_mdspan<TSpan::rank() - 1>{}(submdspan, functor);
        }
    }
};

/**
 * @brief Do the same like std::iota with a n-dimensional mdspan. The iteration order is from the right to the left
 * dimension.
 *
 * @tparam TSpan type of the mdspan
 * @tparam TData type of the functor
 * @param span The mdspan
 * @param index value of the first element
 */
template<typename TSpan, typename TData>
void iota_span(TSpan span, TData index)
{
    static_assert(TSpan::rank() > 0);
    auto functor = [&index](TData input) { return index++; };
    Iterate_mdspan<TSpan::rank()>{}(span, functor);
}

int main()
{
    std::array<int, 12> d;

    // stdex::mdspan m{d.data(), stdex::extents{12}};
    // stdex::mdspan m{d.data(), stdex::extents{2, 6}};
    // stdex::mdspan m{d.data(), stdex::extents{2, 4, 2}};
    stdex::mdspan m{d.data(), stdex::extents{2, 2, 1, 4}};

    iota_span(m, 42);

    for(auto const& v : d)
    {
        std::cout << v << " ";
    }
    std::cout << std::endl;

    return 0;
}

I take some times, until I understand all required rules of variadic templates and variadic parameter packs and it takes also some time to combine everything but now I have a working solution. cppinsights helped a lot to developed it. Now, I need to check, if the compiler can optimize it.

mhoemmen commented 1 year ago

Excellent work! C++23 (cartesian_product and iota range views) or the ranges-v3 library can make this a lot easier to write. Here is an example of C++20 code for iterating over the domain of an mdspan: https://godbolt.org/z/fh41E1q66

#include <array>
#include <iostream>
#include <ranges>
#include <range/v3/view/cartesian_product.hpp>
#include <range/v3/view/indices.hpp>
#include <tuple>

#include <https://raw.githubusercontent.com/kokkos/mdspan/single-header/mdspan.hpp>

namespace stdex = std::experimental;

// You can't use std::apply on templated nonmember functions,
// but you can use it on generic lambdas.  See the example here:
//
// https://en.cppreference.com/w/cpp/utility/apply
auto output_one_tuple_2 = []<class ... InputTypes>(InputTypes&& ... input) {
    // lambdas with explicit template parameters require C++20.
    auto print_all = [&]<std::size_t ... Indices>( std::index_sequence<Indices...> ) {
        auto print_one = [&] (std::size_t index, auto&& in) {
            std::cout << in;
            if(index + 1 < sizeof...(Indices)) {
                std::cout << ", ";
            }
        };
        (print_one(Indices, input), ...);
    };
    std::cout << '(';
    print_all(std::make_index_sequence<sizeof...(InputTypes)>());
    std::cout << ")\n";
};

template<class Callable, class IndexType, std::size_t ... Extents>
void for_each_in_extents(Callable&& f, stdex::extents<IndexType, Extents...> e) {
  [&]<std::size_t ... Indices>( std::index_sequence<Indices...> ) {
    auto v = ranges::views::cartesian_product(ranges::views::indices(0, e.extent(Indices))...);
    for(const auto& k : v) {
        std::apply(std::forward<Callable>(f), k);
    }
  }(std::make_index_sequence<sizeof...(Extents)>());
}

int main() {
  stdex::extents<int, 3, 4, 5> e;
  for_each_in_extents(output_one_tuple_2, e);
  return 0;
}

mhoemmen commented 1 year ago

Here's an improvement of that example, that shows how to iterate over the domain of an mdspan without needing ranges: https://godbolt.org/z/nabce8WW6

#include <https://raw.githubusercontent.com/kokkos/mdspan/single-header/mdspan.hpp>
#include <array>
#include <iostream>
#include <tuple>

namespace stdex = std::experimental;

// You can't use std::apply on templated nonmember functions,
// but you can use it on generic lambdas.  See the example here:
//
// https://en.cppreference.com/w/cpp/utility/apply
auto print_pack = []<class ... InputTypes>(InputTypes&& ... input) {
    auto print_all = [&]<std::size_t ... Indices>( std::index_sequence<Indices...> ) {
        auto print_one = [&] (std::size_t index, auto&& in) {
            std::cout << in;
            if(index + 1 < sizeof...(Indices)) {
                std::cout << ", ";
            }
        };
        (print_one(Indices, input), ...);
    };
    std::cout << '(';
    print_all(std::make_index_sequence<sizeof...(InputTypes)>());
    std::cout << ")\n";
};

// C++20 lets you write lambdas templated on <std::size_t ... Indices>.
// Calling those lambdas with the result of std::make_index_sequence
// is a good way to "iterate" over a parameter pack.
// If you don't have C++20, you can replace the lambda
// with a separate, named helper function.

// Returns a new extents object representing all but the leftmost extent of e.
template<class IndexType, std::size_t ... Extents>
auto right_extents( stdex::extents<IndexType, Extents...> e )
{
  static_assert(sizeof...(Extents) != 0);
  return [&]<std::size_t ... Indices>( std::index_sequence<Indices...> ) {
    return stdex::extents<IndexType, e.static_extent(Indices + 1)...>{
      e.extent(Indices + 1)...
    }; 
  }( std::make_index_sequence<sizeof...(Extents) - 1>() );
}

// right_extents can be implemented by overloading for
// extents<IndexType, LeftExtent, RightExtents...>.
// That approach doesn't work for left_extents.

// Returns a new extents object representing all but the rightmost extent of e.
template<class IndexType, std::size_t ... Extents>
auto left_extents( stdex::extents<IndexType, Extents...> e )
{
  static_assert(sizeof...(Extents) != 0);
  return [&]<std::size_t ... Indices>( std::index_sequence<Indices...> ) {
    return stdex::extents<IndexType, e.static_extent(Indices)...>{
      e.extent(Indices)...
    };
  }( std::make_index_sequence<sizeof...(Extents) - 1>() );
}

template<class Callable, class IndexType, std::size_t ... Extents>
void for_each_in_extents_row_major(Callable&& callable, stdex::extents<IndexType, Extents...> ext)
{
  if constexpr(ext.rank() == 0) {
    return;
  } else if constexpr(ext.rank() == 1) {
    if constexpr(ext.static_extent(0) == stdex::dynamic_extent) {
      for(IndexType index = 0; index < ext.extent(0); ++index) {
        std::forward<Callable>(callable)(index);  
      }
    } else {
      // TODO unroll this loop at compile time
      // (the compiler might do that for us anyway).
      for(IndexType index = 0; index < ext.static_extent(0); ++index) {
        std::forward<Callable>(callable)(index);  
      }
    }
  } else {
    auto right_ext = right_extents(ext);

    if constexpr(ext.static_extent(0) == stdex::dynamic_extent) {
      for(IndexType left_index = 0; left_index < ext.extent(0); ++left_index) {
        auto next = [&] (auto... right_indices) {
          std::forward<Callable>(callable)(left_index, right_indices...);
        };
        for_each_in_extents_row_major( next, right_ext );
      }
    } else {
      // TODO unroll this loop at compile time
      // (the compiler might do that for us anyway).
      for(IndexType left_index = 0; left_index < ext.static_extent(0); ++left_index) {
        auto next = [&] (auto... right_indices) {
          std::forward<Callable>(callable)(left_index, right_indices...);
        };
        for_each_in_extents_row_major( next, right_ext );
      }
    }
  }
}

// Overloading on stdex::extents<IndexType, LeftExtents..., RightExtent>
// works fine for the row major case, but not for the column major case.
template<class Callable, class IndexType, std::size_t ... Extents>
void for_each_in_extents_col_major(Callable&& callable, stdex::extents<IndexType, Extents...> ext)
{
  if constexpr(ext.rank() == 0) {
    return;
  } else if constexpr (ext.rank() == 1) {
    if constexpr(ext.static_extent(ext.rank() - 1) == stdex::dynamic_extent) {
      for(IndexType index = 0; index < ext.extent(ext.rank() - 1); ++index) {
        std::forward<Callable>(callable)(index);  
      }
    } else {
      for(IndexType index = 0; index < ext.static_extent(ext.rank() - 1); ++index) {
        std::forward<Callable>(callable)(index);  
      }
    }
  } else {
    auto left_ext = left_extents(ext);

    if constexpr(ext.static_extent(ext.rank() - 1) == stdex::dynamic_extent) {
      for(IndexType right_index = 0; right_index < ext.extent(ext.rank() - 1); ++right_index) {
        auto next = [&] (auto... left_indices) {
          std::forward<Callable>(callable)(left_indices..., right_index);
        };
        for_each_in_extents_col_major( next, left_ext );
      }
    } else {
      for(IndexType right_index = 0; right_index < ext.static_extent(ext.rank() - 1); ++right_index) {
        auto next = [&] (auto... left_indices) {
          std::forward<Callable>(callable)(left_indices..., right_index);
        };
        for_each_in_extents_col_major( next, left_ext );
      }
    }
  }
}

int main() {
  stdex::extents<int, 3, 4, 5> e;

  std::cout << "Row major:\n";
  for_each_in_extents_row_major(print_pack, e);
  std::cout << "\nColumn major:\n";
  for_each_in_extents_col_major(print_pack, e);
  return 0;
}

bernhardmgruber commented 1 year ago

Since I had to implement the same functionality in LLAMA, here is my version:

// ... includes etc.

template<typename IndexType, std::size_t ... Extents, typename Func,
         std::size_t FirstExtInd, std::size_t... RestExtInds, typename... Is>
void forEachIndex(const stdex::extents<IndexType, Extents...>& ext, Func&& func,
                  std::index_sequence<FirstExtInd, RestExtInds...>, Is... is) {
    for(IndexType i = 0; i < ext.extent(FirstExtInd); i++)
        if constexpr(sizeof...(RestExtInds) > 0)
            forEachIndex(ext, std::forward<Func>(func), std::index_sequence<RestExtInds...>{}, is..., i);
        else
            std::forward<Func>(func)(is..., i);
}

template<class IndexType, std::size_t ... Extents, typename Func>
void forEachIndex(stdex::extents<IndexType, Extents...> ext, Func&& func) {
    // for column-major, you could reverse the index_sequence
    forEachIndex(ext, std::forward<Func>(func), std::make_index_sequence<sizeof...(Extents)>{});
}

int main() {
  stdex::extents<int, 3, stdex::dynamic_extent, 5> e{4};
  forEachIndex(e, print_pack);
  return 0;
}

Godbolt: https://godbolt.org/z/6ae585YM8

The trick here is to use an index_sequence to determine in which order the dimensions of the extents should be iterated. That idea btw. came from alpaka's NdLoop. The extents object is also never modified, so you can safe yourself the utility function to pop one extent left or right.

bernhardmgruber commented 1 year ago

@SimeonEhrig here is your example of iota_span using the above snippet:

template<typename TSpan, typename TData>
void iota_span(TSpan span, TData index)
{
    forEachIndex(span.extents(), [&](auto... is){
        span(is...) = index++;
    });
}

Godbolt: https://godbolt.org/z/5hncdK1fj

SimeonEhrig commented 1 year ago

@mhoemmen and @bernhardmgruber thanks for the solutions. Looks much better, than my solution. I think, I will take @bernhardmgruber solution and extend it with the ideas of @mhoemmen solution (I cannot directly overtake it, because we cannot use C++20 because of the nvcc compiler).

mhoemmen commented 1 year ago

@SimeonEhrig I think I can do better to make my code more concise! I've been working on it when I need a break from other things. The only strictly C++20 things in my example are lambdas with named template parameters, but you've figured out how to work around.

You might also want to consider Kokkos' MDRange, an existing optimized parallel multidimensional index for-each that only requires C++14 and works well with nvcc and other compilers.

mhoemmen commented 1 year ago

@bernhardmgruber @SimeonEhrig Here's a refinement of my above example: https://godbolt.org/z/o9ecK7coj -- please see in particular for_each_in_extents_4.

SimeonEhrig commented 1 year ago

Thanks for the solution. Looks like my idea in a heavy optimized version :sweat_smile: I'm also impressed, how you can use templated lambdas in C++20 for template meta programming. I read about templated lambdas but I never thought about to use it for meta programming.

bernhardmgruber commented 1 year ago

I'm also impressed, how you can use templated lambdas in C++20 for template meta programming. I read about templated lambdas but I never thought about to use it for meta programming.

TMP is the use for lambdas with explicit template heads until we get P1061, possibly in C++26.

Edit: given that std::extents can be destructured, which it can be, but not in the way you would expect:

  stdex::extents<int, 3, stdex::dynamic_extent, 5> e{4};
  const auto [a, b, c] = e; // compile error
  const auto [b] = e; // ok, but b is now some kind of internal `storage_t`.

@mhoemmen should this be fixed? That is, should std::extents destructure to all extents (with static extends being demoted to runtime values)? I would find this useful. Definitely more useful than what currently happens :)

crtrott commented 1 year ago

This structured binding thing will not work no matter what, certainly wouldn't produce runtime values. This just makes internal members visible.

crtrott commented 1 year ago

Oh I see you could do it for custom types. I guess someone could propose a paper to do this ...

mhoemmen commented 1 year ago

@crtrott Right, one could overload the various tuple_* customization points so that get works as one might expect.

@bernhardmgruber It's a neat idea to make this work for extents. Would you be interested in writing the paper? It would have to target C++26, as the C++23 acceptance date for new features is long past.

bernhardmgruber commented 1 year ago

@crtrott and @mhoemmen Exactly, we would need specializations of std::tuple_size and std::tuple_element as well as either a member std::extents::get or a free std::get(extents).

I am interested in writing a small paper. It's a little personal wish of mine for a long time anyway :) I will try to float a draft on the mailing list in a couple of weeks.

mhoemmen commented 1 year ago

@bernhardmgruber That's excellent! We would be happy to review early drafts of the paper before you send it to the mailing list. Thanks!

yurielnf commented 1 year ago

Hello, thanks for the nice mdspan! How bad is the following code? I expect much less index computation but ...

template <class T, class IndexType, size_t E0, size_t... E, class L, class A, class Fun>
void forEach(std::mdspan<T, std::extents<IndexType, E0, E...>, L, A> x, Fun&& f) 
{
    auto full = [](size_t) { return std::full_extent; };
    for (IndexType i = 0; i < x.extent(0); i++)
        if constexpr (x.rank() > 1)
            forEach(stdex::submdspan(x, i, full(E)...), f);
        else
            std::forward<Fun>(f)(x[i]);
}

SimeonEhrig commented 1 year ago

Hi @yurielnf , my personal experience is, that constructing recursive std::submdspan is very expensive. You should avoid it. Instead you should direct built the memory access recursive: forEach(mdspan[calculate_extend_index<I>(index)...], f);

yurielnf commented 1 year ago

Thanks @SimeonEhrig. This implementation works for strided mdspan. Any idea how to generalize it?

template <size_t pos, class T, class E, class L, class A, class Fun>
void forEach_impl(std::mdspan<T, E, L, A> const& x, Fun&& f, size_t offset) {
    if constexpr (pos == x.rank())
        std::forward<Fun>(f)(x.accessor().access(x.data_handle(), offset));
    else
        for (size_t i = 0u; i < x.extent(pos); i++)
            forEach_impl<pos + 1>(x, std::forward<Fun>(f), offset + i * x.stride(pos));
}

template <class T, class E, class L, class A, class Fun>
void forEach(std::mdspan<T, E, L, A> const& x, Fun&& f) {
    static_assert(x.is_always_strided(), "valid only for strided mdspan");
    forEach_impl<0>(x, std::forward<Fun>(f), 0);
}

SimeonEhrig commented 1 year ago

Take this solution from this issue, pass a std::mdspan instead the extend and modify the std::forward. https://github.com/kokkos/mdspan/issues/202#issuecomment-1294207397

mhoemmen commented 1 year ago

@yurielnf Please feel free to use this PR as a guide: https://github.com/kokkos/mdspan/pull/247

yurielnf commented 1 year ago

Excellent! It is a beautiful code! In the case of strided mdspan, would the recursion on your lambda next be as efficient as the recursion on a size_t offset above (after unrolling the loop for compile-time extents)?

mhoemmen commented 1 year ago

@yurielnf wrote:

Excellent! It is a beautiful code!

Thanks so much! : - D

would the recursion on your lambda next be as efficient as the recursion on a size_t offset above (after unrolling the loop for compile-time extents)?

My implementation is generic; it works for any layout (if reading, or for any unique layout if writing). If you want iteration to be efficient, the best way is to specialize for the layout.

For exhaustive layouts, just skip the layout mapping and iterate directly over [0, required_span_size()).

For strided layouts where either the leftmost or rightmost extent has stride 1 (see P2642), you might get faster code by treating the stride-1 extent as a special case.

Another generic approach is to use the inverse of a canonical exhaustive layout. For example, take the layout_left::mapping with the same extents and iterate over the valid offsets [0, required_span_size()) of that mapping. In the iteration, invert the current offset to find the canonical rank()-dimensional coordinate, then use that coordinate in the original mdspan's mapping. This approach requires divmod (divisions and modulus) operations, so it's probably not as fast for strided layouts as what you're doing.

Thanks for your interest in mdspan! : - )

kokkos / mdspan

Generic iteration function for n dimension mdspan #202