lanl / ports-of-call

Performance Portability Utilities
https://lanl.github.io/ports-of-call/
BSD 3-Clause "New" or "Revised" License

Draft: Templated MD arrays #39

Open mauneyc-LANL opened 11 months ago

mauneyc-LANL commented 11 months ago

PR Summary

This is a prototype for "templated MD" arrays, that is, reducing code of the form

```cpp
PORTABLE_FUNCTION PortableMDArray(T *data, int nx1) noexcept
    : pdata_(data), nx1_(nx1), nx2_(1), nx3_(1), nx4_(1), nx5_(1), nx6_(1) {}
PORTABLE_FUNCTION
PortableMDArray(T *data, int nx2, int nx1) noexcept
    : pdata_(data), nx1_(nx1), nx2_(nx2), nx3_(1), nx4_(1), nx5_(1), nx6_(1) {}
PORTABLE_FUNCTION
PortableMDArray(T *data, int nx3, int nx2, int nx1) noexcept
    : pdata_(data), nx1_(nx1), nx2_(nx2), nx3_(nx3), nx4_(1), nx5_(1),
      nx6_(1) {}
```

To

```cpp
template <typename... NXs>
PORTABLE_FUNCTION PortableMDArray(T *p, NXs... nxs) noexcept
    : PortableMDArray(p, nx_arr(nxs...), sizeof...(NXs)) {}
```

The single-value dimension variables `int nxN_` have been replaced with a statically sized container, `std::array<size_t, MAXDIM> nxs_`. I also added a `PortableMDArray::rank_` member to avoid some reevaluations, though I'm not sure it's necessary.
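
For illustration, a helper in the spirit of `nx_arr` might pack the variadic dimensions into the fixed-size array, padding unused trailing entries with 1 just as the old `nxN_(1)` defaults did. This is a minimal sketch of the idea, not the PR's actual implementation; the dimension ordering and the delegated-to constructor are assumptions:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t MAXDIM = 6;

// Sketch: pack a variadic dimension list into a fixed-size array, padding
// unused trailing entries with 1. Assumes at least one dimension is given.
template <typename... NXs>
std::array<std::size_t, MAXDIM> nx_arr(NXs... nxs) {
  static_assert(sizeof...(NXs) >= 1 && sizeof...(NXs) <= MAXDIM,
                "rank must be between 1 and MAXDIM");
  std::array<std::size_t, MAXDIM> out;
  out.fill(1);
  std::size_t i = 0;
  for (std::size_t v : {static_cast<std::size_t>(nxs)...}) {
    out[i++] = v;  // expand the pack in declaration order
  }
  return out;
}
```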

Currently WIP, so documentation is wanting and a few function names/namespaces are silly/inconsistent/gibberish. I know there can be some reluctance about this coding style, so I wanted to get an "at least compiles" version up and gauge interest in iterating on it. (Actually passes a few tests! But I haven't tried spiner with this yet.)

It was working fine, why change it?

What are the downsides?

Any suggestions, comments, questions welcome! @jonahm-LANL @chadmeyer @dholladay00 @jhp-lanl @jdolence

PR Checklist

mauneyc-LANL commented 11 months ago

I've made some changes to push a lot of the indexing/building functions into the object. I don't know that I love the result, which led to the elimination of some constexpr initialization (not that it couldn't be done, it just saved time to get a working code up).

One question to ask is how much time `PortableMDArray` should devote to reconstructing internal layout configuration data - in particular, strides. These are the offset widths associated with a particular dimensional spec, e.g.

```cpp
// ordering based on what MDArray is doing
dims = {NT, NZ, NY, NX};
strides = {NZ * NY * NX, NY * NX, NX, 1};
```
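
Concretely, each stride is the product of all faster-varying dims after it, with the last (fastest) dimension getting stride 1. A minimal sketch of that computation, with hypothetical names (not the PR's code):

```cpp
#include <array>
#include <cstddef>

// Sketch: strides[i] = dims[i+1] * dims[i+2] * ... * dims[rank-1],
// so for dims = {NT, NZ, NY, NX} this yields {NZ*NY*NX, NY*NX, NX, 1}.
// Assumes 1 <= rank <= MAXDIM.
template <std::size_t MAXDIM>
std::array<std::size_t, MAXDIM>
compute_strides(const std::array<std::size_t, MAXDIM> &dims,
                std::size_t rank) {
  std::array<std::size_t, MAXDIM> strides;
  strides.fill(1);
  for (std::size_t i = rank - 1; i-- > 0;) {
    strides[i] = strides[i + 1] * dims[i + 1];
  }
  return strides;
}
```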

The options are:

  1. At each `compute_index` call, recompute the strides. This obviously costs some FLOPS, but strides are simple to compute and GPU compute is cheap. Further, because the rank is known at this point, the stride data only needs to go up to `rank`, rather than computing/storing values up to `MAXDIM`.
  2. (current commit) At each construction or reshaping, recompute the strides. This is the "natural" path, but it does introduce more data to move around, to a first approximation doubling the data that `PortableMDArray` needs to move between contexts. In contrast to (1), this array needs `MAXDIM` storage, which is likely going to be mostly dead weight. (Both access paths are sketched below.)
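
To make the trade concrete, here are hedged sketches of what each access path might look like for the rank-4 example above; these are my illustrations, not the PR's `compute_index`:

```cpp
#include <array>
#include <cstddef>

// Option 1 flavor: rebuild the flattening on the fly in Horner form, using
// only the dims; nothing extra is stored on or moved with the object.
inline std::size_t index_on_the_fly(const std::array<std::size_t, 4> &dims,
                                    std::size_t t, std::size_t z,
                                    std::size_t y, std::size_t x) {
  return ((t * dims[1] + z) * dims[2] + y) * dims[3] + x;
}

// Option 2 flavor: dot product against strides precomputed at construction
// or reshape (e.g. by a compute_strides helper), costing MAXDIM extra words
// of state per object.
inline std::size_t
index_with_strides(const std::array<std::size_t, 4> &strides, std::size_t t,
                   std::size_t z, std::size_t y, std::size_t x) {
  return t * strides[0] + z * strides[1] + y * strides[2] + x * strides[3];
}
```
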
Yurlungur commented 11 months ago

> I've made some changes to push a lot of the indexing/building functions into the object. I don't know that I love the result, which led to the elimination of some constexpr initialization (not that it couldn't be done, it just saved time to get a working code up).
>
> One question to ask is how much time `PortableMDArray` should devote to reconstructing internal layout configuration data - in particular, strides. These are the offset widths associated with a particular dimensional spec, e.g.
>
> ```cpp
> // ordering based on what MDArray is doing
> dims = {NT, NZ, NY, NX};
> strides = {NZ * NY * NX, NY * NX, NX, 1};
> ```
>
> The options are:
>
> 1. At each `compute_index` call, recompute the strides. This obviously costs some FLOPS, but strides are simple to compute and GPU compute is cheap. Further, because the rank is known at this point, the stride data only needs to go up to `rank`, rather than computing/storing values up to `MAXDIM`.
>
> 2. (current commit) At each construction or reshaping, recompute the strides. This is the "natural" path, but it does introduce more data to move around, to a first approximation doubling the data that `PortableMDArray` needs to move between contexts. In contrast to (1), this array needs `MAXDIM` storage, which is likely going to be mostly dead weight.

I don't know the answer to this a priori. I'm not married to either option. Which leads to the cleaner code? Probably 2? And do we see a performance hit in a simple test case?

mauneyc-LANL commented 11 months ago

> I don't know the answer to this a priori. I'm not married to either option. Which leads to the cleaner code? Probably 2? And do we see a performance hit in a simple test case?

It would be worth considering the context that `PortableMDArray` is used in. Does it reshape often, does it get moved on/off device frequently, etc.? (2) would be "best" for host, since it avoids recomputation, but (1) may be better for device, since it would minimize data transfer (though it's also only ~48 bytes extra - six 8-byte `size_t` entries - so probably over-optimizing here).

I've included a simple test with the PR. Since originally there wasn't very much being tested directly in poc, we may need to add more tests for checking these details. I glanced over the spiner tests for ideas, but most of what is there looks more numerics-based.

Yurlungur commented 11 months ago

Yeah, fair questions:

> Does it reshape often?

No, almost never.

> Does it get moved on/off device frequently?

Also no.

> (2) would be "best" for host, since it avoids recomputation, but (1) may be better for device, since it would minimize data transfer (though it's also only ~48 bytes extra, so probably over-optimizing here).

Like you said, 48 bytes is basically nothing, so I lean towards (2).

dholladay00 commented 11 months ago

Previously we recalculated at every access, it seems. The number of adds will be the same, and the number of multiplies saved will only be noticeable with high-rank arrays. Performance data for ranks 2-6 would be helpful in this case.
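
As a rough back-of-envelope count (my own illustration, for a rank-4 access with dims `{NT, NZ, NY, NX}`):

```cpp
// Precomputed strides:       n = t*s0 + z*s1 + y*s2 + x*s3;
//   -> 4 multiplies (3 if the unit stride is folded away), 3 adds.
//
// Recomputed at each access: first rebuild the suffix products,
//   s2 = NX;  s1 = NY * NX;  s0 = NZ * NY * NX;  // ~2 extra multiplies
// then the same dot product. The adds are identical either way, and the
// extra multiplies grow only linearly with rank, so the savings from
// precomputing should show up mainly for high-rank arrays.
```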

dholladay00 commented 10 months ago

The more I think about this, the more I think we should recompute the strides. Even so, I'd like to see performance differences. Integer operations can be noticeable on GPUs, so precomputed strides may be more performant, but a larger object may use more registers. There are a lot of competing factors, so data is the best guide.

dholladay00 commented 1 month ago

Now that the singularity-eos 1.9.0 release has been cut, I think we should bump to C++17 and get this merged in prep for the next release.

Yurlungur commented 1 month ago

> Now that the singularity-eos 1.9.0 release has been cut, I think we should bump to C++17 and get this merged in prep for the next release.

I am in favor of moving to C++17 at this time. However, I would prefer to decouple the shift to 17 from this PR. Let's just update the CMake builds to require 17 and then merge other PRs when they are ready.

cmauney commented 3 weeks ago

@jonahm-LANL @dholladay00 I've pushed some benchmarks using Catch2; right now they just do index computation through contiguous memory - (i, j, k) to n. I haven't tested this on device yet, but I would be interested to see the results. I intend to add the same benchmarks over random (rather than contiguous) memory.
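
For context, a minimal sketch of what such a Catch2 benchmark can look like; this is my illustration with hypothetical sizes, assuming Catch2 v2's benchmarking support and the `portable_arrays.hpp` header, not the PR's actual test:

```cpp
#define CATCH_CONFIG_ENABLE_BENCHMARKING
#include <catch2/catch.hpp>

#include <ports-of-call/portable_arrays.hpp>

#include <vector>

TEST_CASE("flat index computation", "[PortableMDArray][!benchmark]") {
  constexpr int NZ = 32, NY = 32, NX = 32;
  std::vector<double> data(NZ * NY * NX, 0.0);
  PortableMDArray<double> a(data.data(), NZ, NY, NX);

  BENCHMARK("contiguous (i, j, k) -> n sweep") {
    double sum = 0.0;
    for (int k = 0; k < NZ; ++k)
      for (int j = 0; j < NY; ++j)
        for (int i = 0; i < NX; ++i)
          sum += a(k, j, i);  // each access exercises compute_index
    return sum;  // returned so the work isn't optimized away
  };
}
```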

I can take this off Draft and move forward with review if you want to get this in. There are things I'd like to modify and add, but there's a danger of scope creep and I don't want to hold things up; the PR as-is should be ready to go.

cmauney commented 3 weeks ago

If C++17 gets in before this is merged, then I do want to do another draft - it will make things cleaner and clearer.

Yurlungur commented 2 days ago

@mauneyc-LANL let us know when this is ready for re-review