mauneyc-LANL opened 11 months ago
I've made some changes to push a lot of the indexing/building functions into the object. I don't know that I love the result, which led to the elimination of some constexpr initialization (not that it couldn't be done; it just saved time to get a working code).
One question to ask is how much time `PortableMDArray` should devote to reconstructing internal layout configuration data, in particular strides. These are the offset widths associated with a particular dimensional spec, e.g.

```cpp
// ordering based on what MDArray is doing
dims = {NT, NZ, NY, NX};
strides = {NZ * NY * NX, NY * NX, NX, 1};
```
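The stride reconstruction itself is just a running product over the trailing dimensions. A minimal sketch under assumed names (`MAXDIM` and the row-major layout are from this thread; `make_strides` is a hypothetical helper, not the actual ports-of-call code):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t MAXDIM = 6; // assumed maximum rank

// Row-major strides for the first `rank` entries of `dims`:
// the last dimension has stride 1, and each earlier stride is
// the product of all extents to its right.
std::array<std::size_t, MAXDIM> make_strides(const std::array<std::size_t, MAXDIM> &dims,
                                             int rank) {
  std::array<std::size_t, MAXDIM> strides{};
  std::size_t s = 1;
  for (int i = rank - 1; i >= 0; --i) {
    strides[i] = s;
    s *= dims[i];
  }
  return strides;
}
```

With `dims = {NT, NZ, NY, NX}` this reproduces the `{NZ * NY * NX, NY * NX, NX, 1}` layout above.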
The options are:

1. At each `compute_index` call, recompute the strides. This obviously costs some FLOPS, but strides are simple to compute and GPU compute is cheap. Further, because the rank is known at this point, the stride data only needs to go up to `rank`, rather than computing/storing values up to `MAXDIM`.
2. (current commit) At each construction or reshaping, recompute the strides. This is the "natural" path but does introduce more data to move around, to a first approximation doubling the data that `PortableMDArray` needs to move between contexts. Contra to (1), this array needs `MAXDIM` storage, which is likely going to be mostly dead weight.
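The two options can be sketched side by side (illustrative helpers under assumed names, not the actual `PortableMDArray` code):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t MAXDIM = 6; // assumed maximum rank

// Option (1): no stored strides; fold the stride recomputation into the
// index calculation, walking from fastest to slowest dimension.
std::size_t index_recompute(const std::array<std::size_t, MAXDIM> &dims,
                            const std::array<std::size_t, MAXDIM> &idx, int rank) {
  std::size_t offset = 0, stride = 1;
  for (int i = rank - 1; i >= 0; --i) {
    offset += idx[i] * stride;
    stride *= dims[i]; // only entries up to `rank` are ever touched
  }
  return offset;
}

// Option (2): strides cached at construction/reshape; each access is a
// plain dot product, at the cost of carrying MAXDIM extra words per object.
std::size_t index_cached(const std::array<std::size_t, MAXDIM> &strides,
                         const std::array<std::size_t, MAXDIM> &idx, int rank) {
  std::size_t offset = 0;
  for (int i = 0; i < rank; ++i) offset += idx[i] * strides[i];
  return offset;
}
```

Both return the same flat offset; the trade is roughly `rank` extra multiplies per access in (1) versus `MAXDIM * sizeof(size_t)` extra state per object in (2).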
I don't know the answer to this a priori. I'm not married to either option. Which leads to the cleaner code? Probably 2? And do we see a performance hit in a simple test case?
It would be worth considering the context that `PortableMDArray` is used in. Does it reshape often, does it get moved on/off device frequently, etc.? (2.) would be "best" for host, since it avoids recomputation, but (1.) may be better for device since it would minimize data transfer (though it's also only about 48 bytes extra, so probably over-optimizing here).
I've included a simple test with the PR. As there originally wasn't very much being tested directly with `poc`, we may need to add some more for checking these details. I glanced over the `spiner` tests for ideas, but most of what is there looks more numerics-based.
Yeah, fair questions:

> Does it reshape often?

No, almost never.

> does it get moved on/off device frequently

Also no.

> (2.) would be "best" for host, since it avoids recomputation, but (1.) may be better for device since it would minimize data transfer (tho also it's like 48 bytes extra so probably over-optimizing here).

Like you said, 48 bytes is basically nothing, so I lean towards (2).
Previously, it seems we recalculated at every access. The number of adds will be the same, and the number of multiplies saved will only be noticeable with high-rank arrays. Performance data for ranks 2-6 would be helpful here.
The more I think about this, the more I think we should recompute the strides. Even so, I'd like to see performance differences. Integer operations can be noticeable on GPUs, so pre-computed strides may be more performant, but a larger object may use more registers. There are a lot of competing factors, so data is the best guide.
Now that the singularity-eos 1.9.0 release has been cut, I think we should move up to C++17 and get this merged in prep for the next release.
I am in favor of moving to C++17 at this time. However, I would prefer to decouple the shift to 17 from this MR. Let's just update the cmake builds to require 17 and then merge other MRs when they are ready.
@jonahm-LANL @dholladay00 I've pushed some benchmarks using `catch2`; right now they just do index computation through contiguous memory, `(i, j, k)` to `n`. I haven't tested this on device yet, but I would be interested to see the results. I intend to add the same benchmarks on random (rather than contiguous) memory access.
I can take this off Draft and move forward with review if you want to get this in. There are things I'd like to modify and add, but there's a danger of code-creep and I don't want to hold things up; the PR as-is should be ready to go.
If C++17 gets in before this is merged, then I do want to do another draft - it will make things cleaner and clearer.
@mauneyc-LANL let us know when this is ready for re-review
PR Summary
This is a prototype for "templated MD", that is, reducing repeated per-dimension code to container-based equivalents. The single-value dimension variables `int nxN_` have been replaced with a statically sized container, `std::array<size_t, MAXDIM> nxs_`. Also added a `PortableMDArray::rank_` member to avoid some reevaluations, though I'm not sure it's necessary.

Currently WIP, so documentation is wanting and a few function names/namespaces are silly/inconsistent/gibberish. I know there can be some reluctance about this coding style, so I wanted to get an "at least compiles" version up and gauge interest in iterating on it. (It actually passes a few tests! But I haven't tried `spiner` with this yet.)

It was working fine, why change it?
What are the downsides?
Mainly that this must stay within C++14 (no `std::apply()`, no `std::tuple`-based unpacking, no fold expressions). C++14 can be "recursion heavy" in this regard, which may degrade performance.

Any suggestions, comments, questions welcome! @jonahm-LANL @chadmeyer @dholladay00 @jhp-lanl @jdolence
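As a small illustration of the C++14-vs-C++17 point (a generic sketch, not the PR's code): computing a product over a variadic pack of extents needs recursion in C++14, while C++17 collapses it into a single fold expression.

```cpp
#include <cstddef>

// C++14 style: terminate-and-recurse over the parameter pack.
inline std::size_t product() { return 1; }
template <typename... Ts>
std::size_t product(std::size_t head, Ts... tail) {
  return head * product(tail...);
}

// C++17 style: the same computation as a fold expression.
template <typename... Ts>
std::size_t product17(Ts... extents) {
  return (std::size_t{1} * ... * extents);
}
```

Compilers usually flatten both, but the recursive form leans harder on inlining, and the fold form is shorter to read and instantiate.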
PR Checklist