parthenon-hpc-lab / parthenon

Parthenon AMR infrastructure
https://parthenon-hpc-lab.github.io/parthenon/

GPU friendly ContainerIterator #90

Closed jdolence closed 4 years ago

jdolence commented 4 years ago

The current iterator does not return something that is usable on the GPU. It's not entirely clear what should be returned. A view of views? Some other Kokkos container? Anyone have thoughts?

@Yurlungur @forrestglines @dholladay00 perhaps this is obvious to you?

forrestglines commented 4 years ago

Definitely not a View of views, as that seems to require UVM memory and would impose unacceptable performance hits.

It might be ok to have a lightweight array of pointers to CUDA memory. Accessing data would involve first getting which array the variable lives in, then accessing the array. As the loops are currently written, this would mean twice the number of lookups. We could mitigate this by rearranging the loops so that each GPU thread does multiple points instead of just one so that the variable array could be cached. However, that would probably reduce parallelization to the point that we couldn't keep the GPU occupied.

My original idea would be that all data in a meshblock (or maybe all Cell, Edge, and Face centered in their own arrays) would be contained in a single Kokkos-View. Iterating over all Variables in a Container would just involve iterating over one more dimension, as we do in K-Athena. If the Variables weren't contiguous, we could also potentially pass a small array of integers in parameter memory (ideally not a View).

It would be even better if all data from all meshblocks in flight could live in a single data structure so that kernels could operate on multiple meshblocks. Although it might be faster, we decided against that so that we could keep the original code structure. We're using streams to make up for that lost parallelization, but I wouldn't count on streams saving us beyond how we're already testing the high kernel count.

A lot of this depends on how the data itself is arranged within the container. For optimal performance it will be essential to get that right.

forrestglines commented 4 years ago

I'm confused about what the ContainerIterator is for. Is this to iterate over all variables in a Container? I'm also not convinced an iterator is needed at all. Not only would the iterator need to return GPU-usable data structures, but that iterator would also need to exist on the device and be lightweight enough to not impose performance hits. This might be difficult to get right.

dholladay00 commented 4 years ago

... as that seems to require UVM memory and would impose unacceptable performance hits.

I believe this can be implemented without UVM. That being said, it's not clear that a View of Views is the way to go. If this is the only reason to avoid it, then perhaps it's still worth looking at.

pgrete commented 4 years ago

Is the ContainerIterator ever used somewhere in a kernel? Otherwise, this could be a high level "wrapper" that translates to nested parallelism on the extracted (Container) Variables.

pgrete commented 4 years ago

Regarding the View of Views @forrestglines is probably referring to https://github.com/kokkos/kokkos/wiki/View#623-can-i-make-a-view-of-views

jdolence commented 4 years ago

So the ContainerIterator was/is never meant to be used in a kernel. It's supposed to be a high level utility that produces an iterable list of variables from a container that match a set of Metadata flags. You can see examples of its usage in Update.cpp. Example in words: I want to loop over all conserved/independent variables and update them with fluxes. How do I do that in general? Our solution was to produce an iterable object (a std::vector in the current implementation) that has the appropriate variables in it. Something conceptually similar to this is essential, I think, to this physics-agnostic infrastructure we're trying to develop.

One thought is that this doesn't need to be an object. Some kind of factory function that takes a container and list of Metadata flags and produces a list of shallow copied ParArrayNDs would be fine. But I don't know what that "list" actually is.
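For concreteness, a rough sketch of what I mean (the accessor names GetCellVariableVector, metadata(), and IsSet here are placeholders, not necessarily the real interface):

std::vector<ParArrayND<Real>> VarsMatching(Container<Real> &rc,
                                           const std::vector<MetadataFlag> &flags) {
  // Collect shallow copies of every variable whose Metadata has all the
  // requested flags set. The copies share memory with the originals.
  std::vector<ParArrayND<Real>> out;
  for (auto &var : rc.GetCellVariableVector()) {   // placeholder accessor
    bool match = true;
    for (auto f : flags) match = match && var->metadata().IsSet(f);
    if (match) out.push_back(var->data);           // shallow copy of the view
  }
  return out;
}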

forrestglines commented 4 years ago

One idea I had regarding memory layout for containers with contiguous variables and non-contiguous variables owned by multiple containers: Instead of an iterator over variables we could provide an executor (like a replacement for par_for) that would execute a kernel over all variables in the container. The actual looping could be handled differently depending on the memory layout of the variables.

forrestglines commented 4 years ago

So the ContainerIterator was/is never meant to be used in a kernel. It's supposed to be a high level utility that produces an iterable list of variables from a container that match a set of Metadata flags.

One problem though is that this multiplies the number of kernel launches by the number of variables compared to K-Athena. We were already finding this to be a performance issue for moving forward with AMR which is why we were moving to streams.

jdolence commented 4 years ago

Yeah, that would be a no-go, I think. The current design would mean we'd have to do that. I'm asking for us to come up with a different design that doesn't require it. The loop over variables just becomes the outer loop in a kernel.

forrestglines commented 4 years ago

Some kind of factory function that takes a container and list of Metadata flags and produces a list of shallow copied ParArrayNDs would be fine. But I don't know what that "list" actually is.

If this were just CUDA, I would make an array of pointers for each array, but ensure that the array is in CUDA constant memory and not CUDA global memory. In Kokkos, I think that would be an array of subviews that specifically lives in constant memory. I'm not sure right now how to do that with Kokkos or if it's possible.
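In raw CUDA it would look something like this (a sketch only, with MAX_VARS and the kernel body as placeholders):

#include <cuda_runtime.h>

constexpr int MAX_VARS = 64;
// The first level of indirection lives in constant memory, so reading
// var_ptrs[v] hits the constant cache instead of global memory.
__constant__ double *var_ptrs[MAX_VARS];

__global__ void scale_all(double fac, int ncells) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  const int v = blockIdx.y;                 // one grid row per variable
  if (i < ncells) var_ptrs[v][i] *= fac;
}

void set_var_ptrs(double **host_array_of_device_ptrs, int nvars) {
  // Copy the (host-side) array of device pointers into constant memory.
  cudaMemcpyToSymbol(var_ptrs, host_array_of_device_ptrs,
                     nvars * sizeof(double *));
}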

forrestglines commented 4 years ago

These are my current thoughts, opinions, and ideas on the issue collected into one place.

Performance Constraints

If our goal is to match performance with K-Athena, then there are a couple implementation designs which could hinder performance.

Acceptable Performance Tradeoffs

I think we can add some additional complexity to the kernels without losing performance.

- Additional integer arithmetic is ok.
- Some additional parameter/constant memory usage is ok. If the additional structures we need for iterators/array magic are small, they probably won't impact the performance of larger kernels.
- Additional parameter/constant memory accesses are ok. Accessing constant memory is much faster than accessing global memory. For an array of arrays, if the first layer existed in constant memory, then the first memory access probably wouldn't be burdensome.

Potential Approaches

I have a couple of ideas on how to resolve performance issues for looping across all variables.

- Place variables in one large contiguous view: If all variables in the container are placed in one contiguous view with the variable number as the slowest index, we can loop over variables as a kernel dimension, like in K-Athena. This would be the most performant solution. It would look like

par_for( 0, N, 0, L, ks, ke, js, je, is, ie,
  KOKKOS_LAMBDA (const int n, const int l, const int k, const int j, const int i){
    qout(n,l, k, j, i) = qin(n,l, k, j, i) + dt * dudt(n,l, k, j, i);
});

- Place variables in one large view, use integer array in parameter memory to index: If all variables that need to be looped over were in one view but potentially not contiguous, they could still be accessed in one global memory access if the list of variable indices is in CUDA constant memory. I'm not sure how to get that array into constant memory with Kokkos though. Performance-wise, this might be ok.

par_for( 0, N, 0, L, ks, ke, js, je, is, ie,
  KOKKOS_LAMBDA (const int n, const int l, const int k, const int j, const int i){
    const int var = vars[n]; //vars is in constant memory (somehow)
    qout(var,l, k, j, i) = qin(var,l, k, j, i) + dt * dudt(var,l, k, j, i);
});

- Variables go in separate views, manipulate pointer data of one subview in constant memory: If we couldn't make variables exist in one view, we could potentially get around having a view of views by exploiting the fact that all variable views should have the same layout. We could pass one subview to the kernel in constant memory (via the lambda capture) and then change the underlying data pointer in the subview to the data pointer of the variable view we need. The array of pointers would need to exist in constant memory to improve performance. I don't know if this approach is doable with Kokkos.

ParArray4D dummy; // Make a dummy view/subview, somehow
par_for( 0, N, 0, L, ks, ke, js, je, is, ie,
  KOKKOS_LAMBDA (const int n, const int l, const int k, const int j, const int i){
    // Reuse the dummy's layout but repoint its data to variable n's memory
    auto qout = dummy;
    qout.ptr_ = qout_ptrs[n];
    auto qin = dummy;
    qin.ptr_ = qin_ptrs[n];
    auto dudt = dummy;
    dudt.ptr_ = dudt_ptrs[n];

    qout(l, k, j, i) = qin(l, k, j, i) + dt * dudt(l, k, j, i);
});

- Create a Container::for_each for flexible looping implementations: This isn't a solution on its own, but it could allow flexible iteration patterns over variables depending on their layout at runtime.

For example, if variables in containers are always in the same view but are not always contiguous, we would want to use the first looping approach when they are contiguous and the second looping approach when they aren't. The container could determine from its metadata at runtime which implementation is needed. This would look like:

class Container {
  template <typename Functor>
  void for_each(std::vector<int> var_list, int L, int ks, int ke, int js, int je, int is, int ie, Functor func) {
    const int N = var_list.size(); // number of variables to loop over
    if ( /* var_list is contiguous */ ) {
      // Use the fastest approach, no constant memory accesses
      par_for( 0, N, 0, L, ks, ke, js, je, is, ie,
        KOKKOS_LAMBDA (const int n, const int l, const int k, const int j, const int i){
          func(n, l, k, j, i);
      });
    } else {
      par_for( 0, N, 0, L, ks, ke, js, je, is, ie,
        KOKKOS_LAMBDA (const int n, const int l, const int k, const int j, const int i){
          const int var = vars[n]; // vars is a copy of var_list in constant memory (somehow)
          func(var, l, k, j, i);
      });
    }
  }
};

which would be used like

qout.for_each( {0,1,2,3,4}, L, ks, ke, js, je, is, ie,
  KOKKOS_LAMBDA (const int var, const int l, const int k, const int j, const int i){
    qout(var, l, k, j, i) = qin(var, l, k, j, i) + dt * dudt(var, l, k, j, i);
});

Some details would need to be worked out - for_each might make more sense as a friend function, how do we make a list of ints for vars in constant memory, etc.

Key Ideas

I think performance optimization will be easiest if we put all variables in a container into a single view. It would be even better if looping were only supported over contiguous variables, or just all variables in a container. If executing a kernel across non-contiguous variables is needed somewhere, it might be better to have the kernel reference those views explicitly instead of through a list.

jdolence commented 4 years ago

Thanks @forrestglines. I've added a test as we discussed. You'll see that it currently launches a kernel for each variable, which is at least a (probably poorly performing) place to start. There are a couple things worth highlighting:

  • They don't share a view. They are all distinct memory blocks.

  • The ContainerIterator pulls out half of the variables. The variables were added to the Container such that every other variable that gets added is pulled out, meaning they aren't (and can't always be) "next to" each other by any reasonable definition.

The varying shapes add another real twist to this, I think.

jdolence commented 4 years ago

anyone know anything about this?

https://github.com/kokkos/kokkos/blob/master/containers/src/Kokkos_Vector.hpp

dholladay00 commented 4 years ago

anyone know anything about this?

Last I looked it did not have much compatibility with an actual std::vector and I didn't see much benefit of using it over a view, but perhaps things are different in 3.0.

pgrete commented 4 years ago

Haven't seen/used this so far, but this reminds me that the question discussed here is probably also a good one to ask on the Kokkos Slack (or the Kokkos developers directly). I bet they have some ideas about this, as this may be a more common use case/request.

Yurlungur commented 4 years ago

Thanks for the detailed thoughts @forrestglines. I'm sorry I haven't participated much in this discussion, as I haven't been feeling well. One option that occurs to me, which I don't know how possible it is, would be a single view with a tiled layout that points to non-contiguous points in memory.

Really a view of views is what we want, if we can somehow make it performant. I think we should ask the Kokkos devs for advice on this one. I can make time to engage with them next week if nobody else does. I can start with the container iterator test that @jdolence wrote.

jdolence commented 4 years ago

Just to expand on my comment about varying shapes, here's what I meant to point out. So the variable rank arrays would suggest writing something like

for v in vars
  for m < v.GetDim(4)
    for k < v.GetDim(3)
      for j < v.GetDim(2)
        for i < v.GetDim(1)
          v(m,k,j,i) = ...

Dims 1, 2, and 3 will presumably always be fixed, but Dim 4 won't be. So how is this parallelized most effectively with Kokkos?
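For concreteness, one pattern I could imagine (purely a sketch, not something we've settled on) is hierarchical parallelism, with the league running over variables and each team flattening that variable's remaining extents; vars and the update body below are placeholders:

using team_policy = Kokkos::TeamPolicy<>;
Kokkos::parallel_for("update_all_vars", team_policy(nvars, Kokkos::AUTO),
  KOKKOS_LAMBDA(const team_policy::member_type &team) {
    const int v = team.league_rank();
    auto q = vars(v); // per-variable view (placeholder collection)
    const int M = q.GetDim(4), K = q.GetDim(3), J = q.GetDim(2), I = q.GetDim(1);
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, M * K * J * I),
      [&](const int flat) {
        // recover (m,k,j,i) from the flattened index
        const int m = flat / (K * J * I);
        const int k = (flat / (J * I)) % K;
        const int j = (flat / I) % J;
        const int i = flat % I;
        q(m, k, j, i) += 1.0; // placeholder update
      });
  });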

forrestglines commented 4 years ago

Thanks @forrestglines. I've added a test as we discussed. You'll see that it currently launches a kernel for each variable, which is at least a (probably poorly performing) place to start.

Thanks, I'll take a look at it and use it as a starting point.

  • They don't share a view. They are all distinct memory blocks.

I think this might be the most important piece to change. I don't see anything yet in the current interface that would prevent us from shoving everything into one array.

  • The ContainerIterator pulls out half of the variables. The variables were added to the Container such that every other variable that gets added is pulled out, meaning they aren't (and can't always be) "next to" each other by any reasonable definition.

How often do we need to iterate over every other variable? Most of the time (boundary comms, averaging, updating) it seems that we iterate over every variable. It might be acceptable from a performance standpoint to only optimize iterating over all variables and use the current pattern for anything else.

The varying shapes add another real twist to this, I think.

I'm thinking of just flattening that dimension in the array, assuming a LayoutRight and optimizing for that. We could specialize the solution so that we don't break other layouts, even if they aren't optimized for the GPUs.

forrestglines commented 4 years ago

Thanks for the detailed thoughts @forrestglines. I'm sorry I haven't participated much in this discussion, as I haven't been feeling well. One option that occurs to me, which I don't know how possible it is, would be a single view with a tiled layout that points to non-contiguous points in memory.

No worries! I really don't have much experience with tiled layouts, but I think they could be useful if the variables were non-contiguous in a single View but strided evenly. So our every other variable example would work.

Really a view of views is what we want, if we can somehow make it performant.

When it comes down to it, any kind of view of views solution necessarily means two array accesses instead of one, so it will never be as performant as a single view solution. The key detail is which memory space that extra array access comes from; the performance difference might be negligible.

I think we should ask the Kokkos devs for advice on this one. I can make time to engage with them next week if nobody else does. I can start with the container iterator test that @jdolence wrote.

Since I've had the most to say, maybe I should take the initiative on that or at least be involved. I can do that by Monday. Anyone's help is welcome though, I can't work on this full time.

forrestglines commented 4 years ago
for v in vars
  for m < v.GetDim(4)

I'm thinking of internally flattening these two ranks into one like we effectively do in K-Athena. Outside of the Container these two ranks would be separated as normal, but at memory allocation and within the for_each, these would be merged.
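Roughly, the bookkeeping would look like this (assuming the per-variable 4th-rank extents are known when the Container allocates):

// Host-side setup: offsets[v] is where variable v starts along the merged rank.
std::vector<int> offsets(nvars + 1, 0);
for (int v = 0; v < nvars; v++)
  offsets[v + 1] = offsets[v] + vars[v].GetDim(4);
const int N = offsets[nvars]; // total extent of the merged (v,m) rank

// Inside a kernel, a merged index n maps back to (v, m) with a small search:
//   int v = 0;
//   while (n >= offsets[v + 1]) v++;  // offsets would need to live on device
//   int m = n - offsets[v];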

forrestglines commented 4 years ago

I've added a performance test in #105 for iterating over variables in a container compared to the solution we used in Athena++. For the idealized example I test, there's a 5x slowdown with the current Container. The slowdown will be less severe for more complex kernels and more severe for boundary packing/unpacking functions.

jdolence commented 4 years ago

Thanks @forrestglines! Actually, for the kernel you're testing, I'm surprised it's not worse. Anyway, 5x is bad enough and I'm glad we've got this test in place so we can start to work on improving things with real performance numbers to reference.

forrestglines commented 4 years ago

Thanks @forrestglines! Actually, for the kernel you're testing, I'm surprised it's not worse.

For the number of kernels that I launch, 5 kernels with Containers and 1 kernel with one array, the 5x slowdown is exactly what we should have expected.

If the Kokkos team doesn't have any suggestions for a performant array of views, I'm hoping that the flexibility of the Container interface will allow us to re-implement its internals with something fast for GPUs. With any luck, we won't have to give up any flexibility for a fast GPU solution.

jdolence commented 4 years ago

Oh fair enough. Didn't remember that there were 5 variables and so 5 kernel launches in the test. Ok so we're just measuring the cost of kernel launch, basically. Once we find a means -- any means -- of launching a single kernel with variables pulled from a Container, things should start to get more interesting.

pgrete commented 4 years ago

I just stumbled across this: https://github.com/google/benchmark Based on a quick first look, that may be something worth integrating in the long run, e.g., moving the performance tests (Release builds) to this framework rather than the unit tests.

AndrewGaspar commented 4 years ago

Yeah, I mentioned this on Matrix. Google Benchmark is a better solution for writing benchmarks in general because it will execute your benchmark many times in order to shake out run-to-run variability. This has the side effect, with mostly benefits but some small downsides, of also shaking out one-time initialization costs you might pay for running a loop exactly once. Think: Initializing OpenMP threads, first-run initialization of a kernel for the GPU, etc.

forrestglines commented 4 years ago

I'm wondering how well Google Benchmark interfaces with GPU testing. You would need to make sure that Google Benchmark does not force host-device synchronization after every kernel run, or else that latency will contaminate the kernel timing. I'm not sure if that's even possible with the adaptive approach they use, unless the timing resets after each iteration of their adaptive sampling.

AndrewGaspar commented 4 years ago

unless the timing resets after each iteration of their adaptive sampling.

I believe that's the case. I'd need to look at it some more to be sure, but I think they pick an iteration count and run that number of times. So you could at least mask latency over that stretch of executions. But you'd need some way to fence at the end of the iteration stretch.
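Something like manual timing might work for that (a sketch; launch_kernel() is a stand-in for whatever is being measured):

#include <benchmark/benchmark.h>
#include <Kokkos_Core.hpp>
#include <chrono>

static void BM_Update(benchmark::State &state) {
  const int batch = 100; // kernel launches per measured stretch
  for (auto _ : state) {
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int b = 0; b < batch; b++) launch_kernel(); // stand-in
    Kokkos::fence(); // one host-device sync per stretch, not per launch
    auto t1 = std::chrono::high_resolution_clock::now();
    state.SetIterationTime(std::chrono::duration<double>(t1 - t0).count() / batch);
  }
}
BENCHMARK(BM_Update)->UseManualTime();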

Yurlungur commented 4 years ago

Since I've had the most to say, maybe I should take the initiative on that or at least be involved. I can do that by Monday. Anyone's help is welcome though, I can't work on this full time.

Makes sense to me. (As you probably noticed, I was out of action all week anyway...)

I think this might be the most important piece to change. I don't see anything yet in the current interface that would prevent us from shoving everything into one array.

How do we shove everything into a single array, with the interface that allows individual packages to add their own physics variables to state? Especially since the variables are different rank and shape, the high-dimensional array that owns all of them would be ragged. This raggedness is why I keep mentioning array of views or view of views. I think that's the use case where such a data structure is ideal.

I suppose one option is we could own our own big pool of contiguous memory and manage it ourselves. That seems painful, though.

When it comes down to it, any kind of view of views solution necessarily means two array accesses instead of one so that it will never be as performant a single view solution. The key detail is from what memory space that extra array access happens, the performance difference might be negligible.

You're right of course. I think we need to get a better understanding of this. i.e., can we make the cost negligible?

forrestglines commented 4 years ago

I did a rough proof of concept View of views performance test in #105. Performance tested but not validated for correctness. It suffers from some loss of parallelization in the 4th dimension in order to handle ParArrayNDs that do not use the 4th rank. Each thread makes two accesses to global memory to get to data, the first access being heavier as it needs to get a whole ParArrayND.

With 16^3 grids it takes 1.7x the time of the raw array. With 128^3 grids it takes 7.5x the time of the raw array. I suspect that with the smaller grids, most of the time is eaten up in launch overhead, so the difference is less significant; it might also be the missing parallelization. With the larger grids, DRAM bandwidth is probably limiting performance, so the increased DRAM usage is more apparent. If a ParArrayND object is about 8x the size of a Real, that would probably account for the difference, but that's just a guess.

forrestglines commented 4 years ago

How do we shove everything into a single array, with the interface that allows individual packages to add their own physics variables to state? Especially since the variables are different rank and shape, the high-dimensional array that owns all of them would be ragged. This raggedness is why I keep mentioning array of views or view of views. I think that's the use case where such a data structure is ideal.

Just considering cell-centered variables, the last three dimensions share the same xyz shape. You use a 4th-rank array to hold all variables. If variables are vectors or tensors of higher rank, then you squash those extra dimensions along the 4th rank.

We did this in K-Athena by having a 4-dimensional array where the first two rows were density and pressure and the next three were velocity.
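For example (index names and order here are just illustrative, not necessarily K-Athena's):

enum { IDN = 0, IPR = 1, IVX = 2, IVY = 3, IVZ = 4, NVAR = 5 };
// All cell-centered variables in one 4D view; vector components are just
// extra slots along the leading rank.
Kokkos::View<double ****> cons("cons", NVAR, nz, ny, nx);
// density of cell (k,j,i):    cons(IDN, k, j, i)
// x-velocity of cell (k,j,i): cons(IVX, k, j, i)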

I suppose one option is we could own our own big pool of contiguous memory and manage it ourselves. That seems painful, though.

That is what I'm suggesting, and yes, it would be painful. It's at the top of my wishlist of what Containers should do for us. I've seen it done in other codes, but it was just as poor and kludgy as you would expect.

I've been thinking about how I would do it, but I'm still trying to understand the Container interface to see where I can make changes. It might also impose some restrictions on layouts and on when variables can be added.

Can we make the cost negligible?

We can do real code testing, but I almost feel like you can show on paper, without writing any code, that the costs won't be negligible in all cases. There's either a trade-off of more reads from DRAM, or you reduce parallelization in order to share what's been read from DRAM. The additional reads from DRAM are an issue for kernels over the whole domain (like update functions), and the reduced parallelization is an issue for boundary packing functions.

pgrete commented 4 years ago

I suppose one option is we could own our own big pool of contiguous memory and manage it ourselves. That seems painful, though.

That is what I'm suggesting, and yes, it would be painful. It's at the top of my wishlist of what Containers should do for us. I've seen it done in other codes, but it was just as poor and kludgy as you would expect.

I've been thinking about how I would do it, but I'm still trying to understand the Container interface to see where I can make changes. It might also impose some restrictions on layouts and on when variables can be added.

If I remember our discussion correctly, then allocating variables on the fly is not really a use case (variables are defined during initialization and then allocated later; similarly, in the case of copying a container for another stage of the integrator, the allocation would only happen once). Thus, I could see how allocating a large View covering all Variables of a single type could work, given that we already have abstractions in place to extract subviews (here then applied to variables).
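Sketch of what I have in mind (extents and the offset bookkeeping are placeholders):

// One allocation holds all cell-centered Variables of a Container.
Kokkos::View<double ****> pool("cell_vars", total_var_slots, nz, ny, nx);

// Variable v occupies slots [offset, offset + nslots) along the first rank.
auto var_v = Kokkos::subview(pool, std::make_pair(offset, offset + nslots),
                             Kokkos::ALL, Kokkos::ALL, Kokkos::ALL);
// var_v shares memory with pool, so per-variable kernels and
// all-variable kernels touch the same data.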

jdolence commented 4 years ago

I'm seeing some interesting results on CPUs too, actually. For example,

iterate_variables/raw_array 0.70594

suggesting the naive way can actually be more performant than the raw 4-D array. This ratio does seem to depend on the problem size -- perhaps some caching effect or something. Anyway, thought I'd mention it to remind us all that we should also pay attention to CPU performance.

forrestglines commented 4 years ago

I'm seeing some interesting results on CPUs too, actually. For example,

iterate_variables/raw_array 0.70594

suggesting the naive way can actually be more performant than the raw 4-D array. This ratio does seem to depend on the problem size -- perhaps some caching effect or something. Anyway, thought I'd mention it to remind us all that we should also pay attention to CPU performance.

That's concerning, I'd expect almost no difference between the solutions for CPUs. To me that would suggest that either the looping pattern, the data layout, or the vectorization is not working as expected.

forrestglines commented 4 years ago

After making the changes suggested by @jdolence so that I have the correct number and layout of variables, I don't quite understand the performance now.

With GPUs, on small 16^3 arrays, the naive solution is about 5 times slower than the raw arrays (expected, more kernels are launched), and the view of views is only about 2% slower (understandable, kernel launch overhead should be the dominating factor here).

On GPUs with a larger 256^3 array, where kernel launch overhead should probably be negligible, the naive solution is slightly faster, on the order of 0.1%. I'm guessing that it's the reduced arithmetic to index into the 3D arrays used in the scalar kernels whereas it's always a 4D access in the raw arrays.

More strange is the view of views, which is only 5% slower than the raw array on the 256^3 grid. I expected it to be twice as slow when the kernel is DRAM bandwidth limited. So either this kernel is not DRAM bandwidth limited, the view of views isn't adding more DRAM accesses (I'm not sure why it wouldn't), or most of the DRAM accesses happen regardless of what we're doing inside the kernel (which would be a concerning Kokkos performance hurdle).

I'm seeing some interesting results on CPUs too

I tested CPUs after fixing the test; the naive way is now 25% slower, way more than expected. The view of views is 16x slower, probably because it breaks vectorization. If the GPU solution ends up being slow on CPUs, I'll hide it behind an interface (probably a for_each) that changes implementation depending on architecture.

By the way, for CPU performance we've only been interested in the Intel compiler with AVX 512 vectorization as that reflects the production machines available to us.

I'm going to expand the test a little bit to probe different grid sizes and different kernels. I'll probably test an "update with fluxes" function that's a realistic low arithmetic intensity function and a boundary packing function that's a realistic low parallelization function.

Yurlungur commented 4 years ago

Thanks for doing all this, @forrestglines. If I'm reading this correctly, assuming the results don't change when we understand things better, it sounds like a view of views and the "vector of views" are both acceptably performant on the GPU and CPU respectively. We need to understand things, of course. But if that's the case, it would be very good news, since it would imply we don't have to be too clever.

forrestglines commented 4 years ago

I did some more testing in a self-contained repo to compare Kokkos and CUDA performance using a single array versus using an array of arrays structure. I tested two kernels: one kernel where an input array is doubled and written to another array across the entire grid (so a high bandwidth, high parallelization test) and another kernel where the lower z boundary of the grid is copied into a buffer (a high bandwidth, low parallelization test).
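For reference, the second kernel is roughly this in Kokkos (names and extents illustrative):

// Copy the NGHOST-deep lower-z slab of all 10 variables into a flat buffer.
Kokkos::parallel_for("pack_zlow",
  Kokkos::MDRangePolicy<Kokkos::Rank<4>>({0, 0, 0, 0}, {10, NGHOST, n, n}),
  KOKKOS_LAMBDA(const int v, const int k, const int j, const int i) {
    buf(v, k, j, i) = q(v, k, j, i);
  });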

Summary: It doesn't look like array of arrays increases DRAM accesses, which is very surprising to me and not how I expected the device to behave. As far as I know, the DRAM lookup for the first layer of arrays isn't cached or paged for subsequent threads running on an SM on the GPU, so I'm not sure why the arithmetic intensity isn't clearly lower. Array of arrays could have performance penalties, maybe on the order of 20% in the worst case, but it's not the factor of 2x that I was expecting. I think we can continue forward, but with the possibility that we might have to re-implement how memory is managed.

More details:

I first setup a kernel that doubles an input array of 10 variables and feeds that into an output array. This is a higher arithmetic intensity than some of the lower AI loops in the code (which might be the du/dt update) but this was simpler to understand. The kernel operated over a 2d grid for raw views and a view of 1d views for the view of views stand in. I implemented it in both Kokkos and CUDA to make sure it wasn't Kokkos that added in extra DRAM transactions.

[figure: cells per second for the doubling kernel, raw views vs. view of views, in Kokkos and CUDA]

Here's the performance for that: the top panels show the number of 10-variable cells that could be done per second, the bottom panels show the ratio of view of views cells per second over raw view cells per second, and the right panels are just a restricted range of n to show detail.

I saw that raw arrays vs. array of arrays makes more of a difference for CUDA than it does for Kokkos, and only at small n counts. I suspect that some extra overhead in setting up Kokkos kernels and views is responsible, but I'm surprised that Kokkos outperformed the CUDA array of arrays in some cases. At larger counts of n, there's little difference, even though DRAM bandwidth is probably the most constricting in that regime. Maybe some kind of caching is going on? There's also a jump in performance around $n = 20^3$, which I don't know what to make of.

I then set up a kernel that does the z boundary buffer packing for 10 variables on just one face from an $n^3$ grid. This kernel has lower parallelization, but all memory accesses are still coalesced (the x boundary packing function might be worse) and DRAM bandwidth should be the limiting factor at high n.

[figure: cells per second for the z-boundary packing kernel, raw views vs. view of views, in Kokkos and CUDA]

With a lower parallelization, the differences between view of views and the raw view are more apparent, but it isn't terrible. In the 16^3 case the Kokkos view of views slightly outperforms the raw array, which is surprising.

What's more telling is how much faster CUDA is compared to Kokkos for low $n$, which suggests to me that we need to maximize parallelization in order to close that gap.

Looking at different kernels with nvvp: I saw that view of views increases register usage by a few (on the order of ~5) for one kernel, which hopefully won't be a problem; DRAM transactions increase by a little bit with view of views (maybe a percent or a few, not double like I thought they would); and other extra memory transactions are happening with the view of views, but I couldn't figure out which ones.

I still have a lot of questions on why both Kokkos and CUDA performance behave this way, but it's time to move on.

In conclusion, I think fixing the memory management to avoid views of views is a second order performance improvement, but it might be necessary if the goal is performance parity with K-Athena. We should make an interface that allows us to fix the implementation later so that we can move on.

My testing repo is here:

https://github.com/forrestglines/GPU-View-of-Views-Performance

Yurlungur commented 4 years ago

Thanks for doing this, @forrestglines! I agree with your assessment. Let's discuss on the call today what the API should look like so that we can fix things later if needed.

At larger counts of n, there's little difference, even though DRAM bandwidth is probably the most constricting in that regime. Maybe some kind of caching is going on?

I am very out-of-date on my GPU understanding. Does the GPU automatically cache in the same way that the CPU does? I remember back in the old days, we had to do that manually.

There's also a jump in performance around $n = 20^3$, which I don't know what to make of.

There's a jump in relative performance, but overall a performance hit for all modes, right? That's probably some kind of vectorization or memory bandwidth boundary? Or a cache size limit if there is automatic caching?

I wonder if some of the difficult to understand timings we're experiencing are due to statistical error on the machine. It might be useful to do some timing tests with multiple trials and plot error bars. It might also provide no useful information. Probably not worth the time, if we've convinced ourselves this is okay for now.

dholladay00 commented 4 years ago

Do we know the total memory required for the container (or can it be cheaply precalculated)? Or is it that the inner views are coming from elsewhere?

If we are actually allocating on the fly, we can guarantee contiguity by allocating one giant flat view and using placement new for the inner views.
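Something along these lines, though this sketch uses Kokkos's pointer-wrapping (unmanaged) View constructor rather than placement new proper; sizes and names are placeholders:

// One flat device allocation backs all the inner views, so the data are
// guaranteed contiguous.
Kokkos::View<double *> pool("pool", total_size);

using Inner = Kokkos::View<double ****, Kokkos::MemoryTraits<Kokkos::Unmanaged>>;
std::vector<Inner> vars;
size_t offset = 0;
for (const auto &s : var_shapes) { // e.g. {L, nz, ny, nx} per variable
  vars.emplace_back(pool.data() + offset, s[0], s[1], s[2], s[3]);
  offset += s[0] * s[1] * s[2] * s[3];
}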

Yurlungur commented 4 years ago

The memory can be precalculated when the view of views is constructed.

forrestglines commented 4 years ago

I am very out-of-date on my GPU understanding. Does the GPU automatically cache in the same way that the CPU does? I remember back int the old days, we had to do that manually.

That's what I thought too, but looking here https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/analysis/report/cudaexperiments/kernellevel/memorystatisticsglobal.htm it looks like for read-only access to global memory, the data can be cached with texture memory, which I think lives with L2. This wasn't supported on the Kepler devices. The page makes it seem like it usually takes some work to get the compiler to recognize that.
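The usual trick in raw CUDA (not Parthenon/Kokkos code) is marking input pointers const __restrict__, or forcing it with __ldg():

__global__ void scale(double *out, const double *__restrict__ in,
                      double fac, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  // const __restrict__ (or an explicit __ldg) lets nvcc route the load
  // through the read-only/texture data cache where the hardware supports it.
  if (i < n) out[i] = fac * __ldg(&in[i]);
}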

This also made me realize that I have to double check these results on a Volta card, since I used a Pascal for this test out of convenience.

There's a jump in relative performance, but overall a performance hit for all modes, right? That's probably some kind of vectorization or memory bandwidth boundary? Or a cache size limit if there is automatic caching?

In the top panels, the y-axis should be a proxy for flops, so that it looks like certain smaller cell counts are pushing more flops for this particular operation. I find this very strange.

I wonder if some of the difficult to understand timings we're experiencing are due to statistical error on the machine. It might be useful to do some timing tests with multiple trials and plot error bars. It might also provide no useful information. Probably not worth the time, if we've convinced ourselves this is okay for now.

I tested each of these kernels for at least 0.1 s, with the smaller kernel sizes running more than 1e5 times each. I think these are well sampled, but without error bars I can't really say.