Custom stencil operations

EelcoHoogendoorn commented 4 years ago

Let me start by saying this is a help request; not as issue as such. If there is a more appropriate place for those, I am all ears.

On a high level; I want to solve a 3d laplacian with non-constant coefficients. That means I cannot use any off-the-shelf optimised convolutional libraries since they all assume constant stencils per definition. I currently have this implemented using first order differential operators as div(non_constant_field*grad(T)); but thats quite inefficient, and I want to fuse that loop into a single stencil operation. Without reinventing the wheel of how to do efficient stencil operations on the GPU.

It seems to me that loopy is the right tool for the job. But I am not quite confident as to the best way to go about it. I believe the most optimal implementation is a wavefront of threads looping through the array, to make the best use of local memory; whereby each thread processes multiple array elements. But perhaps there are competitive methods that map better to the loopy paradigm?

Having examples of this nature in the project strikes me as very helpful; and id be happy to add a PR with such examples if I manage to figure out something worthwhile. That is, if I am not reinventing the wheel in the first place, since I could imagine people have used loopy before with exactly these type of use cases in mind.

EelcoHoogendoorn commented 4 years ago

Ive noticed that the tests contain some examples pertaining to stencil operations, which is nice.

But one thing I am trying to wrap my head around is boundary handling modes. It seems all examples and tests ive seen so far only deal with the case where the output array corresponds only to those elements that have the full stencil available. (output=input-stencil+1). However, being able to implement things like reflection or periodic bcs would be quite useful to me. Is the solution here to simply do the prefetching manually? And how do I override the default output shape?

Of course I could pad the arrays in a seperate operation, but that would double memory bandwidth compared to the fused ideal we are trying to pursue. (or maybe thats not true; and a gpu pure memcopy has very different performance characteristics unrelated to the bandwidth wrt the compute cores? EDIT: nope there does not seem to be such a thing)

kaushikcfd commented 4 years ago

Regaring competitive parallelization strategies: The 2.5D spatial tiling paper is an option for memory-bound stencil kernels. Work's in progress for loopy to support it.

However, being able to implement things like reflection or periodic bcs would be quite useful to me. Is the solution here to simply do the prefetching manually?

For periodic BCs: the indices of the array could be represented as modulo expressions. These modulo expressions being quasi-affine, prefetching etc. should follow naturally. However, BCs like reflective are a bit more complicated for loopy, as the array indices would be if-else expressions and such data-dependent expression won't play well with the prefetch transformation, sadly one would have to go with manual prefetching in such cases.

And how do I override the default output shape?

make_kernel takes a list of arguments like GlobalArg('out', dtype=float, shape=(10, 4)).

inducer commented 4 years ago

I agree with @kaushikcfd's advice. The only thing I might add is that @zachjweiner might have some further advice. AFAIK, his pystella uses Loopy for large-scale stencil operations. (though I'm not sure his application requires second derivatives)

For the case of second derivatives, the precise dependency tracking being developed by @jdsteve2 might also be of help.

zachjweiner commented 4 years ago

@EelcoHoogendoorn, some of what I've implemented in pystella is relevant to your use case. It has some stencil kernel generators (all implemented with NVIDIA GPUs in mind), essentially just boilerplate for my frequent use case. I do work with periodic boundary conditions, and since pystella is also distributed with MPI, I use padded input arrays out of necessity (and I think its a bit faster since it requires neither modulo nor conditionals).

You can look at the parallelize methods here for inspiration. The 2.5D tiling or "streaming" transformation is in there, too, but the required bits aren't currently in Loopy's master, as @kaushikcfd mentioned. (Hopefully not for long!)

FWIW (not to advertise too heavily), for my purposes I did develop some functionality for defining stencil operations symbolically, based on a Field object which keeps track of its indexing. It would let you expand your nested stencil quite neatly without an intermediary computation in the actual kernel. There's a simple example near the bottom of this notebook.

Regardless, I agree this would be a nice example to have and I'd be happy to help construct one.

EelcoHoogendoorn commented 4 years ago

Thanks for all the input. Glad to hear about the streaming prefetch; last time i checked (quite some years ago tbh) that was indeed the most performant technique. Funny how that paper is 10 years old and already they were complaining back then about memory bandwidth not growing as fast as compute. Im guestimating that problem compounded by at least an order of magnitude the past ten years; so all the more relevant this loop fusion business.

pystella looks really cool; thanks for the headsup. It seems indeed like there is some useful examples there for me. Im currently working on a framework that constructs convolution operators for any first order DEC operators on any n-dimensional cube grid. The non-loop-fused part of that was not too hard, and if I had way more time than I actually do id love to extend that with a little DSL like you do there, to compile arbitrary expression of first order differential operators into fused stencil operations. For now id be happy to efficiently loop over my manually fused stencil operations though.

inducer / loopy

Custom stencil operations #153