Open sherhut opened 4 years ago
mentioned in issue llvm/llvm-project#44224
What's the status here, Stephan?
I've got another question here: do you actually want to do the promotion before introducing the gpu.launch operation or its equivalent? I can see arguments for and against it. If we do it after mapping to GPU, we would have already translated loop.parallel into a nest of sequential loops according to the current mapping scheme, so a simple extension to what we already have in -test-gpu-memory-promotion should be sufficient.
I have filed bug #44879 for moving copy and slice over.
I agree with moving copy to Standard and also with using linalg.copy for now. We do the same with linalg.slice already: there is an internal bug for moving it over, and we already use it outside of Linalg when lowering LHLO reduce to GPU directly.
linalg.copy should become std.copy, like other things that have evolved from Linalg concepts in the past. If anyone has cycles for this, by all means go for it.
Re raising: this is an interesting point, and I would argue that the part that "raises" is the part that would analyze pointwise accesses into regions to determine which copies are necessary.
Taking a subview and emitting an alloc + copy for it specifically avoids this raising. The fact that the copy op is in a dialect that happens to operate on higher-level abstractions is orthogonal. Even when we have std.copy, the abstraction it operates on will still be the same: there is no essential meaning carried by the fact that copy lives in Linalg today.
Still, I would argue that if we agree a high-level copy (as opposed to loops over loads/stores) is what we want, we can just use linalg.copy today.
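For concreteness, here is a rough sketch of what that subview + alloc + copy sequence could look like; the op syntax is approximate for the MLIR of the time, and the shapes, the layout map `#tile_map`, and the value names (`%src`, `%i`, `%j`, the `%c*` constants) are made up for illustration:

```mlir
// Take a view of just the tile the loop nest actually touches; the strided
// layout of the result (#tile_map) is elided and the subview syntax is approximate.
%tile = subview %src[%i, %j][%c32, %c32][%c1, %c1]
    : memref<1024x1024xf32> to memref<?x?xf32, #tile_map>

// Allocate a staging buffer in GPU workgroup (shared) memory (address space 3).
%promoted = alloc() : memref<32x32xf32, 3>

// A single high-level copy instead of explicit loops over loads/stores;
// today this would be linalg.copy, later possibly std.copy.
linalg.copy(%tile, %promoted) : memref<?x?xf32, #tile_map>, memref<32x32xf32, 3>
```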
I have elements of this, starting from non-parallel loops and going to shared memory, which share parts of the implementation with Linalg copies. If it's not urgent, I can implement that after I'm done with the other things in my pipeline.
One philosophical concern I might have about introducing linalg.copy is that it feels like raising in some cases, and the Linalg rationale document makes the case against raising. This can be resolved by promoting the copy operation from Linalg to Standard. Since Standard already has DMA copy operations, I don't see why it wouldn't also have a non-DMA copy.
Ideally, we would avoid duplicating implementations. The easiest would probably be to create a proper subview and call the Linalg promotion transformation, which is implemented as a DRR. This is meant to compose with other canonicalizations etc. and give you a statically shaped operation after tiling whenever possible.
assigned to @sherhut
Extended Description
The idea is to have a helper that, given a parallel loop nest and a load operation, creates a copy of the loaded-from memref in shared memory and redirects the loads to that copy.
There are currently two ways to compute how much of the memref needs promoting: either via affine reasoning (which would come from affine.parallel) or via subviews as created by Linalg.
Ideally, we can take the subview route now, promote the entire subview, and reuse the implementation with an affine-based backing later.
We also need to revisit the requirement for barriers and whether we want implicit barriers for loop.parallel in this context, as promotion will need a barrier. We could begin by simply inserting the right GPU dialect barrier. Promotion needs to assume a mapping from loops to hardware indices anyway.
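To make the intent concrete, a hypothetical before/after sketch (shapes, value names, and index constants are invented for the example, and the barrier placement and which threads perform the copy depend on the loop-to-hardware mapping discussed above):

```mlir
// Before: every iteration loads directly from the original memref.
loop.parallel (%i, %j) = (%c0, %c0) to (%c32, %c32) step (%c1, %c1) {
  %v = load %src[%i, %j] : memref<32x32xf32>
  // ... use %v ...
}

// After promotion: stage the accessed region in workgroup (shared) memory,
// synchronize, then redirect the loads to the promoted buffer.
%shared = alloc() : memref<32x32xf32, 3>
linalg.copy(%src, %shared) : memref<32x32xf32>, memref<32x32xf32, 3>
gpu.barrier
loop.parallel (%i, %j) = (%c0, %c0) to (%c32, %c32) step (%c1, %c1) {
  %v = load %shared[%i, %j] : memref<32x32xf32, 3>
  // ... use %v ...
}
```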