jinderek opened this issue 4 years ago
If someone needed it today, then I'd support it via a define_extern that just does a matrix multiply. What we'd really like to do though is be able to map more general tensor contractions to the wmma instruction, and that's considerably harder.
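For concreteness, a minimal sketch of that define_extern route might look like the following. The extern symbol wmma_matmul and its tensor-core implementation are hypothetical; it would be written separately (e.g. against cuBLAS or the CUDA WMMA API), follow Halide's extern calling convention, and be provided at link time.

#include "Halide.h"
using namespace Halide;

int main() {
    // Hypothetical extern symbol "wmma_matmul": a separately written
    // tensor-core GEMM exposed through Halide's extern calling convention.
    ImageParam A(Float(16), 2), B(Float(16), 2);
    Func C("C");
    std::vector<ExternFuncArgument> args = {A, B};
    // C is a 2-D float output produced entirely by the extern stage.
    C.define_extern("wmma_matmul", args, Float(32), 2);
    // Downstream Halide stages can then consume C(i, j) like any other Func.
    return 0;
}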
@abadams Are there examples of something similar in Halide? If so, I can check whether I can get someone here interested in working on something like this.
I don't think there is anything close currently. Here's a strawman sketch of how to support it:
Add a tensor_core scheduling directive that takes two Vars and an RVar. It is expected that there are outer dimensions that are marked GPUBlocks. There may also be GPUThreads dimensions, but there may not be GPULanes. tensor_core uses the full warp, so it's implicitly a GPULanes dimension of size 32. Here's a rough example of what a schedule might look like:
RDom k(0, 100);
C(i, j) += cast<float>(A(i, k)) * B(k, j);
C.update().tile(i, j, ii, ji, 4, 4).split(k, ko, ki, 4).gpu_blocks(i, j).reorder(ii, ji, ki, ko, i, j).tensor_core(ii, ji, ki);
We'd then assert a few things:
- The stage's definition must be exactly the one written above.
- The extents of the Var, Var, and RVar passed to tensor_core must all be statically known to be four.
Over time the goal would be to relax these constraints as much as possible. Supporting constant values other than four is easy by just unrolling/masking. Supporting different patterns is harder. Commutative reshufflings are fine, as are column-major vs. row-major layouts. But it would be nice to do something sensible in the face of things like convolutions: A(k) * B(i + k, j), by broadcasting A and shearing B as necessary.
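Purely as an illustration of that convolution-style pattern, here is what the definition looks like as a Halide pipeline; the Funcs A, B, and conv and their contents are made up for the example, and mapping this onto the matmul idiom would require materializing the broadcast of A and the shear of B as described above.

#include "Halide.h"
using namespace Halide;

int main() {
    // Illustrative only: the pattern A(k) * B(i + k, j).
    // A is broadcast across j; B is read at a sheared coordinate (i + k).
    Var i("i"), j("j");
    Func A("A"), B("B"), conv("conv");
    A(i) = cast<float16_t>(i);          // stand-in 1-D filter
    B(i, j) = cast<float16_t>(i + j);   // stand-in input

    RDom k(0, 3);
    conv(i, j) = 0.f;
    conv(i, j) += cast<float>(A(k)) * B(i + k, j);

    conv.realize({16, 16});
    return 0;
}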
To get something with OK performance, your sketch seems like a good start, though it will still be a lot of work to get it working. We have prototyped some GPU scheduling primitives in our own standalone DSL; I gave a lightning talk about it at the TVM conference earlier this month. I'd like to port that over to Halide, and I'm recruiting some folks internally to help, or I'll tackle it if I get a good intern.
@abadams Here is a question from @frengels: Are the proposed rules above up to date? The example seems to handle a single fragment, which is 4x4. But a fragment is just a part of a tile, which in the typical case is 16x16, so it would seem more accurate to have it as
C(i, j) += cast<float>(A(i, k)) * B(k, j);
C.update().tile(i, j, ii, ji, 16, 16).split(k, ko, ki, 16).gpu_blocks(i, j).reorder(ii, ji, ki, ko, i, j).tensor_core(ii, ji, ki);
rather than how it is now where sizes are required to be 4.
The goal above was to hit a single fragment, and then to use outer loops to hit the 16x16 idiom. I'm not 100% sure what I wrote is correct for a single fragment though. The single-fragment ops for tensor cores are a little confusing.
@abadams I think it's probably best not to focus too much on the single fragment. When performing an operation on a fragment, that operation still gets applied to all fragments within the tile, as far as I know, so I don't think it's really possible to hit a single fragment on its own. It also makes finding the outer loops more complicated, because a record of the loops has to be kept before a tensor core dimension is hit, unless there's an easy way to get back to the outer loops that I've missed?
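For reference, the CUDA-level wmma idiom being discussed looks roughly like the untested sketch below, assuming the common 16x16x16 shape: the whole warp cooperates on one 16x16 output tile, and each thread's fragment is an opaque, per-thread slice of that tile, which is why a single fragment can't really be addressed in isolation.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_tile(const half *a, const half *b, float *c) {
    // One warp computes one 16x16 output tile; fragments are per-thread,
    // opaque slices of that tile held in registers.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension of A
    wmma::load_matrix_sync(b_frag, b, 16);  // leading dimension of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}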
Is this at all useful? These guys seem to have implemented tensor core support. It looks like they branched off Halide somewhere between 8.0 and 10.0:
https://github.com/TUE-EE-ES/HalideTCU
https://www.es.ele.tue.nl/~tbasten/papers/Scopes_camera_ready.pdf
NVIDIA Volta and Turing GPUs have Tensor Cores, which can massively accelerate large matrix operations (https://www.nvidia.com/en-us/data-center/tensorcore/).
So is there any plan to support Tensor Core?