jananisriram closed this pull request 2 weeks ago
This pull request was exported from Phabricator. Differential Revision: D58313958
This pull request has been merged in pytorch/benchmark@f4cbf782eccc759c13a013beacc8f79534dbd642.
Summary: Extend support for reducing across individual dimensions on 2-dimensional matrices by allowing for varying block sizes on both the `M` (first) and `N` (second) dimensions.

The existing kernel performed a simplified reduction, assuming that the entire reduction dimension fit within one thread block. The new kernel implementation removes this assumption, allowing both the reduction and the non-reduction dimensions to span multiple thread blocks. This implementation also enables autotuning on block sizes for both the `M` and `N` dimensions.

For 1D results, add a `sum_then_buffer` configuration which decides which kernel implementation to run. `sum_then_buffer` sums individual blocks of input and adds these sums into a buffer. `buffer_then_sum` adds blocks of raw input into a buffer, then reduces the buffer.

Reviewed By: davidberard98
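The two accumulation orders described above can be sketched in plain Python (this is an illustrative sketch of the strategy, not the actual Triton kernel from the revision; the function name `blockwise_sum` and the list-of-rows representation are assumptions for the example):

```python
def blockwise_sum(x, block_m=2, sum_then_buffer=True):
    """Reduce a 2D matrix (list of rows) over its rows in blocks of block_m rows.

    sum_then_buffer=True : reduce each block to a partial sum, then add the
                           partial sum into the output buffer.
    sum_then_buffer=False: accumulate raw rows into a (block_m, n) buffer,
                           then reduce the buffer once at the end.
    Both return the same column sums; they differ in when the reduction happens.
    """
    m, n = len(x), len(x[0])
    if sum_then_buffer:
        out = [0.0] * n
        for i in range(0, m, block_m):
            block = x[i:i + block_m]
            partial = [sum(col) for col in zip(*block)]   # reduce this block
            out = [o + p for o, p in zip(out, partial)]   # add partial sum to buffer
        return out
    # buffer_then_sum: add raw rows into a fixed-size buffer, reduce at the end
    buf = [[0.0] * n for _ in range(block_m)]
    for i in range(0, m, block_m):
        for j, row in enumerate(x[i:i + block_m]):
            buf[j] = [b + v for b, v in zip(buf[j], row)]
    return [sum(col) for col in zip(*buf)]
```

In the real kernel the choice matters for register pressure and numerical accumulation order, which is why it is exposed as a tunable configuration rather than fixed.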
Differential Revision: D58313958