tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Autotune Conv2d #805

Open · agelas opened this issue 1 year ago

agelas commented 1 year ago

Feature description

This is a more specific feature request related to issue #617. After playing around a bit with the conv2d shader, I think it's worth reworking the kernel so that autotuning can adjust the workgroup size and unroll the innermost loop. More details below, but I think we can achieve greater utilization and saturation of the host GPU with some relatively straightforward adjustments.
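To make the idea concrete, here is a minimal standalone sketch of the selection loop I have in mind. The `Conv2dConfig` fields and the `run_kernel` closure are placeholders for the tunable shader parameters and the actual wgpu dispatch, not the real burn-wgpu API:

```rust
use std::time::Instant;

/// One candidate configuration for the conv2d kernel. The fields here are
/// hypothetical; the real shader may expose different tunable parameters.
#[derive(Debug, Clone, Copy)]
struct Conv2dConfig {
    workgroup_size: (u32, u32),
    unroll_inner_loop: bool,
}

/// Pick the fastest configuration by timing each candidate on a
/// representative input. `run_kernel` stands in for dispatching the real
/// conv2d shader with the given configuration.
fn autotune_conv2d<F>(candidates: &[Conv2dConfig], mut run_kernel: F) -> Conv2dConfig
where
    F: FnMut(Conv2dConfig),
{
    let mut best = (candidates[0], f64::INFINITY);
    for &config in candidates {
        // Warm-up run so shader compilation/caching does not skew the timing.
        run_kernel(config);
        let start = Instant::now();
        run_kernel(config);
        let elapsed = start.elapsed().as_secs_f64();
        if elapsed < best.1 {
            best = (config, elapsed);
        }
    }
    best.0
}

fn main() {
    let candidates = [
        Conv2dConfig { workgroup_size: (8, 8), unroll_inner_loop: false },
        Conv2dConfig { workgroup_size: (16, 16), unroll_inner_loop: true },
        Conv2dConfig { workgroup_size: (32, 8), unroll_inner_loop: true },
    ];
    // A real implementation would dispatch the wgpu compute pipeline here;
    // this stub only simulates some work so the example runs on its own.
    let best = autotune_conv2d(&candidates, |config| {
        std::thread::sleep(std::time::Duration::from_millis(
            (config.workgroup_size.0 / 8) as u64,
        ));
    });
    println!("selected configuration: {best:?}");
}
```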

Feature motivation

This can help speed up models and use resources more efficiently. As the rudimentary testing below shows, autotuning can usually find a (barely) faster pipeline*. What I'd like to draw attention to, though, is the variance: autotuning seems able to identify much more consistent compute pipelines, which allows for better predictability and more stable resource allocation.

* The computer I ran this on is a bit of a potato right now. I'd be curious whether someone sees similar trends with larger batch sizes, higher-resolution images, and/or more channels in the input tensor, as well as with a graphics API other than Vulkan.

Notes on the data

  1. The only GPU I had readily available was an old Nvidia GT 1030. All results below were run using Vulkan; the Dx12 drivers have some weird invocation limit that I'm having trouble getting around.
  2. This data is very much cherry-picked. There were some runs where the variance of the autotuned runs was higher and/or the minimum time of the benchmark run was higher, but the overall trend aligns with what's shown below.
  3. All benchmarks were run with no bias tensor and `ConvOptions { stride: [1, 1], padding: [1, 1], dilation: [1, 1], groups: 1 }`. The bulk of this data is centered on input tensors of shape `[32, 3, 224, 224]`, which is a fairly reasonable input for ImageNet-scale models such as VGG or ResNet-50 (a rough sketch of this setup follows the table).
| Input Tensor Shape | Type + Bench Run | Samples | Mean | Variance | Median | Min | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| [8, 3, 224, 224] | Autotune 1 | 5 | 489.791ms | 80.587µs | 492.758ms | 478.120ms | 501.855ms |
| [8, 3, 224, 224] | Default 1 | 5 | 504.855ms | 281.282µs | 511.095ms | 481.818ms | 522.776ms |
| [10, 3, 224, 224] | Autotune 2 | 5 | 619.695ms | 192.013µs | 623.878ms | 593.247ms | 633.878ms |
| [10, 3, 224, 224] | Default 2 | 5 | 614.064ms | 215.598µs | 618.749ms | 594.727ms | 630.743ms |
| [12, 3, 224, 224] | Autotune 3 | 5 | 730.587ms | 215.773µs | 736.289ms | 711.603ms | 748.222ms |
| [12, 3, 224, 224] | Default 3 | 5 | 768.905ms | 6.479ms | 728.904ms | 718.285ms | 929.267ms |
| [32, 3, 224, 224] | Autotune 4 | 5 | 2.082s | 166.660µs | 2.084s | 2.066s | 2.102s |
| [32, 3, 224, 224] | Default 4 | 5 | 2.122s | 671.122µs | 2.129s | 2.090s | 2.161s |
| [32, 3, 224, 224] | Autotune 5 | 5 | 2.083s | 41.796µs | 2.086s | 2.075s | 2.092s |
| [32, 3, 224, 224] | Default 5 | 5 | 2.100s | 173.529µs | 2.102s | 2.076s | 2.115s |
| [32, 3, 224, 224] | Autotune 6 | 5 | 2.112s | 282.122µs | 2.118s | 2.083s | 2.128s |
| [32, 3, 224, 224] | Default 6 | 5 | 2.110s | 785.019µs | 2.096s | 2.083s | 2.147s |
| [32, 3, 224, 224] | Autotune 7 | 5 | 2.120s | 131.166µs | 2.118s | 2.102s | 2.135s |
| [32, 3, 224, 224] | Default 7 | 5 | 2.141s | 737.797µs | 2.127s | 2.115s | 2.189s |
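For reference, here is a rough sketch of the benchmark body described in note 3. The weight shape `[64, 3, 3, 3]` (output channels and kernel size) is my own assumption, and the exact import paths and `random` signature vary across Burn versions, so treat this as illustrative only:

```rust
use burn::tensor::{backend::Backend, module::conv2d, ops::ConvOptions, Distribution, Tensor};

/// One forward conv2d pass over a [32, 3, 224, 224] input with the options
/// listed above. No bias tensor, matching the benchmark setup.
fn bench_conv2d<B: Backend>() {
    let x = Tensor::<B, 4>::random([32, 3, 224, 224], Distribution::Default);
    // Assumed weight layout: 64 output channels, 3x3 kernel.
    let weight = Tensor::<B, 4>::random([64, 3, 3, 3], Distribution::Default);
    let options = ConvOptions::new([1, 1], [1, 1], [1, 1], 1);

    let output = conv2d(x, weight, None, options);
    println!("output shape: {:?}", output.dims());
}
```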

Implementation

I'll link a PR that's a major WIP and brute-forces the ability to autotune `kernel::conv::conv2d`. I think we'll need to sneak a thin abstraction layer in between `module_ops::conv2d` and `kernel::conv::conv2d`.
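As a rough illustration of what that layer might look like, here is a sketch of a dispatcher that `module_ops::conv2d` could call instead of hitting `kernel::conv::conv2d` directly. The names, fields, and caching strategy are purely hypothetical, not the actual burn-wgpu internals:

```rust
use std::collections::HashMap;

/// Key used to cache tuning results; in practice more fields (channels,
/// kernel size, conv options) would likely be included.
#[derive(Hash, PartialEq, Eq, Clone)]
struct Conv2dKey {
    batch_size: usize,
    height: usize,
    width: usize,
}

/// Which kernel variant to dispatch for a given input shape.
#[derive(Clone, Copy)]
enum Conv2dKernel {
    Default,
    Tuned { workgroup_size: (u32, u32), unroll: bool },
}

/// Thin layer between module_ops::conv2d and the kernel: on a cache miss it
/// runs the tuner once, then reuses the winning variant for that shape.
#[derive(Default)]
struct Conv2dDispatcher {
    cache: HashMap<Conv2dKey, Conv2dKernel>,
}

impl Conv2dDispatcher {
    fn select(&mut self, key: Conv2dKey, tune: impl FnOnce() -> Conv2dKernel) -> Conv2dKernel {
        *self.cache.entry(key).or_insert_with(tune)
    }
}
```

The point of the layer is just that autotuning happens once per shape and the rest of the call path stays unchanged.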

nathanielsimard commented 1 year ago

Thanks @agelas for this exploration! We are currently refactoring the wgpu backend to reduce allocations, which will probably help a lot with reducing the variance. We are also thinking of adding an abstraction over autotuning; it isn't very convenient to autotune new operations at the moment :/