Opened by agelas
Feature description
This is a more specific feature request related to Issue #617. After playing around a bit with the conv2d shader, I think it's worth reworking the kernel so that autotuning can adjust the workgroup size and unroll the innermost loop. More details below, but I think we can achieve greater utilization and saturation of the host GPU with some relatively straightforward adjustments.
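To make the idea concrete, here is a minimal sketch of the tuning side. It assumes nothing about Burn's internals: all names (`Conv2dTunables`, `candidates`, `autotune`) are hypothetical, and a real implementation would bind the tensors, dispatch the actual WGSL kernel, and synchronize with the GPU rather than time an arbitrary closure.

```rust
use std::time::Instant;

/// Hypothetical set of parameters for the autotuner to sweep over.
#[derive(Clone, Copy, Debug)]
struct Conv2dTunables {
    /// Threads per workgroup along x/y.
    workgroup_size: (u32, u32),
    /// Unroll factor for the innermost (kernel-width) loop.
    unroll: u32,
}

/// Candidate configurations to benchmark. Real candidates would be
/// constrained by device limits (e.g. max invocations per workgroup).
fn candidates() -> Vec<Conv2dTunables> {
    let mut out = Vec::new();
    for &workgroup_size in &[(8, 8), (16, 16), (32, 8)] {
        for &unroll in &[1, 2, 4] {
            out.push(Conv2dTunables { workgroup_size, unroll });
        }
    }
    out
}

/// Pick the fastest configuration by timing `run`, which would dispatch
/// the conv2d kernel with the given configuration and block until the
/// GPU finishes.
fn autotune<F: FnMut(Conv2dTunables)>(mut run: F) -> Conv2dTunables {
    candidates()
        .into_iter()
        .min_by_key(|&config| {
            let start = Instant::now();
            // A handful of repetitions to smooth out per-launch noise.
            for _ in 0..10 {
                run(config);
            }
            start.elapsed()
        })
        .expect("at least one candidate")
}
```

Tuning once per problem shape and caching the winner would keep the benchmarking cost off the hot path.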
Feature motivation
This can help speed up models and use resources more efficiently. As you can see from some rudimentary testing below, autotuning can usually find a (barely) faster pipeline*. What I'd like to draw attention to, though, is the variance: autotuning identifies much more consistent compute pipelines, which allows for better predictability and more stable resource allocation.
* The computer I did this on is a bit of a potato right now. I'd be curious whether someone can reproduce similar trends with larger batch sizes, higher-resolution images, and/or more channels in the input tensor, or with a graphics API other than Vulkan.
Notes on the data

All runs used the following convolution options:

```rust
ConvOptions { stride: [1, 1], padding: [1, 1], dilation: [1, 1], groups: 1 };
```

The bulk of this data is centered on input tensors of shape `[32, 3, 224, 224]`, which should be a fairly reasonable input for ImageNet-scale models like VGG or ResNet-50.
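For reference, the measured workload looks roughly like the sketch below. The weight shape `[64, 3, 3, 3]` is an illustrative choice (only the input shape and options are fixed above), and the exact Burn paths and signatures (e.g. `Distribution::Default`, whether `random` takes a device argument) may differ between versions.

```rust
use burn::tensor::{backend::Backend, module::conv2d, ops::ConvOptions, Distribution, Tensor};

/// Roughly the benchmarked workload; the 64-filter 3x3 weight is an
/// assumption, not part of the measurements above.
fn conv2d_workload<B: Backend>() -> Tensor<B, 4> {
    // Input: [batch, channels, height, width] = [32, 3, 224, 224].
    let x = Tensor::<B, 4>::random([32, 3, 224, 224], Distribution::Default);
    // Weight: [out_channels, in_channels / groups, kernel_h, kernel_w].
    let weight = Tensor::<B, 4>::random([64, 3, 3, 3], Distribution::Default);
    conv2d(x, weight, None, ConvOptions::new([1, 1], [1, 1], [1, 1], 1))
}
```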
Implementation

I'll link a PR that's a major WIP and brute-forces the ability to autotune `kernel::conv::conv2d`. I think we'll need to sneak a thin abstraction layer in between `module_ops::conv2d` and `kernel::conv::conv2d`.
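To illustrate, here is a minimal sketch of what that layer could look like, assuming a shape-keyed cache of tuned configurations. All names (`ConvKey`, `TunedConfig`, `Conv2dDispatcher`) are hypothetical, not Burn's actual API.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical key identifying a conv2d problem. A real layer would
/// likely also fold in padding, dilation, groups, and dtype.
#[derive(Clone, PartialEq, Eq, Hash)]
struct ConvKey {
    input_shape: [usize; 4],
    weight_shape: [usize; 4],
    stride: [usize; 2],
}

/// The launch configuration the autotuner settled on for one problem.
#[derive(Clone, Copy)]
struct TunedConfig {
    workgroup_size: (u32, u32),
    unroll: u32,
}

/// Sketch of the layer between `module_ops::conv2d` and
/// `kernel::conv::conv2d`: resolve a cached configuration, autotuning
/// on a miss, then hand it to the kernel launch.
struct Conv2dDispatcher {
    cache: Mutex<HashMap<ConvKey, TunedConfig>>,
}

impl Conv2dDispatcher {
    fn resolve(&self, key: ConvKey) -> TunedConfig {
        let mut cache = self.cache.lock().unwrap();
        *cache.entry(key).or_insert_with(|| {
            // Cache miss: benchmark the candidate configurations (as in
            // the autotune sketch above) and memoize the winner.
            TunedConfig { workgroup_size: (16, 16), unroll: 4 }
        })
    }
}
```

That way `module_ops::conv2d` stays oblivious to tuning, while the kernel launch picks up whatever configuration won for that shape.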
Thanks @agelas for this exploration! We are currently refactoring the wgpu backend to reduce allocations; this will probably help a lot with reducing the variance. We are thinking of adding an abstraction over autotuning as well, since it isn't very convenient for now to autotune new operations :/