tracel-ai / burn

Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
https://burn.dev
Apache License 2.0

Autotune Conv2d #805

Open · agelas opened this issue 1 year ago

agelas commented 1 year ago

Feature description

This is a more specific feature request related to issue #617. After playing around a bit with the conv2d shader, I think it's worth reworking the kernel so that autotuning can adjust the workgroup size and unroll the innermost loop. More details below, but I think we can achieve greater utilization and saturation of the host GPU with some relatively straightforward adjustments.
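To make the idea concrete, here is a minimal standalone sketch of the selection loop I have in mind. The `Conv2dConfig` fields and the `run_kernel` closure are placeholders for the tunable shader parameters and the actual wgpu dispatch, not the real burn-wgpu API:

```rust
use std::time::Instant;

/// One candidate configuration for the conv2d kernel. The fields here are
/// hypothetical; the real shader may expose different tunable parameters.
#[derive(Debug, Clone, Copy)]
struct Conv2dConfig {
    workgroup_size: (u32, u32),
    unroll_inner_loop: bool,
}

/// Pick the fastest configuration by timing each candidate on a
/// representative input. `run_kernel` stands in for dispatching the real
/// conv2d shader with the given configuration.
fn autotune_conv2d<F>(candidates: &[Conv2dConfig], mut run_kernel: F) -> Conv2dConfig
where
    F: FnMut(Conv2dConfig),
{
    let mut best = (candidates[0], f64::INFINITY);
    for &config in candidates {
        // Warm-up run so shader compilation/caching does not skew the timing.
        run_kernel(config);
        let start = Instant::now();
        run_kernel(config);
        let elapsed = start.elapsed().as_secs_f64();
        if elapsed < best.1 {
            best = (config, elapsed);
        }
    }
    best.0
}

fn main() {
    let candidates = [
        Conv2dConfig { workgroup_size: (8, 8), unroll_inner_loop: false },
        Conv2dConfig { workgroup_size: (16, 16), unroll_inner_loop: true },
        Conv2dConfig { workgroup_size: (32, 8), unroll_inner_loop: true },
    ];
    // A real implementation would dispatch the wgpu compute pipeline here;
    // this stub only simulates some work so the example runs on its own.
    let best = autotune_conv2d(&candidates, |config| {
        std::thread::sleep(std::time::Duration::from_millis(
            (config.workgroup_size.0 / 8) as u64,
        ));
    });
    println!("selected configuration: {best:?}");
}
```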

Feature motivation

This can help speed up models and use resources more efficiently. As the rudimentary testing below shows, autotuning can usually find a (barely) faster pipeline*. What I'd like to draw attention to, though, is the variance: autotuning seems able to identify much more consistent compute pipelines, which allows for better predictability and more stable resource allocation.

* The computer I ran this on is a bit of a potato right now. I'd be curious whether someone sees similar trends with larger batch sizes, higher-resolution images, and/or more channels in the input tensor, as well as with a graphics API other than Vulkan.

Notes on the data

  1. The only GPU I had readily available was an old Nvidia GT 1030. All results below were run using Vulkan; the Dx12 drivers have some weird invocation limit that I'm having trouble getting around.
  2. This data is very much cherry-picked. There were some runs where the variance of the autotuned runs was higher and/or the minimum time of the benchmark run was higher, but the overall trend aligns with what's shown below.
  3. All benchmarks were run with no bias tensor and `ConvOptions { stride: [1, 1], padding: [1, 1], dilation: [1, 1], groups: 1 }`. The bulk of this data is centered on input tensors of shape `[32, 3, 224, 224]`, which is a fairly reasonable input for ImageNet-scale models such as VGG or ResNet-50 (a rough sketch of this setup follows the table).
| Input Tensor Shape | Type + Bench Run | Samples | Mean | Variance | Median | Min | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| [8, 3, 224, 224] | Autotune 1 | 5 | 489.791ms | 80.587µs | 492.758ms | 478.120ms | 501.855ms |
| [8, 3, 224, 224] | Default 1 | 5 | 504.855ms | 281.282µs | 511.095ms | 481.818ms | 522.776ms |
| [10, 3, 224, 224] | Autotune 2 | 5 | 619.695ms | 192.013µs | 623.878ms | 593.247ms | 633.878ms |
| [10, 3, 224, 224] | Default 2 | 5 | 614.064ms | 215.598µs | 618.749ms | 594.727ms | 630.743ms |
| [12, 3, 224, 224] | Autotune 3 | 5 | 730.587ms | 215.773µs | 736.289ms | 711.603ms | 748.222ms |
| [12, 3, 224, 224] | Default 3 | 5 | 768.905ms | 6.479ms | 728.904ms | 718.285ms | 929.267ms |
| [32, 3, 224, 224] | Autotune 4 | 5 | 2.082s | 166.660µs | 2.084s | 2.066s | 2.102s |
| [32, 3, 224, 224] | Default 4 | 5 | 2.122s | 671.122µs | 2.129s | 2.090s | 2.161s |
| [32, 3, 224, 224] | Autotune 5 | 5 | 2.083s | 41.796µs | 2.086s | 2.075s | 2.092s |
| [32, 3, 224, 224] | Default 5 | 5 | 2.100s | 173.529µs | 2.102s | 2.076s | 2.115s |
| [32, 3, 224, 224] | Autotune 6 | 5 | 2.112s | 282.122µs | 2.118s | 2.083s | 2.128s |
| [32, 3, 224, 224] | Default 6 | 5 | 2.110s | 785.019µs | 2.096s | 2.083s | 2.147s |
| [32, 3, 224, 224] | Autotune 7 | 5 | 2.120s | 131.166µs | 2.118s | 2.102s | 2.135s |
| [32, 3, 224, 224] | Default 7 | 5 | 2.141s | 737.797µs | 2.127s | 2.115s | 2.189s |
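For reference, here is a rough sketch of the benchmark body described in note 3. The weight shape `[64, 3, 3, 3]` (output channels and kernel size) is my own assumption, and the exact import paths and `random` signature vary across Burn versions, so treat this as illustrative only:

```rust
use burn::tensor::{backend::Backend, module::conv2d, ops::ConvOptions, Distribution, Tensor};

/// One forward conv2d pass over a [32, 3, 224, 224] input with the options
/// listed above. No bias tensor, matching the benchmark setup.
fn bench_conv2d<B: Backend>() {
    let x = Tensor::<B, 4>::random([32, 3, 224, 224], Distribution::Default);
    // Assumed weight layout: 64 output channels, 3x3 kernel.
    let weight = Tensor::<B, 4>::random([64, 3, 3, 3], Distribution::Default);
    let options = ConvOptions::new([1, 1], [1, 1], [1, 1], 1);

    let output = conv2d(x, weight, None, options);
    println!("output shape: {:?}", output.dims());
}
```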

Implementation

I'll link a PR that's a major WIP and brute-forces the ability to autotune `kernel::conv::conv2d`. I think we'll need to sneak a thin abstraction layer in between `module_ops::conv2d` and `kernel::conv::conv2d`.
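As a rough illustration of what that layer might look like, here is a sketch of a dispatcher that `module_ops::conv2d` could call instead of hitting `kernel::conv::conv2d` directly. The names, fields, and caching strategy are purely hypothetical, not the actual burn-wgpu internals:

```rust
use std::collections::HashMap;

/// Key used to cache tuning results; in practice more fields (channels,
/// kernel size, conv options) would likely be included.
#[derive(Hash, PartialEq, Eq, Clone)]
struct Conv2dKey {
    batch_size: usize,
    height: usize,
    width: usize,
}

/// Which kernel variant to dispatch for a given input shape.
#[derive(Clone, Copy)]
enum Conv2dKernel {
    Default,
    Tuned { workgroup_size: (u32, u32), unroll: bool },
}

/// Thin layer between module_ops::conv2d and the kernel: on a cache miss it
/// runs the tuner once, then reuses the winning variant for that shape.
#[derive(Default)]
struct Conv2dDispatcher {
    cache: HashMap<Conv2dKey, Conv2dKernel>,
}

impl Conv2dDispatcher {
    fn select(&mut self, key: Conv2dKey, tune: impl FnOnce() -> Conv2dKernel) -> Conv2dKernel {
        *self.cache.entry(key).or_insert_with(tune)
    }
}
```

The point of the layer is just that autotuning happens once per shape and the rest of the call path stays unchanged.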

nathanielsimard commented 1 year ago

Thanks @agelas for this exploration! We are currently refactoring the wgpu backend to reduce allocations, which will probably help a lot with reducing the variance. We are also thinking of adding an abstraction over autotuning; it isn't very convenient to autotune new operations at the moment :/