nod-ai / SHARK-ModelDev

Unified compiler/runtime for interfacing with PyTorch Dynamo.
Apache License 2.0

[tuning] Tune int8 model dispatches #751

Closed MaheshRavishankar closed 4 months ago

antiagainst commented 4 months ago

Update 7/8: need to extend to support int8/int32 and batch mmt.

kuhar commented 4 months ago

cc: @RattataKing

kuhar commented 4 months ago

Matmul Tuning Progress

Matmul (Broadcast RHS)

| No | Done? | Shape | Time [ms] | Improvement [ms] | Commit |
|---|---|---|---|---|---|
| 1 | Yes | matmul_like_2x1024x10240x1280_i8xi8xi32 | 6.154298 | 0.7 | https://github.com/nod-ai/sdxl-scripts/commit/bec620efebcb2d57a62e69a1f2ff8a6b0e18e7a0 |
| 2 | Yes | matmul_like_2x1024x1280x5120_i8xi8xi32 | 3.894541 | 0.8 | https://github.com/nod-ai/sdxl-scripts/commit/f4aa06df048e272d84301d212523a0af065b04da |
| 4 | Yes | matmul_like_2x1024x1280x1280_i8xi8xi32 | 2.884773 | 0.3 | https://github.com/nod-ai/sdxl-scripts/commit/f1adf620cda5b8744713e4f38bda182747bd0663 |
| 6 | Yes | matmul_like_2x4096x5120x640_i8xi8xi32 | 1.128907 | 0.1 | https://github.com/nod-ai/sdxl-scripts/commit/0d7436d36804af7b45fcbf50b258bdb32b8ca74b |
| 9 | Yes | matmul_like_2x4096x640x640_i8xi8xi32 | 0.490236 | 0.1 | https://github.com/nod-ai/sdxl-scripts/commit/2d5bc0ba6a720761f3ef90bf533be245a3508c25 |
| 10 | Dupe | matmul_like_2x1024x1280x5120_i8xi8xi32 | 0.445312 | | |
| 11 | Yes | matmul_like_2x4096x640x2560_i8xi8xi32 | 0.339843 | 0.2 | https://github.com/nod-ai/sdxl-scripts/commit/3e6f2cfd719da8397bc43cf3e81b3204d962797d |
| 12 | Dupe | matmul_like_2x4096x640x2560_i8xi8xi32 | 0.335937 | | |

Cumulatively, the measured speedup is 1.9 ms (55.5 ms - 53.6 ms) ==> 3.4%.
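
As a quick cross-check of the arithmetic above, here is a minimal Python sketch. It assumes the percentage is taken relative to the 55.5 ms baseline, and it copies the per-dispatch numbers from the "Improvement [ms]" column of the matmul table. The helper name `relative_speedup` is made up for illustration and is not part of any tuning script.

```python
# Per-dispatch improvements [ms], copied from the matmul table above.
improvements_ms = [0.7, 0.8, 0.3, 0.1, 0.1, 0.2]


def relative_speedup(before_ms: float, after_ms: float) -> float:
    """Return the time saved as a percentage of the baseline (before) time."""
    return (before_ms - after_ms) / before_ms * 100.0


summed_ms = sum(improvements_ms)   # sum of the per-dispatch estimates
measured_ms = 55.5 - 53.6          # end-to-end difference quoted in the thread

print(f"sum of per-dispatch improvements: {summed_ms:.1f} ms")
print(f"measured end-to-end speedup:      {measured_ms:.1f} ms")
print(f"relative speedup:                 {relative_speedup(55.5, 53.6):.1f}%")
```

The per-dispatch estimates add up to slightly more than the measured end-to-end difference, which is not surprising given that each row is rounded to 0.1 ms.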

kuhar commented 4 months ago

Conv Tuning Progress

Conv 2D

| No | Done? | Shape | Time [ms] | Improvement [ms] | Commit |
|---|---|---|---|---|---|
| 1 | Yes | conv_2d_nhwc_hwcf_2x32x32x1280x3x3x1280_i8xi8xi32 | 1.023439 | 0.2 | https://github.com/nod-ai/sdxl-scripts/commit/037c5c12018eee3925ac8a925aacccf94220357d |
| 2 | Yes | conv_2d_nhwc_hwcf_2x32x32x1280x3x3x2560_i8xi8xi32 | 0.982422 | 0.3 | https://github.com/nod-ai/sdxl-scripts/commit/2de30d1cbdf10520f0d28bb60ddd8b8d201ba992 |
| 3 | Yes | conv_2d_nhwc_hwcf_2x64x64x1280x3x3x1280_i8xi8xi32 | 0.923829 | 0.2 | https://github.com/nod-ai/sdxl-scripts/commit/332aeffc66fb1cf52a3626ea00a3dac159e22363 |
| 4 | Unsuccessful | conv_2d_nhwc_hwcf_2x64x64x640x3x3x640_i8xi8xi32 | 0.781249 | | |
| 5 | Dupe | conv_2d_nhwc_hwcf_2x32x32x1280x3x3x1280_i8xi8xi32 | 0.771484 | | |
| 6 | Yes | conv_2d_nhwc_hwcf_2x128x128x320x3x3x640_i8xi8xi32 | 0.740235 | 0.1 | https://github.com/nod-ai/sdxl-scripts/commit/be1fa8793f80c754eb8e42a202cb0087908f8962 |
| 7 | Wrong pipeline | conv_2d_nhwc_hwcf_2x32x32x640x3x3x640_i8xi8xi32 | 0.699219 | | |
| 8 | Yes | conv_2d_nhwc_hwcf_2x128x128x640x3x3x640_i8xi8xi32 | 0.6875 | 0.3 | https://github.com/nod-ai/sdxl-scripts/commit/cd0fc0c78518543b901ea01dc98fe298abf3e165 |
| 9 | Yes | conv_2d_nhwc_hwcf_2x128x128x320x3x3x960_i8xi8xi32 | 0.570312 | 0.1 | https://github.com/nod-ai/sdxl-scripts/commit/b25a124d6139b1af8129d482efdf1d5f86f5508e |
| 10 | Yes | conv_2d_nhwc_hwcf_2x128x128x320x3x3x320_i8xi8xi32 | 0.5625 | 0.1 | https://github.com/nod-ai/sdxl-scripts/commit/b9623aa8030058eaab13d1cb746c1ae4b67073e7 |
| 11 | Unsuccessful | conv_2d_nhwc_hwcf_2x64x64x640x3x3x1920_i8xi8xi32 | 0.515625 | | |
| 12 | Dupe | conv_2d_nhwc_hwcf_2x32x32x1280x3x3x1280_i8xi8xi32 | 0.501953 | | |
| 13 | | conv_2d_nhwc_hwcf_2x64x64x320x3x3x320_i8xi8xi32 | 0.382812 | | |
| 14 | | conv_2d_nhwc_hwcf_2x32x32x1280x3x3x1920_i8xi8xi32 | 0.378906 | | |
| 15 | Dupe | conv_2d_nhwc_hwcf_2x128x128x320x3x3x320_i8xi8xi32 | 0.375 | | |
| 16 | | conv_2d_nhwc_hwcf_2x64x64x640x3x3x1280_i8xi8xi32 | 0.335937 | | |
| 17 | Dupe | conv_2d_nhwc_hwcf_2x32x32x1280x3x3x1280_i8xi8xi32 | 0.259766 | | |
| 18 | | conv_2d_nhwc_hwcf_2x64x64x640x3x3x960_i8xi8xi32 | 0.25 | | |

Cumulatively, the measured speedup is 1.2 ms (51.8 ms - 50.6 ms) ==> 2.3%.
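
The shape names in the tables above appear to follow the pattern `<op>_<dims joined by x>_<element types joined by x>`. The following Python sketch splits a name into those three parts, which can be handy for spotting the "Dupe" rows programmatically. The pattern is inferred from the names in this thread, not from any documented naming scheme, and `DispatchShape` and `parse_shape_name` are hypothetical helpers.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchShape:
    op: str                      # e.g. "conv_2d_nhwc_hwcf" or "matmul_like"
    dims: tuple[int, ...]        # problem sizes, in the order they appear in the name
    elem_types: tuple[str, ...]  # e.g. ("i8", "i8", "i32")


def parse_shape_name(name: str) -> DispatchShape:
    """Split a shape name, assuming the last two '_' fields are dims and types."""
    op, dims, types = name.rsplit("_", 2)
    return DispatchShape(
        op=op,
        dims=tuple(int(d) for d in dims.split("x")),
        elem_types=tuple(types.split("x")),
    )


shape = parse_shape_name("conv_2d_nhwc_hwcf_2x32x32x1280x3x3x1280_i8xi8xi32")
print(shape.op, shape.dims, shape.elem_types)
```

Because the dataclass is frozen and therefore hashable, parsed shapes can go into a `set` or be used as `dict` keys to group repeated shapes.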

kuhar commented 4 months ago

Newest trace

Overall

image

Matmuls

image

Convs

image

kuhar commented 4 months ago

Newest trace as of https://github.com/nod-ai/sdxl-scripts/commit/0eb7ef0880285958ba8b29f8f886449932ec2190 (no horizontal fusion)

Overall

image

Matmuls

image

Convs

image

kuhar commented 4 months ago

Contractions (no horizontal fusions)

| No | Done? | Shape | Time [ms] | Improvement [ms] | Commit |
|---|---|---|---|---|---|
| 1 | Yes | matmul_like_2x20x1024x64x1280_i8xi8xi32 | 6.551762 | 0.7 | https://github.com/nod-ai/sdxl-scripts/commit/5fd90152552e457e0ae0dd68e19d80e33b08d41f |
| 5 | Yes | matmul_like_2x20x64x64x2048_i8xi8xi32 | 2.011717 | 0.1 | https://github.com/nod-ai/sdxl-scripts/commit/dfe5c6e11a468337136c510044cf7027faeff2ce |
| 7 | Unsuccessful | matmul_like_2x10x4096x64x640_i8xi8xi32 | 0.870121 | | |
| 9 | Yes | matmul_like_2x10x64x64x2048_i8xi8xi32 | 0.344728 | 0.1 | https://github.com/nod-ai/sdxl-scripts/commit/811dcecbb59cbf1a9835a4a13bfa41812bef3c26 |

Contraction Dims

1: [m, n, m, n, k] --lhs-dims=bmk --rhs-dims=nnk --tile-dims='**mnk'
5: [m, n, m, n, k] --lhs-dims=bmk --rhs-dims=nnk --tile-dims='**mnk'
7: [m, n, m, n, k] --lhs-dims=bmk --rhs-dims=nnk --tile-dims='**mnk'
9: [m, n, m, n, k] --lhs-dims=bmk --rhs-dims=nnk --tile-dims='**mnk'
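
One way to read the bracketed labels is as a positional annotation of the dimensions in the shape name, so that entry 1 (`2x20x1024x64x1280` with `[m, n, m, n, k]`) has two M dimensions, two N dimensions, and one K dimension. The thread does not state this mapping explicitly, so treat the Python sketch below as an assumption; `contraction_sizes` is a hypothetical helper that multiplies together the sizes sharing a label.

```python
def contraction_sizes(dims: list[int], labels: list[str]) -> dict[str, int]:
    """Multiply together the sizes of all dimensions that share a label."""
    if len(dims) != len(labels):
        raise ValueError("expected exactly one label per dimension")
    sizes: dict[str, int] = {}
    for size, label in zip(dims, labels):
        sizes[label] = sizes.get(label, 1) * size
    return sizes


# Entry 1: matmul_like_2x20x1024x64x1280_i8xi8xi32 with labels [m, n, m, n, k].
entry_1 = contraction_sizes([2, 20, 1024, 64, 1280], ["m", "n", "m", "n", "k"])
print(entry_1)  # → {'m': 2048, 'n': 1280, 'k': 1280}
```

Under that reading, entry 1 collapses to an effective 2048x1280x1280 (M x N x K) contraction.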

Contractions (horizontal fusions)

| No | Done? | Shape | Time [ms] | Improvement [ms] | Commit |
|---|---|---|---|---|---|
| 3 | Yes | matmul_like_3x2x20x1024x64x1280_i8xi8xi32 | 3.302736 | 0.7 | https://github.com/nod-ai/sdxl-scripts/commit/676a9b93d4e304de577b00fbfd42d012d096ed69 |
| 5 | Same as no fusions | matmul_like_2x20x1024x64x1280_i8xi8xi32 | 1.580081 | | |
| 7 | Yes | matmul_like_2x2x20x64x64x2048_i8xi8xi32 | 0.958984 | 0.2 | https://github.com/nod-ai/sdxl-scripts/commit/963af2d6c72df956315f5512ed7cdcec7e353058 |
| 9 | Unsuccessful | matmul_like_3x2x10x4096x64x640_i8xi8xi32 | 0.458984 | | |

Contraction Dims

3: [n, m, n, m, n, k] --lhs-dims=bmk --rhs-dims=nnnk --tile-dims='**mnk'
5: [m, n, m, n, k] --lhs-dims=bmk --rhs-dims=nnk --tile-dims='**mnk'
7: [n, m, n, m, n, k] --lhs-dims=bmk --rhs-dims=nnnk --tile-dims='**mnk'
9: [n, m, n, m, n, k] --lhs-dims=bmk --rhs-dims=nnnk --tile-dims='**mnk'

kuhar commented 4 months ago

I just double-checked the total gain from tuning, and it should be around 3.8 ms ==> 7.5%.