pytorch / ao

Create and integrate custom data types, layouts and kernels with up to 2x speedups and 65% less VRAM for inference and training

Run semi-structured sparse benchmarks on consumer hardware #174

Open jcaip opened 2 months ago

jcaip commented 2 months ago

2:4 sparsity is only supported on Ampere+. We've only run benchmarks on A100s, but Phil (@philipbutler) has access to consumer GPUs that could also take advantage of sparse acceleration.
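If you're not sure whether a card qualifies, here's a quick sketch of a check (Ampere corresponds to compute capability 8.0, so anything reporting SM 8.0 or newer from torch.cuda.get_device_capability should be fine):

    import torch

    # 2:4 sparse acceleration needs Ampere or newer, i.e. compute capability >= 8.0
    major, minor = torch.cuda.get_device_capability()
    supported = (major, minor) >= (8, 0)
    print(f"SM {major}.{minor}: 2:4 sparsity {'should be supported' if supported else 'not supported'}")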

Steps to get numbers:

  1. Install the PyTorch pip nightlies from here
  2. Verify that your consumer GPU supports semi-structured sparsity:
    import torch
    from torch.sparse import to_sparse_semi_structured
    to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
  3. Clone pytorch and get the benchmark script:
  4. Run the benchmarks. For now, let's see if the nvidia-fixed-mn / nvidia-fixed-k benchmarks still show speedups.
    python benchmarks/sparse/benchmark_semi_structured_sparsity.py  --mode nvidia-fixed-k --dtype bfloat16 --backend cutlass
    python benchmarks/sparse/benchmark_semi_structured_sparsity.py  --mode nvidia-fixed-mn --dtype bfloat16 --backend cutlass

Afterwards, it would be great to get benchmarks for the ViT-B shapes found here: https://github.com/pytorch/ao/blob/main/benchmarks/sam_vit_b_shapes.csv
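For reference, here's a rough standalone sketch of how one could time a single ViT-B shape outside the benchmark script (a hypothetical harness, not the script itself: the m/k/n values are hardcoded from the CSV, the 2:4 pattern is synthetic, and timing uses torch.utils.benchmark):

    import torch
    from torch.sparse import to_sparse_semi_structured
    from torch.utils import benchmark

    # One shape in the ballpark of the ViT-B rows; check the CSV header for which dim is which.
    m, k, n = 32768, 3072, 768
    dtype = torch.bfloat16

    # (m, k) matrix with a valid 2:4 pattern (2 nonzeros per group of 4 elements).
    A = torch.Tensor([0, 0, 1, 1]).tile((m, k // 4)).to(dtype).cuda()
    B = torch.rand(k, n, dtype=dtype, device="cuda")
    A_sparse = to_sparse_semi_structured(A)

    dense = benchmark.Timer("torch.mm(A, B)", globals={"torch": torch, "A": A, "B": B}).blocked_autorange()
    sparse = benchmark.Timer("torch.mm(A_sparse, B)", globals={"torch": torch, "A_sparse": A_sparse, "B": B}).blocked_autorange()
    print(f"dense {dense.median * 1e3:.3f} ms | sparse {sparse.median * 1e3:.3f} ms | speedup {dense.median / sparse.median:.2f}x")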

philipbutler commented 2 months ago

Had to set up this PC, so I did a clean Python install, and noticed that neither pandas nor tqdm is in requirements.txt.

philipbutler commented 2 months ago

The benchmark command should use --dtype bf16

philipbutler commented 2 months ago

Ran into RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported

~~Consider adding install CUDA 12.1 and the CUTLASS Quickstart to the steps. Running through it now!~~ (I'm confused rn)

philipbutler commented 2 months ago

Actually, @jcaip, does it make sense that to_sparse_semi_structured(torch.ones(256, 256).half().cuda()) works, but running the first benchmark script shows RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported ?
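For anyone else hitting this, the conversion call alone doesn't seem to exercise the CUTLASS matmul kernels, so a closer reproduction of what the benchmark does is to follow the conversion with an actual matmul (a sketch, using the torch.mm path documented for SparseSemiStructuredTensor):

    import torch
    from torch.sparse import to_sparse_semi_structured

    # Same 256x256 size as step 2, but with a real 2:4 pattern and an actual matmul.
    A = torch.Tensor([0, 0, 1, 1]).tile((256, 64)).half().cuda()
    B = torch.rand(256, 256).half().cuda()

    A_sparse = to_sparse_semi_structured(A)  # conversion alone worked for me
    out = torch.mm(A_sparse, B)              # presumably where "CUTLASS not supported" gets raised
    print(out.shape)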

jcaip commented 2 months ago

That's strange to me @philipbutler, let me think for a bit.

Can you open powershell and run nvidia-smi and screenshot the results?

philipbutler commented 2 months ago

@jcaip [screenshot of nvidia-smi output]

jcaip commented 2 months ago

@philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?

I think this might be an issue with windows, but I'm not sure.

philipbutler commented 2 months ago

@jcaip Just to make this as easy as possible for future benchmarking, step 2 should say:

    import torch
    from torch.sparse import to_sparse_semi_structured
    to_sparse_semi_structured(torch.ones(256, 256).half().cuda())

philipbutler commented 2 months ago

> @philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?
>
> I think this might be an issue with windows, but I'm not sure.

@jcaip Same error with the 2.3 release

gau-nernst commented 2 months ago

RTX 4070 Ti Super, Ubuntu 22.04, torch==2.4.0.dev20240426+cu121, bfloat16, CUTLASS backend.

Fixed k

| m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 3072 | 3072 | 10240 | 1.10574 | 2.131 | 1.92722 |
| 4096 | 4096 | 10240 | 1.9605 | 3.73044 | 1.9028 |
| 5120 | 5120 | 10240 | 3.12083 | 6.10269 | 1.95547 |
| 6144 | 6144 | 10240 | 4.74411 | 8.79509 | 1.8539 |
| 7168 | 7168 | 10240 | 7.29741 | 11.9486 | 1.63738 |
| 8192 | 8192 | 10240 | 10.6073 | 15.4296 | 1.45462 |
| 9216 | 9216 | 10240 | 13.6835 | 19.1741 | 1.40125 |
| 10240 | 10240 | 10240 | 16.8367 | 23.4461 | 1.39256 |
| 11264 | 11264 | 10240 | 20.37 | 28.2801 | 1.38832 |
| 12288 | 12288 | 10240 | 24.1402 | 33.545 | 1.38959 |
| 13312 | 13312 | 10240 | 28.4292 | 39.2493 | 1.3806 |
| 14336 | 14336 | 10240 | 32.851 | 45.5614 | 1.38691 |
| 15360 | 15360 | 10240 | 37.7906 | 54.6426 | 1.44593 |
| 16384 | 16384 | 10240 | 42.789 | 63.5041 | 1.48412 |
| 17408 | 17408 | 10240 | 48.5377 | 69.684 | 1.43567 |
| 18432 | 18432 | 10240 | 54.2561 | 77.7116 | 1.43231 |
| 19456 | 19456 | 10240 | 60.3411 | 85.183 | 1.41169 |
| 20480 | 20480 | 10240 | 66.7151 | 97.5466 | 1.46214 |

Fixed mn

| m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 10240 | 10240 | 2560 | 3.12135 | 6.23817 | 1.99855 |
| 10240 | 10240 | 3840 | 4.59394 | 9.28166 | 2.02041 |
| 10240 | 10240 | 5120 | 7.15086 | 12.251 | 1.71322 |
| 10240 | 10240 | 6400 | 10.5324 | 14.7059 | 1.39625 |
| 10240 | 10240 | 7680 | 13.0499 | 18.0573 | 1.38372 |
| 10240 | 10240 | 8960 | 15.3995 | 20.6897 | 1.34353 |
| 10240 | 10240 | 10240 | 16.8406 | 23.4697 | 1.39364 |
| 10240 | 10240 | 11520 | 19.2673 | 26.2984 | 1.36493 |
| 10240 | 10240 | 12800 | 20.9322 | 29.0503 | 1.38782 |
| 10240 | 10240 | 14080 | 23.14 | 31.9612 | 1.38121 |
| 10240 | 10240 | 15360 | 25.6844 | 34.6865 | 1.35049 |
| 10240 | 10240 | 16640 | 26.2421 | 37.4893 | 1.42859 |
| 10240 | 10240 | 17920 | 30.1967 | 40.3297 | 1.33556 |
| 10240 | 10240 | 19200 | 32.4673 | 43.1666 | 1.32954 |
| 10240 | 10240 | 20480 | 33.5382 | 46.002 | 1.37163 |

SAM ViT-B shapes

| m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 32768 | 768 | 3072 | 1.22253 | 1.7901 | 1.46426 |
| 32768 | 2304 | 768 | 0.787232 | 1.33425 | 1.69486 |
| 32768 | 3072 | 768 | 1.04701 | 1.74003 | 1.66191 |
| 32768 | 768 | 768 | 0.271155 | 0.437884 | 1.61488 |
| 39200 | 2304 | 768 | 0.948154 | 1.5765 | 1.66271 |
| 39200 | 768 | 768 | 0.324627 | 0.510302 | 1.57196 |

I omitted some redundant columns from the saved CSV file; the correct and contiguous columns are all True.

msaroufim commented 2 months ago

Nice work @gau-nernst, pretty cool to see results that are uniformly faster. @philipbutler I would highly recommend using WSL or dual booting (I personally dual boot); getting Windows and CUDA to work together is just not worth it.

jcaip commented 2 months ago

@gau-nernst 💯 Thanks for running these - that's awesome! For others reading, I'd like to collect these, along with our A100 results, somewhere. So please contribute and I'll collate these together in a nice doc. We can collect block-sparse microbenchmarks too; I know @cpuhrsch is interested in those.

@philipbutler Thank you for giving it a shot + your edits were super helpful too :) Yeah, I agree with Mark that dual booting Linux is probably the easiest solution - but could you open an issue in pytorch for tracking purposes (feel free to tag me) about the lack of Windows support for semi-structured sparsity?

philipbutler commented 2 weeks ago

I'm back lol. @msaroufim I have joined you in dual booting

NVIDIA GeForce RTX 3060, Ubuntu 24.04, torch==2.4.0.dev20240604+cu121

Fixed k

| m | k | n | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 3072 | 10240 | 3072 | 3.749355 | 7.256761 | 1.935469 |
| 4096 | 10240 | 4096 | 6.678134 | 13.164187 | 1.971237 |
| 5120 | 10240 | 5120 | 10.565052 | 20.252486 | 1.916932 |
| 6144 | 10240 | 6144 | 15.589268 | 28.900475 | 1.853870 |
| 7168 | 10240 | 7168 | 21.814860 | 42.035703 | 1.926930 |
| 8192 | 10240 | 8192 | 35.252837 | 65.011371 | 1.844146 |
| 9216 | 10240 | 9216 | 36.577059 | 63.589550 | 1.738509 |
| 10240 | 10240 | 10240 | 45.712477 | 78.786396 | 1.723521 |
| 11264 | 10240 | 11264 | 54.966579 | 95.234777 | 1.732594 |
| 12288 | 10240 | 12288 | 66.754359 | 113.816444 | 1.705004 |
| 13312 | 10240 | 13312 | 77.615483 | 132.878653 | 1.712012 |
| 14336 | 10240 | 14336 | 88.930020 | 153.554204 | 1.726686 |
| 15360 | 10240 | 15360 | 104.564087 | 176.714434 | 1.690011 |
| 16384 | 10240 | 16384 | 117.693106 | 200.706747 | 1.705340 |
| 17408 | 10240 | 17408 | 133.979721 | 226.706458 | 1.692095 |
| 18432 | 10240 | 18432 | 154.624529 | 254.379024 | 1.645140 |
| 19456 | 10240 | 19456 | 176.906274 | 285.979967 | 1.616562 |
| 20480 | 10240 | 20480 | 220.200146 | 353.289990 | 1.604404 |

Fixed mn

| m | k | n | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 10240 | 2560 | 10240 | 10.746263 | 20.144145 | 1.874526 |
| 10240 | 3840 | 10240 | 16.097398 | 29.854866 | 1.854639 |
| 10240 | 5120 | 10240 | 21.708938 | 42.514653 | 1.958394 |
| 10240 | 6400 | 10240 | 28.194147 | 51.166154 | 1.814779 |
| 10240 | 7680 | 10240 | 33.638563 | 59.061538 | 1.755769 |
| 10240 | 8960 | 10240 | 39.498353 | 68.803321 | 1.741929 |
| 10240 | 10240 | 10240 | 45.607403 | 78.554697 | 1.722411 |
| 10240 | 11520 | 10240 | 51.829187 | 88.352723 | 1.704691 |
| 10240 | 12800 | 10240 | 57.777900 | 98.682663 | 1.707966 |
| 10240 | 14080 | 10240 | 64.676653 | 107.832529 | 1.667256 |
| 10240 | 15360 | 10240 | 71.463405 | 117.638282 | 1.646133 |
| 10240 | 16640 | 10240 | 74.602912 | 127.399095 | 1.707696 |
| 10240 | 17920 | 10240 | 84.782167 | 138.429159 | 1.632763 |
| 10240 | 19200 | 10240 | 90.615144 | 147.502713 | 1.627793 |
| 10240 | 20480 | 10240 | 97.573600 | 177.413003 | 1.818248 |

SAM ViT-B shapes

| m | k | n | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 32768 | 3072 | 768 | 3.106270 | 6.297788 | 2.027444 |
| 32768 | 768 | 2304 | 2.698760 | 4.917082 | 1.821978 |
| 32768 | 768 | 3072 | 3.599539 | 6.097759 | 1.694039 |
| 32768 | 768 | 768 | 0.908029 | 1.753664 | 1.931286 |
| 39200 | 768 | 2304 | 3.648182 | 5.655835 | 1.550316 |
| 39200 | 768 | 768 | 1.087033 | 1.838328 | 1.691143 |