jcaip opened this issue 2 months ago
Had to set up this PC, so I did a clean Python install, and noticed that neither `pandas` nor `tqdm` is in `requirements.txt`.
The benchmark command should use `--dtype bf16`.
Ran into `RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported`.
~~Consider adding "install CUDA 12.1" and the CUTLASS Quickstart to the steps. Running through it now!~~ (I'm confused rn)
Actually, @jcaip, does it make sense that

```python
to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
```

works, but running the first benchmark script raises `RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported`?
That's strange to me @philipbutler let me think for a bit
Can you open PowerShell, run `nvidia-smi`, and screenshot the results?
@jcaip
@philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?
I think this might be an issue with Windows, but I'm not sure.
@jcaip Just making this as easy as possible for future benchmarking, step 2 should say:

```python
import torch
from torch.sparse import to_sparse_semi_structured

to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
```
> @philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies? I think this might be an issue with Windows, but I'm not sure.
@jcaip Same error with the 2.3 release
4070 Ti Super, Ubuntu 22.04, `torch==2.4.0.dev20240426+cu121`, bfloat16, CUTLASS backend.
Fixed k
| | m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|---|
0 | 3072 | 3072 | 10240 | 1.10574 | 2.131 | 1.92722 |
1 | 4096 | 4096 | 10240 | 1.9605 | 3.73044 | 1.9028 |
2 | 5120 | 5120 | 10240 | 3.12083 | 6.10269 | 1.95547 |
3 | 6144 | 6144 | 10240 | 4.74411 | 8.79509 | 1.8539 |
4 | 7168 | 7168 | 10240 | 7.29741 | 11.9486 | 1.63738 |
5 | 8192 | 8192 | 10240 | 10.6073 | 15.4296 | 1.45462 |
6 | 9216 | 9216 | 10240 | 13.6835 | 19.1741 | 1.40125 |
7 | 10240 | 10240 | 10240 | 16.8367 | 23.4461 | 1.39256 |
8 | 11264 | 11264 | 10240 | 20.37 | 28.2801 | 1.38832 |
9 | 12288 | 12288 | 10240 | 24.1402 | 33.545 | 1.38959 |
10 | 13312 | 13312 | 10240 | 28.4292 | 39.2493 | 1.3806 |
11 | 14336 | 14336 | 10240 | 32.851 | 45.5614 | 1.38691 |
12 | 15360 | 15360 | 10240 | 37.7906 | 54.6426 | 1.44593 |
13 | 16384 | 16384 | 10240 | 42.789 | 63.5041 | 1.48412 |
14 | 17408 | 17408 | 10240 | 48.5377 | 69.684 | 1.43567 |
15 | 18432 | 18432 | 10240 | 54.2561 | 77.7116 | 1.43231 |
16 | 19456 | 19456 | 10240 | 60.3411 | 85.183 | 1.41169 |
17 | 20480 | 20480 | 10240 | 66.7151 | 97.5466 | 1.46214 |
Fixed mn
| | m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|---|
0 | 10240 | 10240 | 2560 | 3.12135 | 6.23817 | 1.99855 |
1 | 10240 | 10240 | 3840 | 4.59394 | 9.28166 | 2.02041 |
2 | 10240 | 10240 | 5120 | 7.15086 | 12.251 | 1.71322 |
3 | 10240 | 10240 | 6400 | 10.5324 | 14.7059 | 1.39625 |
4 | 10240 | 10240 | 7680 | 13.0499 | 18.0573 | 1.38372 |
5 | 10240 | 10240 | 8960 | 15.3995 | 20.6897 | 1.34353 |
6 | 10240 | 10240 | 10240 | 16.8406 | 23.4697 | 1.39364 |
7 | 10240 | 10240 | 11520 | 19.2673 | 26.2984 | 1.36493 |
8 | 10240 | 10240 | 12800 | 20.9322 | 29.0503 | 1.38782 |
9 | 10240 | 10240 | 14080 | 23.14 | 31.9612 | 1.38121 |
10 | 10240 | 10240 | 15360 | 25.6844 | 34.6865 | 1.35049 |
11 | 10240 | 10240 | 16640 | 26.2421 | 37.4893 | 1.42859 |
12 | 10240 | 10240 | 17920 | 30.1967 | 40.3297 | 1.33556 |
13 | 10240 | 10240 | 19200 | 32.4673 | 43.1666 | 1.32954 |
14 | 10240 | 10240 | 20480 | 33.5382 | 46.002 | 1.37163 |
SAM ViT-B shapes
| | m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|---|
0 | 32768 | 768 | 3072 | 1.22253 | 1.7901 | 1.46426 |
1 | 32768 | 2304 | 768 | 0.787232 | 1.33425 | 1.69486 |
2 | 32768 | 3072 | 768 | 1.04701 | 1.74003 | 1.66191 |
3 | 32768 | 768 | 768 | 0.271155 | 0.437884 | 1.61488 |
4 | 39200 | 2304 | 768 | 0.948154 | 1.5765 | 1.66271 |
5 | 39200 | 768 | 768 | 0.324627 | 0.510302 | 1.57196 |
I omit some redundant columns from the saved CSV file; the `correct` and `contiguous` columns are all True.
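For anyone scripting the same cleanup, here is a minimal sketch (the helper name and the sample frame are illustrative, not from the benchmark script) of dropping boolean columns whose values are all True before saving, assuming the results live in a pandas DataFrame:

```python
import pandas as pd

def drop_all_true_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Drop boolean columns where every row is True (e.g. sanity-check
    # columns like `correct` / `contiguous`) since they carry no signal.
    keep = [c for c in df.columns
            if not (df[c].dtype == bool and df[c].all())]
    return df[keep]

results = pd.DataFrame({
    "m": [3072, 4096],
    "sparse_latency (ms)": [1.10574, 1.9605],
    "correct": [True, True],
    "contiguous": [True, True],
})
cleaned = drop_all_true_columns(results)
cleaned.to_csv("results.csv", index=False)
```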
Nice work @gau-nernst, pretty cool to see results that seem uniformly faster. @philipbutler I would highly recommend using WSL or dual booting (I personally dual boot); getting Windows and CUDA to work together is just not worth it.
@gau-nernst 💯 Thanks for running these - that's awesome! For others reading, I'd like to collect these together with our A100 results somewhere, so please contribute and I'll collate everything into a nice doc. We can also collect block-sparse microbenchmarks; I know @cpuhrsch is interested in those.
@philipbutler Thank you for giving it a shot, and your edits were super helpful too :) . Yeah, I think I agree with Mark that dual booting Linux is probably the easiest solution - but could you open an issue in PyTorch for tracking purposes (feel free to tag me) about the lack of Windows support for semi-structured sparsity?
I'm back lol. @msaroufim I have joined you in dual booting
NVIDIA GeForce RTX 3060, Ubuntu 24.04, `torch==2.4.0.dev20240604+cu121`
Fixed k

m | k | n | sparse_latency (ms) | dense_latency (ms) | speedup (d/s)
---|---|---|---|---|---
3072 | 10240 | 3072 | 3.749355 | 7.256761 | 1.935469 | |
4096 | 10240 | 4096 | 6.678134 | 13.164187 | 1.971237 | |
5120 | 10240 | 5120 | 10.565052 | 20.252486 | 1.916932 | |
6144 | 10240 | 6144 | 15.589268 | 28.900475 | 1.853870 | |
7168 | 10240 | 7168 | 21.814860 | 42.035703 | 1.926930 | |
8192 | 10240 | 8192 | 35.252837 | 65.011371 | 1.844146 | |
9216 | 10240 | 9216 | 36.577059 | 63.589550 | 1.738509 | |
10240 | 10240 | 10240 | 45.712477 | 78.786396 | 1.723521 | |
11264 | 10240 | 11264 | 54.966579 | 95.234777 | 1.732594 | |
12288 | 10240 | 12288 | 66.754359 | 113.816444 | 1.705004 | |
13312 | 10240 | 13312 | 77.615483 | 132.878653 | 1.712012 | |
14336 | 10240 | 14336 | 88.930020 | 153.554204 | 1.726686 | |
15360 | 10240 | 15360 | 104.564087 | 176.714434 | 1.690011 | |
16384 | 10240 | 16384 | 117.693106 | 200.706747 | 1.705340 | |
17408 | 10240 | 17408 | 133.979721 | 226.706458 | 1.692095 | |
18432 | 10240 | 18432 | 154.624529 | 254.379024 | 1.645140 | |
19456 | 10240 | 19456 | 176.906274 | 285.979967 | 1.616562 | |
20480 | 10240 | 20480 | 220.200146 | 353.289990 | 1.604404 |
Fixed mn

m | k | n | sparse_latency (ms) | dense_latency (ms) | speedup (d/s)
---|---|---|---|---|---
10240 | 2560 | 10240 | 10.746263 | 20.144145 | 1.874526 | |
10240 | 3840 | 10240 | 16.097398 | 29.854866 | 1.854639 | |
10240 | 5120 | 10240 | 21.708938 | 42.514653 | 1.958394 | |
10240 | 6400 | 10240 | 28.194147 | 51.166154 | 1.814779 | |
10240 | 7680 | 10240 | 33.638563 | 59.061538 | 1.755769 | |
10240 | 8960 | 10240 | 39.498353 | 68.803321 | 1.741929 | |
10240 | 10240 | 10240 | 45.607403 | 78.554697 | 1.722411 | |
10240 | 11520 | 10240 | 51.829187 | 88.352723 | 1.704691 | |
10240 | 12800 | 10240 | 57.777900 | 98.682663 | 1.707966 | |
10240 | 14080 | 10240 | 64.676653 | 107.832529 | 1.667256 | |
10240 | 15360 | 10240 | 71.463405 | 117.638282 | 1.646133 | |
10240 | 16640 | 10240 | 74.602912 | 127.399095 | 1.707696 | |
10240 | 17920 | 10240 | 84.782167 | 138.429159 | 1.632763 | |
10240 | 19200 | 10240 | 90.615144 | 147.502713 | 1.627793 | |
10240 | 20480 | 10240 | 97.573600 | 177.413003 | 1.818248 |
SAM ViT-B shapes

m | k | n | sparse_latency (ms) | dense_latency (ms) | speedup (d/s)
---|---|---|---|---|---
32768 | 3072 | 768 | 3.106270 | 6.297788 | 2.027444 | |
32768 | 768 | 2304 | 2.698760 | 4.917082 | 1.821978 | |
32768 | 768 | 3072 | 3.599539 | 6.097759 | 1.694039 | |
32768 | 768 | 768 | 0.908029 | 1.753664 | 1.931286 | |
39200 | 768 | 2304 | 3.648182 | 5.655835 | 1.550316 | |
39200 | 768 | 768 | 1.087033 | 1.838328 | 1.691143 |
2:4 sparsity is only supported on Ampere and newer GPUs. We've only run benchmarks on A100s, but Phil (@philipbutler) has access to consumer GPUs that could also take advantage of sparse acceleration.
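For readers checking whether their card qualifies: "Ampere and newer" translates to CUDA compute capability 8.0 or higher. A minimal sketch (the helper function is illustrative; `torch.cuda.get_device_capability` is the real PyTorch call you would feed it from):

```python
def supports_2_4_sparsity(major: int, minor: int) -> bool:
    # 2:4 semi-structured sparsity needs the sparse tensor cores
    # introduced with Ampere (compute capability 8.0).
    return (major, minor) >= (8, 0)

# On a machine with a CUDA build of PyTorch you would query the
# real capability:
#   import torch
#   major, minor = torch.cuda.get_device_capability()

print(supports_2_4_sparsity(8, 6))  # RTX 3060 (Ampere) -> True
print(supports_2_4_sparsity(7, 5))  # RTX 2080 (Turing) -> False
```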
Steps to get numbers:
Afterwards, it would be great to get benchmarks for the ViT-B shapes found here: https://github.com/pytorch/ao/blob/main/benchmarks/sam_vit_b_shapes.csv
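As a rough illustration of the sweeps behind the tables above (the start/stop/step values are inferred from the posted results, so treat them as assumptions), here is a sketch that generates the (m, n, k) shapes and computes the speedup column; each shape would then be timed with a dense matmul and a `to_sparse_semi_structured` matmul, e.g. via `torch.utils.benchmark`:

```python
def fixed_k_shapes(k=10240, start=3072, stop=20480, step=1024):
    # "Fixed k" sweep: m = n grows while k stays constant.
    return [(mn, mn, k) for mn in range(start, stop + 1, step)]

def fixed_mn_shapes(mn=10240, start=2560, stop=20480, step=1280):
    # "Fixed mn" sweep: m = n stays constant while k grows.
    return [(mn, mn, k) for k in range(start, stop + 1, step)]

def speedup(dense_ms, sparse_ms):
    # The speedup (d/s) column: dense latency divided by sparse latency.
    return dense_ms / sparse_ms

print(len(fixed_k_shapes()))  # 18 shapes, matching the first table
print(fixed_k_shapes()[0])    # (3072, 3072, 10240)
```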