pytorch / ao

Create and integrate custom data types, layouts and kernels with up to 2x speedups and 65% less VRAM for inference and training

Run semi-structured sparse benchmarks on consumer hardware #174

Open jcaip opened 2 months ago

jcaip commented 2 months ago

2:4 sparsity is only supported on Ampere+. We've only run benchmarks on A100s, but Phil (@philipbutler) has access to consumer GPUs that could also take advantage of sparse acceleration.
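If you're not sure whether a card qualifies, here's a quick sketch of a check (Ampere corresponds to compute capability 8.0, so anything reporting SM 8.0 or newer from torch.cuda.get_device_capability should be fine):

    import torch

    # 2:4 sparse acceleration needs Ampere or newer, i.e. compute capability >= 8.0
    major, minor = torch.cuda.get_device_capability()
    supported = (major, minor) >= (8, 0)
    print(f"SM {major}.{minor}: 2:4 sparsity {'should be supported' if supported else 'not supported'}")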

Steps to get numbers:

  1. Install the PyTorch pip nightlies from here
  2. Verify that your consumer GPU supports semi-structured sparsity:
    import torch
    from torch.sparse import to_sparse_semi_structured
    to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
  3. Clone pytorch and get the benchmark script:
  4. Run the benchmarks. For now, let's see if the nvidia-fixed-mn / nvidia-fixed-k benchmarks still show speedups.
    python benchmarks/sparse/benchmark_semi_structured_sparsity.py  --mode nvidia-fixed-k --dtype bfloat16 --backend cutlass
    python benchmarks/sparse/benchmark_semi_structured_sparsity.py  --mode nvidia-fixed-mn --dtype bfloat16 --backend cutlass

Afterwards, it would be great to get benchmarks for the ViT-B shapes found here: https://github.com/pytorch/ao/blob/main/benchmarks/sam_vit_b_shapes.csv
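For reference, here's a rough standalone sketch of how one could time a single ViT-B shape outside the benchmark script (a hypothetical harness, not the script itself: the m/k/n values are hardcoded from the CSV, the 2:4 pattern is synthetic, and timing uses torch.utils.benchmark):

    import torch
    from torch.sparse import to_sparse_semi_structured
    from torch.utils import benchmark

    # One shape in the ballpark of the ViT-B rows; check the CSV header for which dim is which.
    m, k, n = 32768, 3072, 768
    dtype = torch.bfloat16

    # (m, k) matrix with a valid 2:4 pattern (2 nonzeros per group of 4 elements).
    A = torch.Tensor([0, 0, 1, 1]).tile((m, k // 4)).to(dtype).cuda()
    B = torch.rand(k, n, dtype=dtype, device="cuda")
    A_sparse = to_sparse_semi_structured(A)

    dense = benchmark.Timer("torch.mm(A, B)", globals={"torch": torch, "A": A, "B": B}).blocked_autorange()
    sparse = benchmark.Timer("torch.mm(A_sparse, B)", globals={"torch": torch, "A_sparse": A_sparse, "B": B}).blocked_autorange()
    print(f"dense {dense.median * 1e3:.3f} ms | sparse {sparse.median * 1e3:.3f} ms | speedup {dense.median / sparse.median:.2f}x")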

philipbutler commented 2 months ago

Had to set up this PC, so I did a clean Python install, and noticed that neither pandas nor tqdm is in requirements.txt.

philipbutler commented 2 months ago

The benchmark command should use --dtype bf16

philipbutler commented 2 months ago

Ran into RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported

~~Consider adding install CUDA 12.1 and the CUTLASS Quickstart to the steps. Running through it now!~~ (I'm confused rn)

philipbutler commented 2 months ago

Actually, @jcaip, does it make sense that to_sparse_semi_structured(torch.ones(256, 256).half().cuda()) works, but running the first benchmark script shows RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported ?
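For anyone else hitting this, the conversion call alone doesn't seem to exercise the CUTLASS matmul kernels, so a closer reproduction of what the benchmark does is to follow the conversion with an actual matmul (a sketch, using the torch.mm path documented for SparseSemiStructuredTensor):

    import torch
    from torch.sparse import to_sparse_semi_structured

    # Same 256x256 size as step 2, but with a real 2:4 pattern and an actual matmul.
    A = torch.Tensor([0, 0, 1, 1]).tile((256, 64)).half().cuda()
    B = torch.rand(256, 256).half().cuda()

    A_sparse = to_sparse_semi_structured(A)  # conversion alone worked for me
    out = torch.mm(A_sparse, B)              # presumably where "CUTLASS not supported" gets raised
    print(out.shape)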

jcaip commented 2 months ago

That's strange to me @philipbutler, let me think for a bit.

Can you open powershell and run nvidia-smi and screenshot the results?

philipbutler commented 2 months ago

@jcaip [screenshot of nvidia-smi output]

jcaip commented 2 months ago

@philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?

I think this might be an issue with windows, but I'm not sure.

philipbutler commented 2 months ago

@jcaip Just to make this as easy as possible for future benchmarking, step 2 should say:

    import torch
    from torch.sparse import to_sparse_semi_structured
    to_sparse_semi_structured(torch.ones(256, 256).half().cuda())

philipbutler commented 2 months ago

> @philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?
>
> I think this might be an issue with windows, but I'm not sure.

@jcaip Same error with the 2.3 release

gau-nernst commented 2 months ago

RTX 4070 Ti Super, Ubuntu 22.04, torch==2.4.0.dev20240426+cu121, bfloat16, CUTLASS backend.

Fixed k

| m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 3072 | 3072 | 10240 | 1.10574 | 2.131 | 1.92722 |
| 4096 | 4096 | 10240 | 1.9605 | 3.73044 | 1.9028 |
| 5120 | 5120 | 10240 | 3.12083 | 6.10269 | 1.95547 |
| 6144 | 6144 | 10240 | 4.74411 | 8.79509 | 1.8539 |
| 7168 | 7168 | 10240 | 7.29741 | 11.9486 | 1.63738 |
| 8192 | 8192 | 10240 | 10.6073 | 15.4296 | 1.45462 |
| 9216 | 9216 | 10240 | 13.6835 | 19.1741 | 1.40125 |
| 10240 | 10240 | 10240 | 16.8367 | 23.4461 | 1.39256 |
| 11264 | 11264 | 10240 | 20.37 | 28.2801 | 1.38832 |
| 12288 | 12288 | 10240 | 24.1402 | 33.545 | 1.38959 |
| 13312 | 13312 | 10240 | 28.4292 | 39.2493 | 1.3806 |
| 14336 | 14336 | 10240 | 32.851 | 45.5614 | 1.38691 |
| 15360 | 15360 | 10240 | 37.7906 | 54.6426 | 1.44593 |
| 16384 | 16384 | 10240 | 42.789 | 63.5041 | 1.48412 |
| 17408 | 17408 | 10240 | 48.5377 | 69.684 | 1.43567 |
| 18432 | 18432 | 10240 | 54.2561 | 77.7116 | 1.43231 |
| 19456 | 19456 | 10240 | 60.3411 | 85.183 | 1.41169 |
| 20480 | 20480 | 10240 | 66.7151 | 97.5466 | 1.46214 |

Fixed mn

| m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 10240 | 10240 | 2560 | 3.12135 | 6.23817 | 1.99855 |
| 10240 | 10240 | 3840 | 4.59394 | 9.28166 | 2.02041 |
| 10240 | 10240 | 5120 | 7.15086 | 12.251 | 1.71322 |
| 10240 | 10240 | 6400 | 10.5324 | 14.7059 | 1.39625 |
| 10240 | 10240 | 7680 | 13.0499 | 18.0573 | 1.38372 |
| 10240 | 10240 | 8960 | 15.3995 | 20.6897 | 1.34353 |
| 10240 | 10240 | 10240 | 16.8406 | 23.4697 | 1.39364 |
| 10240 | 10240 | 11520 | 19.2673 | 26.2984 | 1.36493 |
| 10240 | 10240 | 12800 | 20.9322 | 29.0503 | 1.38782 |
| 10240 | 10240 | 14080 | 23.14 | 31.9612 | 1.38121 |
| 10240 | 10240 | 15360 | 25.6844 | 34.6865 | 1.35049 |
| 10240 | 10240 | 16640 | 26.2421 | 37.4893 | 1.42859 |
| 10240 | 10240 | 17920 | 30.1967 | 40.3297 | 1.33556 |
| 10240 | 10240 | 19200 | 32.4673 | 43.1666 | 1.32954 |
| 10240 | 10240 | 20480 | 33.5382 | 46.002 | 1.37163 |

SAM ViT-B shapes

| m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 32768 | 768 | 3072 | 1.22253 | 1.7901 | 1.46426 |
| 32768 | 2304 | 768 | 0.787232 | 1.33425 | 1.69486 |
| 32768 | 3072 | 768 | 1.04701 | 1.74003 | 1.66191 |
| 32768 | 768 | 768 | 0.271155 | 0.437884 | 1.61488 |
| 39200 | 2304 | 768 | 0.948154 | 1.5765 | 1.66271 |
| 39200 | 768 | 768 | 0.324627 | 0.510302 | 1.57196 |

I omitted some redundant columns from the saved CSV file; the correct and contiguous columns are all True.

msaroufim commented 2 months ago

Nice work @gau-nernst, pretty cool to see results that are uniformly faster. @philipbutler I would highly recommend using WSL or dual booting (I personally dual boot); getting Windows and CUDA to work together is just not worth it.

jcaip commented 2 months ago

@gau-nernst 💯 Thanks for running these - that's awesome! For others reading, I'd like to collect these, along with our A100 results, somewhere. So please contribute and I'll collate these together in a nice doc. We can collect block-sparse microbenchmarks too; I know @cpuhrsch is interested in those.

@philipbutler Thank you for giving it a shot + your edits were super helpful too :) Yeah, I agree with Mark that dual booting Linux is probably the easiest solution - but could you open an issue in pytorch for tracking purposes (feel free to tag me) about the lack of Windows support for semi-structured sparsity?

philipbutler commented 2 weeks ago

I'm back lol. @msaroufim I have joined you in dual booting

NVIDIA GeForce RTX 3060, Ubuntu 24.04, torch==2.4.0.dev20240604+cu121

Fixed k

| m | k | n | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 3072 | 10240 | 3072 | 3.749355 | 7.256761 | 1.935469 |
| 4096 | 10240 | 4096 | 6.678134 | 13.164187 | 1.971237 |
| 5120 | 10240 | 5120 | 10.565052 | 20.252486 | 1.916932 |
| 6144 | 10240 | 6144 | 15.589268 | 28.900475 | 1.853870 |
| 7168 | 10240 | 7168 | 21.814860 | 42.035703 | 1.926930 |
| 8192 | 10240 | 8192 | 35.252837 | 65.011371 | 1.844146 |
| 9216 | 10240 | 9216 | 36.577059 | 63.589550 | 1.738509 |
| 10240 | 10240 | 10240 | 45.712477 | 78.786396 | 1.723521 |
| 11264 | 10240 | 11264 | 54.966579 | 95.234777 | 1.732594 |
| 12288 | 10240 | 12288 | 66.754359 | 113.816444 | 1.705004 |
| 13312 | 10240 | 13312 | 77.615483 | 132.878653 | 1.712012 |
| 14336 | 10240 | 14336 | 88.930020 | 153.554204 | 1.726686 |
| 15360 | 10240 | 15360 | 104.564087 | 176.714434 | 1.690011 |
| 16384 | 10240 | 16384 | 117.693106 | 200.706747 | 1.705340 |
| 17408 | 10240 | 17408 | 133.979721 | 226.706458 | 1.692095 |
| 18432 | 10240 | 18432 | 154.624529 | 254.379024 | 1.645140 |
| 19456 | 10240 | 19456 | 176.906274 | 285.979967 | 1.616562 |
| 20480 | 10240 | 20480 | 220.200146 | 353.289990 | 1.604404 |

Fixed mn

| m | k | n | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 10240 | 2560 | 10240 | 10.746263 | 20.144145 | 1.874526 |
| 10240 | 3840 | 10240 | 16.097398 | 29.854866 | 1.854639 |
| 10240 | 5120 | 10240 | 21.708938 | 42.514653 | 1.958394 |
| 10240 | 6400 | 10240 | 28.194147 | 51.166154 | 1.814779 |
| 10240 | 7680 | 10240 | 33.638563 | 59.061538 | 1.755769 |
| 10240 | 8960 | 10240 | 39.498353 | 68.803321 | 1.741929 |
| 10240 | 10240 | 10240 | 45.607403 | 78.554697 | 1.722411 |
| 10240 | 11520 | 10240 | 51.829187 | 88.352723 | 1.704691 |
| 10240 | 12800 | 10240 | 57.777900 | 98.682663 | 1.707966 |
| 10240 | 14080 | 10240 | 64.676653 | 107.832529 | 1.667256 |
| 10240 | 15360 | 10240 | 71.463405 | 117.638282 | 1.646133 |
| 10240 | 16640 | 10240 | 74.602912 | 127.399095 | 1.707696 |
| 10240 | 17920 | 10240 | 84.782167 | 138.429159 | 1.632763 |
| 10240 | 19200 | 10240 | 90.615144 | 147.502713 | 1.627793 |
| 10240 | 20480 | 10240 | 97.573600 | 177.413003 | 1.818248 |

SAM ViT-B shapes

| m | k | n | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|
| 32768 | 3072 | 768 | 3.106270 | 6.297788 | 2.027444 |
| 32768 | 768 | 2304 | 2.698760 | 4.917082 | 1.821978 |
| 32768 | 768 | 3072 | 3.599539 | 6.097759 | 1.694039 |
| 32768 | 768 | 768 | 0.908029 | 1.753664 | 1.931286 |
| 39200 | 768 | 2304 | 3.648182 | 5.655835 | 1.550316 |
| 39200 | 768 | 768 | 1.087033 | 1.838328 | 1.691143 |