Closed: sayakpaul closed this issue 2 months ago.
Using torch nightly doesn't solve it.
CC: @jcaip
The error seems to be known here https://github.com/pytorch/pytorch/issues/115077 - we're making a release tomorrow so will look at this immediately after
Hey @sayakpaul, this is because our current version of cuSPARSELt does not support Hopper; the Neural Magic folks have also reported this issue: https://github.com/pytorch/pytorch/issues/132928
I plan to update our cuSPARSELt version to add support, but those int8 kernels were specifically designed for A100s so I would recommend using that hardware for max speedups.
Thank you!
I'd rather not mix results from different hardware in the benchmarks. Would you be able to provide the A100 sparsification results with the current code?
@sayakpaul Sure, I should have some time at the end of this week or the start of next to run the benchmarks. I'll let you know when I start working on them, and if I run into any issues.
Thanks!
Hey @sayakpaul, a few more people have bumped this issue, so I'm going to work on updating the cuSPARSELt version in the nightlies to add Hopper support this week instead. Sorry about the additional delay, but I think this way I can unblock all of you at once.
Once I'm done I can grab Ampere numbers if the results from the nightlies aren't performant.
Cool
Sorry @sayakpaul, forgot to let you know before I went on PTO, but I landed https://github.com/pytorch/pytorch/pull/134022 last week, which added Hopper support to the nightlies for CUDA 12.4. You should be able to install with:
```
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
```
Thanks! Would it be possible to give this a try?
```
python benchmark_pixart.py --sparsify
```

branch: `benchmark-pixart`
| ckpt_id | batch_size | fuse | compile | compile_vae | quantization | sparsify | memory | time |
|:--------------------------------------:|-------------:|:------:|:---------:|:-------------:|:--------------:|:----------:|---------:|-------:|
| PixArt-alpha/PixArt-Sigma-XL-2-1024-MS | 8 | False | False | False | None | True | 9.44 | 29.165 |
FYI cuSPARSELt has some dense matrix shape constraints, so you must run with batch_size >= 8.
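For context, what cuSPARSELt accelerates here is 2:4 semi-structured sparsity, which keeps the two largest-magnitude values in every contiguous group of four weights. A minimal NumPy sketch of that pruning pattern (illustrative only, not the actual torchao/cuSPARSELt implementation):

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude entries in every group of four
    consecutive values along the last axis (2:4 semi-structured sparsity)."""
    rows, cols = w.shape
    assert cols % 4 == 0, "last dim must be divisible by 4"
    groups = w.reshape(rows, cols // 4, 4)
    # indices of the two largest |values| in each group of four
    keep = np.argsort(np.abs(groups), axis=-1)[..., 2:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.random.randn(8, 16).astype(np.float32)
sparse_w = prune_2_4(w)
# every group of 4 in sparse_w now has at most 2 nonzeros
```

The hardware then stores only the kept values plus small metadata indices, which is where the speedup and memory savings come from.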
Nice. Just to confirm: it should be used when we're in a compute-bound regime, right?
I will run this on Flux (12.5B) and see what happens. Thanks, Jesse!
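Roughly speaking, 2:4 sparsity can at most halve the matmul FLOPs, so it only pays off when the GEMM is compute-bound rather than memory-bound. A back-of-envelope sketch with illustrative shapes (not taken from PixArt or Flux):

```python
# Illustrative FLOP count for a single linear layer GEMM; the shapes
# below are hypothetical, chosen only to show the ratio.
M, K, N = 8 * 1024, 4096, 4096    # (batch * tokens) x in_features x out_features
dense_flops = 2 * M * K * N       # one multiply-accumulate = 2 FLOPs
sparse_flops = dense_flops // 2   # 2:4 sparsity skips half the multiplies
print(dense_flops / sparse_flops) # prints 2.0: the FLOP-level upper bound on speedup
```

Real speedups are lower than this bound, since only the matmul itself is accelerated and small batches leave the kernel memory-bound.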
In case anyone is using Grace + Hopper, https://github.com/pytorch/pytorch/pull/136818 should bring parity to x86_64 + Hopper.
Command: `nvidia-smi`

Error:

`ao` was installed from this commit: `ed83ae2a69d68129993dd3a0ea5b8af7130abdd1`. `diffusers` was installed from the `qkv-rest` branch (`pip install git+https://github.com/huggingface/diffusers@qkv-rest`). Rest of the dependencies are:
transformers
accelerate
sentencepiece
beautifulsoup4
ftfy