sparsification + dynamic int8 doesn't work

sayakpaul commented 3 months ago

nvidia-smi:

Mon Aug  5 08:50:51 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:53:00.0 Off |                    0 |
| N/A   37C    P0              68W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:64:00.0 Off |                    0 |
| N/A   40C    P0              68W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:75:00.0 Off |                    0 |
| N/A   37C    P0              67W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:86:00.0 Off |                    0 |
| N/A   42C    P0              68W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Command:

python benchmark_pixart.py --sparsify

Error:

Traceback (most recent call last):
  File "/fsx/sayak/diffusers-torchao/inference/benchmark_pixart.py", line 172, in <module>
    pipeline = load_pipeline(
  File "/fsx/sayak/diffusers-torchao/inference/benchmark_pixart.py", line 87, in load_pipeline
    sparsify_(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.9/site-packages/torchao/sparsity/sparse_api.py", line 73, in sparsify_
    _replace_with_custom_fn_if_matches_filter(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.9/site-packages/torchao/quantization/quant_api.py", line 175, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.9/site-packages/torchao/quantization/quant_api.py", line 175, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.9/site-packages/torchao/quantization/quant_api.py", line 175, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  [Previous line repeated 1 more time]
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.9/site-packages/torchao/quantization/quant_api.py", line 171, in _replace_with_custom_fn_if_matches_filter
    model = replacement_fn(model)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.9/site-packages/torchao/quantization/quant_api.py", line 262, in insert_subclass
    lin.weight = torch.nn.Parameter(constructor(lin.weight), requires_grad=False)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.9/site-packages/torchao/quantization/quant_api.py", line 455, in apply_int8_dynamic_activation_int8_weight_quant
    weight = to_affine_quantized(weight, mapping_type, block_size, target_dtype, eps=eps, zero_point_dtype=zero_point_dtype, layout_type=layout_type)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.9/site-packages/torchao/dtypes/affine_quantized_tensor.py", line 202, in from_float
    layout_tensor = layout_tensor_ctr(int_data, scale, zero_point, layout_type)
  File "/fsx/sayak/miniconda3/envs/diffusers/lib/python3.9/site-packages/torchao/dtypes/affine_quantized_tensor.py", line 489, in from_plain
    int_data_compressed = torch._cslt_compress(int_data)
RuntimeError: CUDA error: architecture mismatch when calling `cusparseLtInit(&handle)`

ao was installed from this commit: ed83ae2a69d68129993dd3a0ea5b8af7130abdd1.

diffusers was installed from qkv-rest branch (pip install git+https://github.com/huggingface/diffusers@qkv-rest).

Rest of the dependencies are:

PyTorch 2.4.0
transformers
accelerate
sentencepiece
beautifulsoup4
ftfy

sayakpaul commented 3 months ago

Using torch nightly doesn't solve it.

sayakpaul commented 3 months ago

CC: @jcaip

msaroufim commented 3 months ago

This looks like a known error: https://github.com/pytorch/pytorch/issues/115077. We're making a release tomorrow, so we'll look at this immediately after.

jcaip commented 3 months ago

Hey @sayakpaul, this is because our current version of cuSPARSELt does not support Hopper. The Neural Magic folks have also run into this: https://github.com/pytorch/pytorch/issues/132928

I plan to update our cuSPARSELt version to add support, but those int8 kernels were specifically designed for A100s so I would recommend using that hardware for max speedups.
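
For anyone hitting the same traceback, a quick pre-flight check can confirm which architecture the process sees before `sparsify_` is called. This is a minimal sketch, not part of torchao or diffusers-torchao: `arch_family` and `report` are hypothetical helper names, the capability table covers only the two GPUs discussed in this thread (A100 as sm_80, H100 as sm_90), and the `torch.backends.cusparselt` lookup is done defensively on the assumption that older builds may not expose it.

```python
# Hypothetical pre-flight helper; names are illustrative, not a torchao API.

def arch_family(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the architecture names used in this thread."""
    if (major, minor) == (8, 0):
        return "Ampere (A100)"
    if (major, minor) == (9, 0):
        return "Hopper (H100)"
    return f"other (sm_{major}{minor})"


def report() -> None:
    """Print what the current process sees; does nothing useful without PyTorch + CUDA."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; nothing to check.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device visible.")
        return
    major, minor = torch.cuda.get_device_capability(0)
    print("GPU 0:", torch.cuda.get_device_name(0), "->", arch_family(major, minor))
    # The cusparselt backend module may be absent on older builds, so look it up defensively.
    cslt = getattr(torch.backends, "cusparselt", None)
    if cslt is not None:
        print("cuSPARSELt available:", cslt.is_available(), "version:", cslt.version())


if __name__ == "__main__":
    report()
```

If GPU 0 reports Hopper and the bundled cuSPARSELt predates Hopper support, the `architecture mismatch` error from `cusparseLtInit` above is the expected outcome.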

sayakpaul commented 3 months ago

Thank you!

For now, I'd rather not mix results from different hardware in the benchmarks. Would you be able to share the A100 sparsification results with the current code?

jcaip commented 3 months ago

@sayakpaul Sure, I should have some time at the end of this week or the start of next week to run benchmarks. I'll let you know when I start working on them or if I hit any issues.

sayakpaul commented 3 months ago

Thanks!

jcaip commented 3 months ago

Hey @sayakpaul, I had a couple more people bump this issue, so I'm going to work on bumping the cuSPARSELt version to add Hopper support to the nightlies this week instead. Sorry about the additional delay, but I think this way I can unblock all of you at once.

Once I'm done I can grab Ampere numbers if the results from the nightlies aren't performant.

sayakpaul commented 3 months ago

Cool

jcaip commented 2 months ago

Sorry @sayakpaul, forgot to let you know before I went on PTO, but I landed https://github.com/pytorch/pytorch/pull/134022 last week, which added Hopper support to the nightlies for CUDA 12.4. You should be able to install with:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
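
After installing, it can help to confirm that the environment actually picked up a nightly cu124 wheel rather than a cached stable build. The sketch below is an assumption-laden illustration: `parse_build` is a hypothetical helper, and it relies on the usual convention that nightly version strings contain `.dev` and carry the CUDA tag after a `+` (the sample string in the comment is made up for illustration).

```python
# Hypothetical helper for sanity-checking the installed wheel; not a PyTorch API.

def parse_build(version: str):
    """Split a PyTorch version string into (is_nightly, local_tag).

    Example of the assumed shape: "2.5.0.dev20240901+cu124" (illustrative value).
    """
    base, _, local = version.partition("+")
    return (".dev" in base, local or None)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed.")
    else:
        is_nightly, tag = parse_build(torch.__version__)
        print("version:", torch.__version__)
        print("nightly build:", is_nightly, "| wheel tag:", tag)
        # torch.version.cuda is the CUDA version the wheel was built against (None for CPU builds).
        print("built against CUDA:", torch.version.cuda)
```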

sayakpaul commented 2 months ago

Thanks! Would it be possible to give this a try?

python benchmark_pixart.py --sparsify

branch: benchmark-pixart

jcaip commented 2 months ago

|                ckpt_id                 |   batch_size |  fuse  |  compile  |  compile_vae  |  quantization  |  sparsify  |   memory |   time |
|:--------------------------------------:|-------------:|:------:|:---------:|:-------------:|:--------------:|:----------:|---------:|-------:|
| PixArt-alpha/PixArt-Sigma-XL-2-1024-MS |            8 | False  |   False   |     False     |      None      |    True    |     9.44 | 29.165 |

FYI cuSPARSELt has some dense matrix shape constraints, so you must run with batch_size >= 8.
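
One way to respect the batch-size constraint when fewer prompts are available is to pad the batch and discard the extra outputs. This is only a sketch under stated assumptions: `pad_prompts` is a hypothetical helper, the minimum of 8 is taken from the comment above rather than from the cuSPARSELt documentation, and the `pipeline(...)` call in the usage comment stands in for whatever pipeline object the benchmark builds.

```python
# Hypothetical helper; the threshold comes from the comment above, not from cuSPARSELt docs.
MIN_BATCH = 8


def pad_prompts(prompts, min_batch=MIN_BATCH):
    """Repeat the last prompt until the batch has at least `min_batch` entries.

    Returns the padded list and the number of real prompts, so the caller
    can slice the outputs back down afterwards.
    """
    if not prompts:
        raise ValueError("need at least one prompt")
    n_real = len(prompts)
    padded = list(prompts) + [prompts[-1]] * max(0, min_batch - n_real)
    return padded, n_real


# Usage sketch (pipeline is assumed to exist and accept a list of prompts):
#   padded, n_real = pad_prompts(["a photo of a cat"])
#   images = pipeline(padded).images[:n_real]
```

Padding spends compute on outputs that are thrown away, so it only makes sense when the sparse kernels are otherwise worth using at that size.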

sayakpaul commented 2 months ago

Nice, good to know. So it should be used when we're in a compute-bound regime, right?

I will run this on Flux (12.5B) and see what happens. Thanks, Jesse!

nWEIdia commented 2 months ago

In case anyone is using Grace + Hopper, https://github.com/pytorch/pytorch/pull/136818 should bring parity to x86_64 + Hopper.
