pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

Fail to reproduce benchmark results #1135

Open ThisisBillhe opened 3 weeks ago

ThisisBillhe commented 3 weeks ago

Hi! I tried to reproduce the benchmark results using torchao/_models/llama/generate.py, but I cannot benchmark the quantized model successfully. Specifically, with a torch version < 2.5.0, I get the following error:

  File "/mnt/workspace/Lumina-mGPT/torchao_benchmark.py", line 310, in main
    unwrap_tensor_subclass(model)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torchao/utils.py", line 287, in unwrap_tensor_subclass
    unwrap_tensor_subclass(child)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torchao/utils.py", line 287, in unwrap_tensor_subclass
    unwrap_tensor_subclass(child)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torchao/utils.py", line 287, in unwrap_tensor_subclass
    unwrap_tensor_subclass(child)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torchao/utils.py", line 286, in unwrap_tensor_subclass
    parametrize.register_parametrization(child, "weight", UnwrapTensorSubclass())
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torch/nn/utils/parametrize.py", line 562, in register_parametrization
    parametrizations = ParametrizationList([parametrization], original, unsafe=unsafe)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torch/nn/utils/parametrize.py", line 173, in __init__
    originali = Parameter(originali)
  File "/mnt/workspace/anaconda3/envs/lumina_mgpt/lib/python3.10/site-packages/torch/nn/parameter.py", line 40, in __new__
    return torch.Tensor._make_subclass(cls, data, requires_grad)
RuntimeError: Only Tensors of floating point and complex dtype can require gradients
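For context, this error is easy to trigger outside of torchao: `nn.Parameter` defaults to `requires_grad=True`, which is only legal for floating-point and complex tensors. A minimal sketch of what the parametrization step runs into, assuming the quantized weight surfaces as a plain integer tensor here:

```python
import torch

# Minimal repro of the error above: wrapping an int8 (quantized) weight
# in nn.Parameter with the default requires_grad=True raises.
int8_weight = torch.zeros(4, 4, dtype=torch.int8)
try:
    torch.nn.Parameter(int8_weight)
except RuntimeError as err:
    print(err)  # Only Tensors of floating point and complex dtype can require gradients

# Passing requires_grad=False sidesteps the check:
frozen = torch.nn.Parameter(int8_weight, requires_grad=False)
print(frozen.dtype)  # torch.int8
```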

After upgrading torch to 2.5.0, the process gets stuck and stops responding for a very long time:

Using device=cuda
Loading model ...
Time to load model: 54.85 seconds
Compiling Model
^C^C^C^C^C^C

I do not see any CPU usage with the top command, and I have to kill the process by its PID.

Also, is there any way to accelerate a Hugging Face model by quantizing it with torchao, without converting the model format?

supriyar commented 3 weeks ago

cc @HDCharles @jerryzh168

jerryzh168 commented 3 weeks ago

I haven't seen the require gradients error before, can you give us a repro?

for huggingface quantization, you can take a look at https://huggingface.co/docs/transformers/main/en/quantization/torchao
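The linked docs boil down to passing a `TorchAoConfig` to `from_pretrained`, so the model is quantized at load time with no separate format conversion. A minimal sketch, assuming int4 weight-only quantization; the model id is just a placeholder:

```python
from transformers import AutoModelForCausalLM, TorchAoConfig

# Quantize weights to int4 with torchao as the model is loaded;
# the checkpoint on disk stays in its original format.
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # placeholder model id
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
```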

ThisisBillhe commented 3 weeks ago

> I haven't seen the require gradients error before, can you give us a repro?
>
> for huggingface quantization, you can take a look at https://huggingface.co/docs/transformers/main/en/quantization/torchao

Hi, I am using the same benchmark script from your repo. You can see that if the torch version is < 2.5, the unwrap_tensor_subclass function is called, which leads to the error.
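That gate amounts to a simple version check; `needs_unwrap` below is a hypothetical helper for illustration only (the real script relies on torchao's own version flags in torchao.utils):

```python
def needs_unwrap(torch_version: str) -> bool:
    """True when torch is older than 2.5, i.e. when tensor-subclass weights
    must be unwrapped before torch.compile. Hypothetical helper."""
    release = torch_version.split("+")[0]  # drop local tags like "+cu121"
    major, minor = (int(p) for p in release.split(".")[:2])
    return (major, minor) < (2, 5)

print(needs_unwrap("2.4.1"))        # True  -> unwrap path, hits the Parameter error
print(needs_unwrap("2.5.0+cu121"))  # False -> subclasses go to compile directly
```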

As for the process getting stuck, it may be related to the static cache. The program gets stuck with the static cache on my A100 machine, but the same program works on my 3090 machine.