pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

Autoquant fails on CPU with CPU packages #714

Open ynimmaga opened 2 months ago

ynimmaga commented 2 months ago

Autoquant fails when the CPU-only packages are used. I tried the latest nightly packages, installing torchao and torch with the command below:

pip install --pre torchao-nightly torch --index-url https://download.pytorch.org/whl/nightly/cpu

I modified the simple example from the documentation for CPU as below:

import torch
import torchao

# Plug in your model and example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64))
input = torch.randn(32,32)

# perform autoquantization and torch.compile
model = torchao.autoquant(torch.compile(model, mode='max-autotune'))

# pass in an input which is used in order to pick fastest quantization operations
# and apply torch compilation.
model(input)

The script above produces the following output:

activation_shapes: torch.Size([32, 32]), times_seen: 1
weight_shape: torch.Size([64, 32]), dtype: torch.float32, bias_shape: torch.Size([64])
warning: failed to autoquant AQFloatLinearWeight for shape: (torch.Size([32, 32]), torch.Size([64, 32]), torch.Size([64]), torch.float32) due to Torch not compiled with CUDA enabled
warning: failed to autoquant AQWeightOnlyQuantizedLinearWeight for shape: (torch.Size([32, 32]), torch.Size([64, 32]), torch.Size([64]), torch.float32) due to Torch not compiled with CUDA enabled
warning: failed to autoquant AQWeightOnlyQuantizedLinearWeight2 for shape: (torch.Size([32, 32]), torch.Size([64, 32]), torch.Size([64]), torch.float32) due to Torch not compiled with CUDA enabled
warning: failed to autoquant AQInt8DynamicallyQuantizedLinearWeight for shape: (torch.Size([32, 32]), torch.Size([64, 32]), torch.Size([64]), torch.float32) due to Torch not compiled with CUDA enabled
best_cls=<class 'torchao.quantization.autoquant.AQInt8DynamicallyQuantizedLinearWeight'>

Hope I am not missing any steps here. Does autoquant support CPU? Hope someone can give me some advice. Thank you.

msaroufim commented 2 months ago

We don't recommend using the torchao-nightly package anymore and will move to --pre torchao; there are still a few issues, though.

In the meantime, would you mind building from source using USE_CPP=0 pip install . and letting me know if that works?

ynimmaga commented 2 months ago

@msaroufim, thank you for the suggestion. I tried building from source with the above command and verified that the new version is installed from pip list. However, I still see the same error.
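For anyone repeating this check, the installed version can also be confirmed from Python rather than from pip list. This is a minimal sketch using only the standard library; the helper name installed_version is hypothetical, and the distribution names passed to it are assumptions about what is installed in your environment.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; it only reads package metadata
    and does not import the package itself.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to your environment.
    for name in ("torch", "torchao"):
        print(name, installed_version(name))
```

A source build normally reports a different version string than the nightly wheel, so this is a quick way to confirm which install is active.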

I also commented out the try/except block that prints the error message in autoquant.py and got a better traceback of the error:

activation_shapes: torch.Size([32, 32]), times_seen: 1
weight_shape: torch.Size([64, 32]), dtype: torch.float32, bias_shape: torch.Size([64])
Traceback (most recent call last):
  File "/home/ao/scripts/simple_example.py", line 13, in <module>
    model(input)
  File "/home/new_ao_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/new_ao_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1767, in _call_impl
    args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/autoquant.py", line 608, in autoquant_prehook
    module.finalize_autoquant()
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/autoquant.py", line 620, in finalize_autoquant
    _change_autoquantizable_to_quantized(
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/autoquant.py", line 494, in _change_autoquantizable_to_quantized
    _replace_with_custom_fn_if_matches_filter(
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/quant_api.py", line 187, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/quant_api.py", line 183, in _replace_with_custom_fn_if_matches_filter
    model = replacement_fn(model)
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/quant_api.py", line 232, in insert_subclass
    getattr(cls, from_float)(lin.weight, **kwargs), requires_grad=False
  File "/home/new_ao_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/autoquant.py", line 146, in to_quantized
    self.tune_autoquant(q_cls, shapes_and_dtype, time_for_best_shape)
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/autoquant.py", line 97, in tune_autoquant
    res = q_cls._autoquant_test(act_mat, self.weight, bias, best_time, self.mode)
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/autoquant.py", line 267, in _autoquant_test
    res = do_autoquant_bench(q_c_op, act_mat, w_qtensor, bias, warmup=25, rep=100)
  File "/home/new_ao_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/new_ao_env/lib/python3.10/site-packages/torchao/quantization/autoquant.py", line 214, in do_autoquant_bench
    torch.cuda.synchronize()
  File "/home/new_ao_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 948, in synchronize
    _lazy_init()
  File "/home/new_ao_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 306, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

It seems there are several CUDA calls in autoquant.py. To provide more context, I want to try autoquant with the torch.compile OpenVINO backend, which is a non-CUDA backend. Please let me know if you have any more suggestions.
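The traceback ends at an unconditional torch.cuda.synchronize() call inside the benchmarking helper. One way such a call could be guarded is sketched below. This is an illustration only, not torchao's do_autoquant_bench: the names pick_sync and bench_ms and the device_type parameter are hypothetical, and the real helper may time kernels differently.

```python
import time
from typing import Callable, Optional


def pick_sync(device_type: str) -> Optional[Callable[[], None]]:
    """Return a synchronize callable for the device, or None if none is needed.

    Hypothetical helper: only CUDA is handled here, since that is the call
    that fails in the traceback above.
    """
    if device_type == "cuda":
        # Imported here so the sketch also runs where torch is not installed.
        import torch

        return torch.cuda.synchronize
    return None


def bench_ms(fn, *args, device_type: str = "cpu", warmup: int = 5, rep: int = 20) -> float:
    """Average wall-clock time of fn(*args) in milliseconds (illustrative only)."""
    sync = pick_sync(device_type)
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()  # wait for queued GPU work before starting the timer
    start = time.perf_counter()
    for _ in range(rep):
        fn(*args)
    if sync is not None:
        sync()  # make sure all timed GPU work has finished
    return (time.perf_counter() - start) * 1e3 / rep
```

With tensors, a call might look like bench_ms(torch.nn.functional.linear, act, weight, bias, device_type=act.device.type), so that CPU tensors never reach torch.cuda.synchronize().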

msaroufim commented 2 months ago

Ah, interesting. Indeed, the script is CUDA-only right now, but perhaps we should fix that. Is this something you're interested in contributing? Otherwise, we were planning on taking a look at expanding our hardware support matrix after PyTorch Conference. cc @supriyar

supriyar commented 2 months ago

Hi @ynimmaga, nice to see you here. I believe we met last year at the PTC poster sessions and afterwards to discuss how to use PyTorch quantization with OpenVINO. A lot has happened since, and it's great to see more native integration of the OpenVINO backend with torch.compile.

As @msaroufim mentioned, the autoquant tool is currently built for GPU backends, and the code makes many assumptions that CUDA is available. While we are planning to expand HW support by adding some CPU kernels to torchao, we don't have plans for now to extend autoquant specifically for CPU. Are you interested specifically in supporting the OpenVINO backend via autoquant, or in general support for CPU backends?

ynimmaga commented 2 months ago

Thank you @supriyar. Nice to see you too :). We took your suggestion from last year and already support the PT2 quantization flow natively with OpenVINO (link). This flow works well, but we saw that the autoquant API provides an even better user experience (recommended by @agunapal). We are interested in supporting the OpenVINO backend specifically.

supriyar commented 2 months ago

@ynimmaga what kind of use cases do you have in mind? We won't have the bandwidth to support OpenVINO specifically, but if that's something your team would like to contribute to or work on, let me know! cc @HDCharles

ynimmaga commented 2 months ago

Thanks @supriyar. If there is general CPU support, we can easily support other torch.compile backends like OpenVINO. I will check internally and let you know if we can find bandwidth to work on this.