Closed: shaahji closed this pull request 1 month ago.
Can you also add a unit test for this?
Some feedback....
- whack-a-mole package install. I had to install auto-awq, which was not easy as the package name is different to the module (i.e. the awq module not found error leads a user to try pip install awq, which is not correct).
- --providers_list assumes the user knows about ORT, e.g. CPUExecutionProvider. The help information should enumerate the different options for the user.
- An E2E here would be to run Quantization -> Capture the ONNX Graph for use in ORT. I therefore tried to take the output of quantization and run it through capture-onnx-graph. Below is what happened:
Firstly, I tried the Dynamo exporter option:
olive capture-onnx-graph \
--model_name_or_path models/qwen-awq/awq/cpu-cpu_model/model \
--use_dynamo_exporter True \
--use_ort_genai True \
--output_path models/qwen-awq/captured \
--device cpu \
--log_level 1
This hit the following error:
[2024-09-17 09:35:36,294] [INFO] [engine.py:874:_run_pass] Running pass c:OnnxConversion
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
[2024-09-17 09:35:40,494] [ERROR] [engine.py:972:_run_pass] Pass run failed.
Traceback (most recent call last):
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/engine/engine.py", line 960, in _run_pass
output_model_config = host.run_pass(p, input_model_config, output_model_path, pass_search_point)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/systems/local.py", line 30, in run_pass
output_model = the_pass.run(model, output_model_path, point)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/olive_pass.py", line 206, in run
output_model = self._run_for_config(model, config, output_model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 116, in _run_for_config
output_model = self._run_for_config_internal(model, config, output_model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 149, in _run_for_config_internal
return self._convert_model_on_device(model, config, output_model_path, device, torch_dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 367, in _convert_model_on_device
converted_onnx_model = OnnxConversion._export_pytorch_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 205, in _export_pytorch_model
pytorch_model(*dummy_inputs)
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1104, in forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 915, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 655, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 542, in forward
query_states = self.q_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/awq/modules/linear/gemm.py", line 243, in forward
out = WQLinearMMFunction.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/autograd/function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/awq/modules/linear/gemm.py", line 47, in forward
out = awq_ext.gemm_forward_cuda(
^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/usr/share/miniconda3/envs/build/lib/python3.11/site-packages/torch/include/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch.
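The failing frame is awq_ext.gemm_forward_cuda, i.e. the AWQ GEMM kernel is CUDA-only, so tracing the AWQ-quantized model for export with --device cpu cannot succeed. Below is a minimal sketch of a pre-export guard that could surface this as an actionable error instead of an internal assert (hypothetical helper, not existing Olive code; it only assumes the quantized layers live under the awq.modules.linear package, as the traceback shows):

```python
# Hypothetical pre-export guard (not existing Olive code): fail fast with an
# actionable message instead of hitting the CUDA-only AWQ kernel during tracing.
import torch


def assert_awq_exportable(model: torch.nn.Module, device: str) -> None:
    has_awq_linear = any(
        type(module).__module__.startswith("awq.modules.linear")
        for module in model.modules()
    )
    if has_awq_linear and device != "cuda":
        raise RuntimeError(
            "The model contains AWQ-quantized linear layers whose kernels are "
            "CUDA-only; run the ONNX conversion on a CUDA device instead of CPU."
        )
```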
So, next I tried the Model Builder option. This worked... BUT it is not clear what really happened - I had to set the --precision int4 option... does that re-quantize the model using RTN? The model output from Model Builder was 1.3GB, compared to 0.6MB for the safetensor file after the AWQ quantization.
whack-a-mole package install. I had to install auto-awq, which was not easy as the package name is different to the module (i.e. awq module not found error leads a user to try pip install awq which is not correct.)
Ironically, I ran into the same problem initially. I had some thoughts about addressing this at a larger level. Olive already knows the dependencies for each Pass. They are defined in olive_config.json. We could potentially iterate and verify that those dependencies are present even before we start running the passes. That avoids a long wait before a workflow fails after running a few expensive passes. Thoughts?
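To make that concrete, here is a minimal sketch of such a pre-flight check (the olive_config.json layout, the module_dependencies key, and the module-to-package mapping are assumptions for illustration only; the mapping would also address the awq vs. autoawq naming confusion):

```python
# Minimal sketch of a pre-flight dependency check; the config layout used here
# (passes -> <pass name> -> module_dependencies) is an assumption for illustration.
import importlib.util
import json

# Some packages install a module whose name differs from the pip package name,
# e.g. `pip install autoawq` provides the `awq` module.
MODULE_TO_PACKAGE = {"awq": "autoawq"}


def missing_pass_dependencies(config_path: str, pass_names: list[str]) -> list[str]:
    """Return install hints for any pass dependencies that are not importable."""
    with open(config_path) as f:
        olive_config = json.load(f)

    hints = []
    for name in pass_names:
        deps = olive_config.get("passes", {}).get(name, {}).get("module_dependencies", [])
        for module in deps:
            if importlib.util.find_spec(module) is None:
                package = MODULE_TO_PACKAGE.get(module, module)
                hints.append(f"Pass '{name}' requires '{module}': pip install {package}")
    return hints


# Example: fail fast before any expensive pass runs.
# print(missing_pass_dependencies("olive_config.json", ["<pass name>"]))
```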
It is not really clear what is expected in the data_config YAML/JSON.
The data_config requirement is removed in a follow-up commit. The command line arguments are in line with the finetune command, i.e. data_name, train_subset, eval_subset, etc.
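For example, an invocation in that shape might look like the following (the flag names and placeholder values are an assumption modeled on the finetune command and the capture-onnx-graph call above; olive quantize --help has the authoritative list):

```
olive quantize \
    --model_name_or_path <HF model id or local path> \
    --algorithm awq \
    --data_name <HF dataset name> \
    --train_subset train \
    --output_path models/qwen-awq \
    --log_level 1
```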
The help file should give some more information on the different algorithms. For example, it would be good to know that AWQ will output a 4bit model.
There can never be enough information in help. :) One thing could always be argued to be more important than another. I propose providing a link to each algorithm's documentation.
The --providers_list assumes the user knows about ORT e.g. CPUExecutionProvider. The help information should enumerate the different options for the user.
I will add the available options to the choices list.
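Roughly like this (the provider names are standard ONNX Runtime execution provider identifiers; the exact subset the command will accept may differ from this sketch):

```python
from argparse import ArgumentParser

parser = ArgumentParser("capture-onnx-graph")
parser.add_argument(
    "--providers_list",
    nargs="*",
    default=["CPUExecutionProvider"],
    # Enumerating the options makes `--help` self-explanatory and lets argparse
    # reject typos; the exact set Olive accepts may differ from this sketch.
    choices=[
        "CPUExecutionProvider",
        "CUDAExecutionProvider",
        "DmlExecutionProvider",
        "TensorrtExecutionProvider",
        "OpenVINOExecutionProvider",
    ],
    help="ONNX Runtime execution providers to target for the exported model.",
)
args = parser.parse_args(["--providers_list", "CPUExecutionProvider"])
print(args.providers_list)  # ['CPUExecutionProvider']
```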
How would a user evaluate the results for speed-up/memory utilization/quality? Taking a step back, the motivation for quantization is to lower footprint and speed up execution without sacrificing efficacy for the task. The CLI command allows a user to try different algorithms (good), but it needs some evaluation information so that the user can make a decision on the "best" method.
As I understand the intent, CLI commands are meant to "do one job only". For evaluation, we might introduce a separate CLI command that the user can chain with this one.
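As a sketch of how such a chain could read end to end (only quantize and capture-onnx-graph exist in this PR; the final evaluation command, including its name and flags, is purely hypothetical):

```
# Quantize, then capture the ONNX graph from the quantized output.
olive quantize ... --output_path models/qwen-awq
olive capture-onnx-graph \
    --model_name_or_path models/qwen-awq/awq/cpu-cpu_model/model \
    --output_path models/qwen-awq/captured

# Hypothetical future command: report latency/memory/accuracy so the user can
# pick the "best" algorithm.
olive evaluate --model_name_or_path models/qwen-awq/captured --device cpu
```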
All comments/inputs addressed.
Quantize: CLI command to quantize input model
Usage:
Checklist before requesting a review
`lintrunner -a`
(Optional) Issue link