Closed: shaahji closed this pull request 1 month ago.
Can you also add a unit test for this?
Some feedback....
- whack-a-mole package install. I had to install auto-awq, which was not easy as the package name is different to the module (i.e. the awq module not found error leads a user to try pip install awq, which is not correct).
- --providers_list assumes the user knows about ORT, e.g. CPUExecutionProvider. The help information should enumerate the different options for the user.
- An E2E here would be to run Quantization -> Capture the ONNX Graph for use in ORT. I therefore tried to take the output of quantization and run it through capture-onnx-graph. Below is what happened:
Firstly, I tried the Dynamo exporter option:
olive capture-onnx-graph \
--model_name_or_path models/qwen-awq/awq/cpu-cpu_model/model \
--use_dynamo_exporter True \
--use_ort_genai True \
--output_path models/qwen-awq/captured \
--device cpu \
--log_level 1
This hit the following error:
[2024-09-17 09:35:36,294] [INFO] [engine.py:874:_run_pass] Running pass c:OnnxConversion
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
[2024-09-17 09:35:40,494] [ERROR] [engine.py:972:_run_pass] Pass run failed.
Traceback (most recent call last):
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/engine/engine.py", line 960, in _run_pass
output_model_config = host.run_pass(p, input_model_config, output_model_path, pass_search_point)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/systems/local.py", line 30, in run_pass
output_model = the_pass.run(model, output_model_path, point)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/olive_pass.py", line 206, in run
output_model = self._run_for_config(model, config, output_model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 116, in _run_for_config
output_model = self._run_for_config_internal(model, config, output_model_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 149, in _run_for_config_internal
return self._convert_model_on_device(model, config, output_model_path, device, torch_dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 367, in _convert_model_on_device
converted_onnx_model = OnnxConversion._export_pytorch_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/olive/passes/onnx/conversion.py", line 205, in _export_pytorch_model
pytorch_model(*dummy_inputs)
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1104, in forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 915, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 655, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 542, in forward
query_states = self.q_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/awq/modules/linear/gemm.py", line 243, in forward
out = WQLinearMMFunction.apply(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/torch/autograd/function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/quant-cli/lib/python3.11/site-packages/awq/modules/linear/gemm.py", line 47, in forward
out = awq_ext.gemm_forward_cuda(
^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "/usr/share/miniconda3/envs/build/lib/python3.11/site-packages/torch/include/c10/cuda/impl/CUDAGuardImpl.h":28, please report a bug to PyTorch.
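The failing frame is awq_ext.gemm_forward_cuda, i.e. the AWQ GEMM kernel is CUDA-only, so tracing the AWQ-quantized model for export with --device cpu cannot succeed. Below is a minimal sketch of a pre-export guard that could surface this as an actionable error instead of an internal assert (hypothetical helper, not existing Olive code; it only assumes the quantized layers live under the awq.modules.linear package, as the traceback shows):

```python
# Hypothetical pre-export guard (not existing Olive code): fail fast with an
# actionable message instead of hitting the CUDA-only AWQ kernel during tracing.
import torch


def assert_awq_exportable(model: torch.nn.Module, device: str) -> None:
    has_awq_linear = any(
        type(module).__module__.startswith("awq.modules.linear")
        for module in model.modules()
    )
    if has_awq_linear and device != "cuda":
        raise RuntimeError(
            "The model contains AWQ-quantized linear layers whose kernels are "
            "CUDA-only; run the ONNX conversion on a CUDA device instead of CPU."
        )
```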
So, next I tried the Model Builder option. This worked... BUT it is not clear what really happened - I had to set the --precision int4 option... does that re-quantize the model using RTN? The model output from Model Builder was 1.3GB, compared to 0.6MB for the safetensor file after the AWQ quantization.
whack-a-mole package install. I had to install auto-awq, which was not easy as the package name is different to the module (i.e. awq module not found error leads a user to try pip install awq which is not correct.)
Ironically, I ran into the same problem initially. I had some thoughts about addressing this at a larger level. Olive already knows the dependencies for each Pass. They are defined in olive_config.json. We could potentially iterate and verify that those dependencies are present even before we start running the passes. That avoids a long wait before a workflow fails after running a few expensive passes. Thoughts?
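To make that concrete, here is a minimal sketch of such a pre-flight check (the olive_config.json layout, the module_dependencies key, and the module-to-package mapping are assumptions for illustration only; the mapping would also address the awq vs. autoawq naming confusion):

```python
# Minimal sketch of a pre-flight dependency check; the config layout used here
# (passes -> <pass name> -> module_dependencies) is an assumption for illustration.
import importlib.util
import json

# Some packages install a module whose name differs from the pip package name,
# e.g. `pip install autoawq` provides the `awq` module.
MODULE_TO_PACKAGE = {"awq": "autoawq"}


def missing_pass_dependencies(config_path: str, pass_names: list[str]) -> list[str]:
    """Return install hints for any pass dependencies that are not importable."""
    with open(config_path) as f:
        olive_config = json.load(f)

    hints = []
    for name in pass_names:
        deps = olive_config.get("passes", {}).get(name, {}).get("module_dependencies", [])
        for module in deps:
            if importlib.util.find_spec(module) is None:
                package = MODULE_TO_PACKAGE.get(module, module)
                hints.append(f"Pass '{name}' requires '{module}': pip install {package}")
    return hints


# Example: fail fast before any expensive pass runs.
# print(missing_pass_dependencies("olive_config.json", ["<pass name>"]))
```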
It is not really clear what is expected in the data_config YAML/JSON.
The data_config requirement is removed in a follow-up commit. The command line arguments are in line with the finetune command, i.e. data_name, train_subset, eval_subset, etc.
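For example, an invocation in that shape might look like the following (the flag names and placeholder values are an assumption modeled on the finetune command and the capture-onnx-graph call above; olive quantize --help has the authoritative list):

```
olive quantize \
    --model_name_or_path <HF model id or local path> \
    --algorithm awq \
    --data_name <HF dataset name> \
    --train_subset train \
    --output_path models/qwen-awq \
    --log_level 1
```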
The help file should give some more information on the different algorithms. For example, it would be good to know that AWQ will output a 4bit model.
There can never be enough information in help. :) One thing could always be argued to be more important than another. I propose providing a link to each algorithm's documentation.
The --providers_list assumes the user knows about ORT e.g. CPUExecutionProvider. The help information should enumerate the different options for the user.
I will add the available options to the choices list.
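Roughly like this (the provider names are standard ONNX Runtime execution provider identifiers; the exact subset the command will accept may differ from this sketch):

```python
from argparse import ArgumentParser

parser = ArgumentParser("capture-onnx-graph")
parser.add_argument(
    "--providers_list",
    nargs="*",
    default=["CPUExecutionProvider"],
    # Enumerating the options makes `--help` self-explanatory and lets argparse
    # reject typos; the exact set Olive accepts may differ from this sketch.
    choices=[
        "CPUExecutionProvider",
        "CUDAExecutionProvider",
        "DmlExecutionProvider",
        "TensorrtExecutionProvider",
        "OpenVINOExecutionProvider",
    ],
    help="ONNX Runtime execution providers to target for the exported model.",
)
args = parser.parse_args(["--providers_list", "CPUExecutionProvider"])
print(args.providers_list)  # ['CPUExecutionProvider']
```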
How would a user evaluate the results for speed-up/memory utilization/quality? Taking a step back, the motivation for quantization is to lower footprint and speed up execution without sacrificing efficacy for the task. The CLI command allows a user to try different algorithms (good), but it needs some evaluation information so that the user can make a decision on the "best" method.
As I understand the intent, CLI commands are meant to "do one job only". For evaluation, we might introduce a separate CLI command that the user can chain with this one.
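As a sketch of how such a chain could read end to end (only quantize and capture-onnx-graph exist in this PR; the final evaluation command, including its name and flags, is purely hypothetical):

```
# Quantize, then capture the ONNX graph from the quantized output.
olive quantize ... --output_path models/qwen-awq
olive capture-onnx-graph \
    --model_name_or_path models/qwen-awq/awq/cpu-cpu_model/model \
    --output_path models/qwen-awq/captured

# Hypothetical future command: report latency/memory/accuracy so the user can
# pick the "best" algorithm.
olive evaluate --model_name_or_path models/qwen-awq/captured --device cpu
```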
All comments/inputs addressed.
Quantize: CLI command to quantize input model
Usage:
Checklist before requesting a review
`lintrunner -a`
(Optional) Issue link