intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

Unable to save llama2 after SmoothQuant #1600

Open dellamuradario opened 7 months ago

dellamuradario commented 7 months ago

Hi all,

I'm attempting to follow the SmoothQuant tutorial for the Llama-2-7b model: https://github.com/intel/neural-compressor/tree/master/examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static

System configuration: Windows 11, Python 3.10.11

My steps:

  1. CREATE PROJECT FOLDER: neural-compressor-tutorial
  2. CREATE VIRTUAL ENV: python -m venv neural-compressor-env
  3. DOWNLOAD: the example folder from the guide
  4. RUN: pip install neural-compressor and SKIP_RUNTIME=True pip install -r requirements.txt (successful)
  5. RUN: python prepare_model.py --input_model="meta-llama/Llama-2-7b-chat-hf" --output_model="./llama-2-7b-chat-hf" (successful)
  6. RUN WITH GIT BASH TERMINAL: bash run_quant.sh --input_model=C:/Users/Dario/Downloads/INTEL/neural-compressor-tutorial/llama-2-7b-chat-hf --output_model=C:/Users/Dario/Downloads/INTEL/neural-compressor-tutorial/output_model (this step boils down to the Python sketch below)
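
For context, run_quant.sh drives the example's main.py through an INC 2.x flow roughly like the following. This is a minimal sketch, not the example's actual code: the alpha value, the model path, and calib_dataloader are placeholders to be filled in per the example's README.

```python
# Minimal sketch of the PTQ + SmoothQuant flow that run_quant.sh wraps.
# The alpha value, paths, and dataloader below are placeholders, not the
# exact contents of the example's main.py.
from neural_compressor import PostTrainingQuantConfig, quantization

config = PostTrainingQuantConfig(
    approach="static",
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": 0.5},  # placeholder alpha
    },
)

# The calibration dataloader must yield batches of model inputs; building
# one for the llama decoder is what the example's main.py takes care of.
calib_dataloader = ...  # placeholder, defined by the example

q_model = quantization.fit(
    model="./llama-2-7b-chat-hf/decoder_model.onnx",  # hypothetical path
    conf=config,
    calib_dataloader=calib_dataloader,
)

# fit() returns None when no configuration meets the accuracy goal,
# which is exactly what happens in the log below.
if q_model is not None:
    q_model.save("./output_model/decoder_model.onnx")
```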

TERMINAL LOG - ERROR:

```
2024-02-02 11:28:30.1017397 [E:onnxruntime:, inference_session.cc:1935 onnxruntime::InferenceSession::Initialize::<lambda_5a23845ba810e30de3b9e7b450415bf5>::operator ()] Exception during initialization: bad allocation
2024-02-02 11:28:30 [ERROR] Unexpected exception RuntimeException('[ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: bad allocation') happened during tuning.
Traceback (most recent call last):
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\quantization.py", line 234, in fit
    strategy.traverse()
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
    super().traverse()
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\strategy\strategy.py", line 483, in traverse
    self._setup_pre_tuning_algo_scheduler()
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\strategy\strategy.py", line 361, in _setup_pre_tuning_algo_scheduler
    self.model = self._pre_tuning_algo_scheduler("pre_quantization")
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\algorithm\algorithm.py", line 127, in __call__
    self._q_model = algo(self._origin_model, self._q_model, self._adaptor, self._dataloader, self._calib_iter)
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\algorithm\smooth_quant.py", line 89, in __call__
    q_model = adaptor.smooth_quant(
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 228, in smooth_quant
    self.smooth_quant_model = self.sq.transform(**self.cur_sq_args)
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\adaptor\ox_utils\smooth_quant.py", line 183, in transform
    self._dump_op_info(percentile, op_types, calib_iter, quantize_config)
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\adaptor\ox_utils\smooth_quant.py", line 395, in _dump_op_info
    self.max_vals_per_channel, self.shape_info, self.tensors_to_node = augment.calib_smooth(
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\adaptor\ox_utils\calibration.py", line 774, in calib_smooth
    _, output_dicts = self.get_intermediate_outputs()
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\neural_compressor\adaptor\ox_utils\calibration.py", line 254, in get_intermediate_outputs
    else onnxruntime.InferenceSession(self.model_wrapper.model_path + "_augment.onnx", so, providers=[backend])
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\neural-compressor-env\lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 483, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: bad allocation
2024-02-02 11:28:36 [ERROR] Specified timeout or max trials is reached! Not found any quantized model which meet accuracy goal. Exit.
model: decoder_model.onnx
args.output_model: C:/Users/Dario/Downloads/INTEL/neural-compressor-tutorial/output_model
Traceback (most recent call last):
  File "C:\Users\Dario\Downloads\INTEL\neural-compressor-tutorial\main.py", line 336, in <module>
    q_model.save(os.path.join(args.output_model, model))
AttributeError: 'NoneType' object has no attribute 'save'
```
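
Worth noting: the final AttributeError is a downstream symptom rather than a separate bug. quantization.fit() returns None when tuning fails, and main.py (line 336 in the traceback) calls .save() on the result unconditionally. A hypothetical guard like the one below would leave the ONNX Runtime bad allocation as the only reported error:

```python
# Hypothetical guard around the save call in the example's main.py
# (line 336 in the traceback above): fit() returns None when no
# quantized model meets the accuracy goal.
if q_model is None:
    raise RuntimeError(
        "Quantization did not produce a model (see the ONNX Runtime "
        "'bad allocation' error above); nothing to save."
    )
q_model.save(os.path.join(args.output_model, model))
```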

What could be causing this? Did I miss any crucial steps during installation or while executing the commands listed above?

Thank you for any suggestions.

yuwenzho commented 6 months ago

`RUNTIME_EXCEPTION : Exception during initialization: bad allocation` is raised when the InferenceSession is created. This looks like a memory allocation issue, i.e. the process runs out of memory while loading the augmented model for calibration. You can try tracking your memory consumption, e.g. with the sketch below.
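
A minimal way to do that, assuming psutil is installed (pip install psutil; it is not in the example's requirements.txt), is to poll the process's resident set size from a background thread while quantization runs:

```python
# Minimal sketch: poll this process's resident set size (RSS) in a
# background thread and print the running peak. Assumes psutil is
# installed; it is not a dependency of the example itself.
import os
import threading
import time

import psutil


def log_peak_rss(interval_s: float = 0.5) -> threading.Thread:
    """Print the running peak RSS of the current process."""
    proc = psutil.Process(os.getpid())
    peak = 0

    def _poll() -> None:
        nonlocal peak
        while True:
            rss = proc.memory_info().rss
            if rss > peak:
                peak = rss
                print(f"peak RSS so far: {peak / 2**30:.2f} GiB")
            time.sleep(interval_s)

    thread = threading.Thread(target=_poll, daemon=True)
    thread.start()
    return thread


# Start the monitor before quantization.fit(...) / InferenceSession
# creation; if the peak approaches your installed RAM, the
# "bad allocation" is an out-of-memory failure.
log_peak_rss()
```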