intel / intel-extension-for-pytorch

A Python package that extends the official PyTorch to easily obtain performance gains on Intel platforms
Apache License 2.0

Qwen-7b int8 inference fails when using run_quantization.py script #648

Open MadhumithaSrini opened 4 weeks ago

MadhumithaSrini commented 4 weeks ago

Describe the issue

  1. Tried running https://github.com/intel/intel-extension-for-pytorch/blob/release/2.3/examples/cpu/inference/python/llm/run.py to generate the q_config_summary file
  2. Then I tried inferring with ipex 2.3.0, torch 2.3.0, and transformers==4.38.2, and it fails with the following error:

     return forward_call(*args, **kwargs)
     RuntimeError: The following operation failed in the TorchScript interpreter.
     Traceback of TorchScript (most recent call last):
     RuntimeError: cannot return dims when ndims < 0

Question: I generated the q_config_summary file without enabling bf16, but at inference time I am enabling it via the "--quant-with-amp" flag. I tested a couple of other models (gpt-j-6b, chatglm3, llama-2-7b-chat-hf) and all of them pass, while Qwen fails at inference. So, does passing "--quant-with-amp" at inference but not while generating the config file matter?
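
For context, here is a minimal Python sketch of my understanding of the two modes. It is my own illustration, not the run.py source: the ipex.llm.optimize smooth-quant usage and the bf16 autocast wrapper are assumptions based on the ipex LLM examples, and the idea is that "--quant-with-amp" runs the quantized graph under bf16 autocast while the summary file above was calibrated without bf16.

```python
# Minimal sketch (not the run.py source); the ipex.llm.optimize smooth-quant
# path and the autocast wrapper are assumptions based on the ipex LLM examples.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float32, trust_remote_code=True
).eval()

amp_enabled = True  # True ~ "--quant-with-amp"; the summary file was generated with this off

# Load the pre-generated smooth-quant recipe and build the int8 model.
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()
model = ipex.llm.optimize(
    model,
    quantization_config=qconfig,
    qconfig_summary_file="./saved_results_qwen/best_configure.json",
    dtype=torch.bfloat16 if amp_enabled else torch.float32,
    inplace=True,
)

inputs = tokenizer("test prompt", return_tensors="pt")
# With amp enabled, generation runs inside a bf16 autocast region; without it,
# the same quantized graph runs in fp32, which is what it was calibrated for.
with torch.no_grad(), torch.cpu.amp.autocast(enabled=amp_enabled, dtype=torch.bfloat16):
    model.generate(inputs.input_ids, max_new_tokens=1)
```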

Vasud-ha commented 4 weeks ago

Hi @MadhumithaSrini, thanks for reporting it. We will check on our end and get back to you.

MadhumithaSrini commented 4 weeks ago

Thank you

Vasud-ha commented 3 weeks ago

Hi @MadhumithaSrini, could you please share the commands you used for step 1 and step 2?

MadhumithaSrini commented 3 weeks ago

Hi, sure. The command is:

    OMP_NUM_THREADS= numactl -m 0 python run.py --benchmark -m Qwen/Qwen-7B --output-dir saved_results_qwen --ipex-smooth-quant --quant-with-amp --qconfig-summary-file ./saved_results_qwen/best_configure.json --max-new-tokens 1 --input-tokens 512 --num-iter 6 --num-warmup 3 --batch-size 1 --profile
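
For reference, a rough Python sketch of what the benchmark flags mean as I read them (my own illustration, not the run.py internals, and it skips the int8/amp setup): a 512-token prompt at batch size 1, 6 iterations of which the first 3 are warm-up, and a single new token per iteration.

```python
# Hypothetical standalone sketch of the flag semantics above; run.py itself
# adds the ipex int8/amp handling, numactl binding, and --profile tracing.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()

# --input-tokens 512 --batch-size 1: one synthetic 512-token prompt
input_ids = torch.randint(0, 1000, (1, 512))

num_iter, num_warmup = 6, 3  # --num-iter 6 --num-warmup 3
latencies = []
with torch.no_grad():
    for i in range(num_iter):
        start = time.time()
        model.generate(input_ids, max_new_tokens=1)  # --max-new-tokens 1
        if i >= num_warmup:  # assume warm-up iterations are excluded from the stats
            latencies.append(time.time() - start)

print(f"average latency over measured iterations: {sum(latencies) / len(latencies):.3f} s")
```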