intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

Model execution is single threaded? #1663

Open · akhauriyash opened this issue 5 months ago

akhauriyash commented 5 months ago

Hello,

I am trying to run the LLM quantization example at https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/quantization/llm using the command below:

OMP_NUM_THREADS=32 python run_clm_no_trainer.py \
    --model facebook/opt-1.3b \
    --quantize \
    --sq \
    --alpha 0.5 \
    --ipex \
    --output_dir "saved_results" \
    --int8_bf16_mixed

However, in htop I see that only a single thread is being used, even when I set torch.set_num_threads(32). Execution is extremely slow, making SmoothQuant unusable in my case.
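
For reference, a minimal sketch for checking the thread configuration the process actually sees; torch.get_num_threads and torch.get_num_interop_threads are standard PyTorch calls, and the matmul loop is only a synthetic load to watch in htop:

import os
import torch

# OpenMP-related environment variables the process inherited.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
    print(f"{var}={os.environ.get(var)}")

# PyTorch's view of its thread pools; the intra-op count should match
# OMP_NUM_THREADS when the variable is set before the process starts.
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())

# Synthetic load: large matmuls should saturate the intra-op pool
# (and show up across cores in htop) if threading is working.
x = torch.randn(4096, 4096)
for _ in range(20):
    _ = x @ x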

I have a system with an Intel® Xeon® Gold 5218 processor.

Am I missing something? Thanks!

violetch24 commented 5 months ago

Hi @akhauriyash, I have not been able to reproduce this issue on several machines. Could you please share the environment where the issue occurs using pip list?
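
In case it helps with triage, a minimal sketch for collecting the relevant versions from Python (assuming the usual package names, and that torch, intel_extension_for_pytorch, and neural_compressor each expose __version__):

import platform
import torch

print("python:", platform.python_version())
print("torch:", torch.__version__)

# Optional packages used by the example; report them only if installed.
for name in ("intel_extension_for_pytorch", "neural_compressor"):
    try:
        module = __import__(name)
        print(name + ":", module.__version__)
    except ImportError:
        print(name + ": not installed")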