This project will no longer be maintained by Intel.
Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
Intel no longer accepts patches to this project.
Neural Speed is an innovative library designed to support the efficient inference of large language models (LLMs) on Intel platforms through the state-of-the-art (SOTA) low-bit quantization powered by Intel Neural Compressor. The work is inspired by llama.cpp and further optimized for Intel platforms with our innovations in NeurIPS' 2023
Neural Speed is under active development so APIs are subject to change.
Hardware | Supported |
---|---|
Intel Xeon Scalable Processors | ✔ |
Intel Xeon CPU Max Series | ✔ |
Intel Core Processors | ✔ |
Support almost all the LLMs in PyTorch format from Hugging Face such as Llama2, ChatGLM2, Baichuan2, Qwen, Mistral, Whisper, etc. File an issue if your favorite LLM does not work.
Support typical LLMs in GGUF format such as Llama2, Falcon, MPT, Bloom etc. More are coming. Check out the details.
pip install -r requirements.txt
pip install neural-speed
pip install .
Note: GCC requires version 10+
Install Intel Extension for Transformers to use Transformer-like APIs.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1" # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
# Specify the GGUF repo on the Hugginface
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the the specific gguf model file from the above repo
gguf_file = "llama-2-7b-chat.Q4_0.gguf"
# make sure you are granted to access this model on the Huggingface.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"
prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, gguf_file = gguf_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B" # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
Neural Speed
can be used in Neural Chat Server of Intel Extension for Transformers
. You can choose to enable it by adding use_neural_speed: true
in config.yaml
.
optimization
key section to use Neural Speed
and its RTN quantization (example).
device: "cpu"
optimization: use_neural_speed: true optimization_type: "weight_only" compute_dtype: "fp32" weight_dtype: "int4"
- add key `use_neural_speed` and key `use_gptq` to use `Neural Speed` and load `GPT-Q` model ([example](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/deployment/codegen/backend/pc/gptq/codegen.yaml)).
```yaml
device: "cpu"
use_neural_speed: true
use_gptq: true
More details please refer to Neural Chat.
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
# skip the step if GGUF model is from Hugging Face or generated by llama.cpp
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
# Using the quantize script requires a binary installation of Neural Speed
mkdir build&&cd build
cmake ..&&make -j
cd ..
python scripts/quantize.py --model_name gptj --model_file ne-f32.bin --out_file ne-q4_j.bin --build_dir ./build --weight_dtype int4 --alg sym
# Linux and WSL
OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see"
# Windows
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores|P-cores> --color -p "She opened the door and see"
Please refer to Advanced Usage for more details.
You can consider adding your own models, please follow the document: graph developer document.
Enable NEURAL_SPEED_VERBOSE
environment variable for performance profiling.
Available modes:
NS_PROFILING
to ON and recompile.NS_PROFILING
to ON and recompile.