## 🏃Installation
### Quick Install from Pypi
```bash
pip install intel-extension-for-transformers
```
> For system requirements and other installation tips, please refer to the [Installation Guide](./docs/installation.md).
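As a quick sanity check after installing, you can import the package from Python; the version lookup below is guarded because the exact `__version__` attribute is an assumption (a successful import alone already confirms the installation).
```python
# Minimal post-install sanity check (illustrative).
# A clean import confirms the package is installed; the __version__ attribute
# is an assumption and may not exist in every release, hence the fallback.
import intel_extension_for_transformers as itrex

print(getattr(itrex, "__version__", "installed (no version attribute found)"))
```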
## 🌟Introduction
Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the following key features and examples:
* Seamless user experience of model compression on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor)
* Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper [Fast Distilbert on CPUs](https://arxiv.org/abs/2211.07715) and [QuaLA-MiniLM: a Quantized Length Adaptive MiniLM](https://arxiv.org/abs/2210.17114), and NeurIPS 2021's paper [Prune Once for All: Sparse Pre-Trained Language Models](https://arxiv.org/abs/2111.05754))
* Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa)
* [NeuralChat](intel_extension_for_transformers/neural_chat), a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of [plugins](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/advanced_features.md) such as [Knowledge Retrieval](./intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md), [Speech Interaction](./intel_extension_for_transformers/neural_chat/pipeline/plugins/audio/README.md), [Query Caching](./intel_extension_for_transformers/neural_chat/pipeline/plugins/caching/README.md), and [Security Guardrail](./intel_extension_for_transformers/neural_chat/pipeline/plugins/security/README.md). This framework supports Intel Gaudi2/CPU/GPU.
* [Inference](https://github.com/intel/neural-speed/tree/main) of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels for Intel CPU and Intel GPU (TBD), supporting [GPT-NEOX](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox), [LLAMA](https://github.com/intel/neural-speed/tree/main/neural_speed/models/llama), [MPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/mpt), [FALCON](https://github.com/intel/neural-speed/tree/main/neural_speed/models/falcon), [BLOOM-7B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/bloom), [OPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/opt), [ChatGLM2-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/chatglm), [GPT-J-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptj), and [Dolly-v2-3B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox). The AMX, VNNI, AVX512F, and AVX2 instruction sets are supported. We've boosted performance on Intel CPUs, with a particular focus on the 4th-generation Intel Xeon Scalable processor, codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html).
## 🔓Validated Hardware
| Hardware | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| Intel Gaudi2 | ✔ | ✔ | WIP (FP8) | - |
| Intel Xeon Scalable Processors | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Xeon CPU Max Series | ✔ | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
| Intel Data Center GPU Max Series | WIP | WIP | WIP (INT8) | ✔ (INT4) |
| Intel Arc A-Series | - | - | WIP (INT8) | ✔ (INT4) |
| Intel Core Processors | - | ✔ | ✔ (INT8, FP8) | ✔ (INT4, FP4, NF4) |
> In the table above, "-" means not applicable or not started yet.
## 🔓Validated Software
| Software | Fine-Tuning (Full) | Fine-Tuning (PEFT) | Inference (8-bit) | Inference (4-bit) |
|---|---|---|---|---|
| PyTorch | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.0.1+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) | 2.1.0+cpu, 2.0.1a0 (gpu) |
| Intel® Extension for PyTorch | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu | 2.1.0+cpu, 2.0.110+xpu |
| Transformers | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) | 4.35.2 (CPU), 4.31.0 (Intel GPU) |
| Synapse AI | 1.13.0 | 1.13.0 | 1.13.0 | 1.13.0 |
| Gaudi2 driver | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 | 1.13.0-ee32e42 |
| intel-level-zero-gpu | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 | 1.3.26918.50-736~22.04 |
> Please refer to the detailed requirements for [CPU](intel_extension_for_transformers/neural_chat/requirements_cpu.txt), [Gaudi2](intel_extension_for_transformers/neural_chat/requirements_hpu.txt), and [Intel GPU](intel_extension_for_transformers/neural_chat/requirements_xpu.txt).
## 🔓Validated OS
Ubuntu 20.04/22.04, CentOS 8.
## 🌱Getting Started
### Chatbot
Below is the sample code to create your chatbot. See more [examples](intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).
#### Serving (OpenAI-compatible RESTful APIs)
NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for OpenAI APIs.
You can start the NeuralChat server using either a shell command or Python code.
```shell
# Shell Command
neuralchat_server start --config_file ./server/config/neuralchat.yaml
```
```python
# Python Code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")
```
The NeuralChat service can be accessed through the [OpenAI client library](https://github.com/openai/openai-python), `curl` commands, and the `requests` library. See more in [NeuralChat](intel_extension_for_transformers/neural_chat/README.md).
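For example, a minimal sketch using the OpenAI Python client could look like the following; the server address, port, and model name are assumptions, so substitute the values from your own `neuralchat.yaml`.
```python
# Sketch of querying a running NeuralChat server via the OpenAI client
# (pip install openai). The base_url and model name below are assumptions;
# use the host/port and model configured in your neuralchat.yaml.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Intel/neural-chat-7b-v3-1",
    messages=[{"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}],
)
print(response.choices[0].message.content)
```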
#### Offline
```python
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```
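To use a specific model instead of the default, the chatbot can also be built from a pipeline configuration; below is a minimal sketch, where the model name is only an illustrative choice.
```python
# Sketch of building the chatbot with an explicit model (illustrative).
# PipelineConfig selects the model; swap in the model you actually want to serve.
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot

config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-1")
chatbot = build_chatbot(config)
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```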
### Transformers-based extension APIs
Below is the sample code to use the extended Transformers APIs. See more [examples](https://github.com/intel/neural-speed/tree/main).
#### INT4 Inference (CPU)
We encourage you to install [Neural Speed](https://github.com/intel/neural-speed) to get the latest features (e.g., GGUF support) for LLM low-bit inference on CPUs. You may also want to use v1.3 without Neural Speed by following this [document](https://github.com/intel/intel-extension-for-transformers/tree/v1.3/intel_extension_for_transformers/llm/runtime/graph/README.md).
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
```
You can also load GGUF-format models from Hugging Face; only the Q4_0, Q5_0, and Q8_0 GGUF formats are supported for now.
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
# Specify the GGUF repo on the Hugging Face Hub
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
gguf_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, gguf_file=gguf_file)
outputs = model.generate(inputs)
```
You can also load a PyTorch model from ModelScope.
> **Note**: requires the `modelscope` package
```python
from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B" # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
You can also load low-bit models quantized with the GPTQ/AWQ/RTN/AutoRound algorithms.
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, GPTQConfig
# Hugging Face GPTQ/AWQ model, or a local quantized model path
model_name = "MODEL_NAME_OR_PATH"
prompt = "Once upon a time, a little girl"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
outputs = model.generate(inputs)
```
#### INT4 Inference (GPU)
```python
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer
import torch
device_map = "xpu"
model_name ="Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map=device_map, load_in_4bit=True
)
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, quantization_config=True, device=device_map)
output = model.generate(inputs)
```
> Note: Please refer to the [example](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu) and [script](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py) for more details.
### LangChain-based extension APIs
Below is the sample code to use the extended LangChain APIs. See more [examples](intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md).
```python
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
```
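As a usage sketch continuing the snippet above (it assumes the `retrievalQA` chain has been constructed with your own LLM and vector store), a question can then be asked through the chain:
```python
# Usage sketch: query the RAG chain built above (assumes `retrievalQA` exists).
# Depending on your LangChain version, retrievalQA.invoke({"query": question})
# may be preferred over retrievalQA.run(question).
question = "What is Intel Extension for Transformers?"
answer = retrievalQA.run(question)
print(answer)
```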
## 🎯Validated Models
You can find the validated models, accuracy, and performance data in the [Release data](./docs/release_data.md) or the [Medium blog](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176).
## 📖Documentation
## 🙌Demo
* LLM Infinite Inference (up to 4M tokens)
https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b
* LLM QLoRA on Client CPU
https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31
## 📃Selected Publications/Events
* Blog published on Huggingface: [Building Cost-Efficient Enterprise RAG applications with Intel Gaudi 2 and Intel Xeon](https://huggingface.co/blog/cost-efficient-rag-applications-with-intel) (May 2024)
* Blog published on Intel Developer News: [Efficient Natural Language Embedding Models with Intel® Extension for Transformers](https://www.intel.com/content/www/us/en/developer/articles/technical/efficient-natural-language-embedding-models.html) (May 2024)
* Blog published on Techcrunch: [Intel and others commit to building open generative AI tools for the enterprise](https://techcrunch.com/2024/04/16/intel-and-others-commit-to-building-open-generative-ai-tools-for-the-enterprise) (Apr 2024)
* Video on YouTube: [Intel Vision Keynotes 2024](https://www.youtube.com/watch?v=QB7FoIpx8os&t=2280s) (Apr 2024)
* Blog published on Vectara: [Do Smaller Models Hallucinate More?](https://vectara.com/blog/do-smaller-models-hallucinate-more) (Apr 2024)
* Blog of Intel Developer News: [Use the neural-chat-7b Model for Advanced Fraud Detection: An AI-Driven Approach in Cybersecurity](https://www.intel.com/content/www/us/en/developer/articles/technical/bilics-approach-cybersecurity-using-neuralchat-7b.html) (March 2024)
* CES 2024: [CES 2024 Great Minds Keynote: Bringing the Limitless Potential of AI Everywhere: Intel Hybrid Copilot demo](https://youtu.be/70J3uO3eLZA?t=1348) (Jan 2024)
* Blog published on Medium: [Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace Open AI Function Calling](https://medium.com/11tensors/connect-an-ai-agent-with-your-api-intel-neural-chat-7b-llm-can-replace-open-ai-function-calling-242d771e7c79) (Dec 2023)
* NeurIPS'2023 on Efficient Natural Language and Speech Processing: [Efficient LLM Inference on CPUs](https://arxiv.org/abs/2311.00502) (Nov 2023)
* Blog published on Hugging Face: [Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance](https://huggingface.co/blog/Andyrasika/neural-chat-intel) (Nov 2023)
* Blog published on VMware: [AI without GPUs: A Technical Brief for VMware Private AI with Intel](https://core.vmware.com/resource/ai-without-gpus-technical-brief-vmware-private-ai-intel#section6) (Nov 2023)
> View [Full Publication List](./docs/publication.md)
## Additional Content
* [Release Information](./docs/release.md)
* [Contribution Guidelines](./docs/contributions.md)
* [Legal Information](./docs/legal.md)
* [Security Policy](SECURITY.md)
* [Apache License](./LICENSE)
## Acknowledgements
* Excellent open-source projects: [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [FastChat](https://github.com/lm-sys/FastChat), [fastRAG](https://github.com/IntelLabs/fastRAG), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [llama.cpp](https://github.com/ggerganov/llama.cpp), [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [peft](https://github.com/huggingface/peft), [trl](https://github.com/huggingface/trl), [streamingllm](https://github.com/mit-han-lab/streaming-llm) and many others.
* Thanks to all the [contributors](./docs/contributors.md).
## 💁Collaborations
You are welcome to raise interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to [us](mailto:itrex.maintainers@intel.com); we look forward to collaborating with you on Intel Extension for Transformers!