vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC] Speedup vLLM inference with Intel® Extension for PyTorch* #2526

Open liangan1 opened 8 months ago

liangan1 commented 8 months ago

Motivation

In the current technological landscape, Generative AI (GenAI) workloads and models have gained widespread attention and popularity. Large Language Models (LLMs) have emerged as the dominant models driving these GenAI applications. Most LLMs are GPT-like architectures that consist of multiple decoder layers, and MultiHeadAttention and FeedForward are two key components of every decoder layer. The generation task is memory bound because of the iterative decoding, and the kv_cache requires special management to reduce memory overheads. Intel® Extension for PyTorch* provides many optimizations specific to these LLMs. At the operator level, the extension provides highly efficient GEMM kernels to speed up Linear layers and customized operators to reduce the memory footprint. To better trade off performance and accuracy, different low-precision solutions, e.g., SmoothQuant and weight-only quantization, are also enabled. Besides, tensor parallelism can be adopted to get lower latency for LLMs, and we enable shared-memory-based all-reduce to reduce the latency of the all-reduce itself.

We have already integrated Intel® Extension for PyTorch* into Hugging Face (RFC #PR17138), and users can easily get a performance gain by passing the "--ipex" parameter in the launcher script. Similarly, we also want to integrate it into vLLM to speed up LLM inference on Intel platforms. ([More information about Intel® Extension for PyTorch*](https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm))

LLM Features In Intel Extension for PyTorch*

In this part, we introduce the LLM-related features in Intel® Extension for PyTorch*. These operators are highly optimized on Intel platforms and can be reused by vLLM.

Linear Operator Optimization

The Linear operator is the most obvious hotspot in LLM inference. There are three backends to speed up linear GEMM kernels in Intel® Extension for PyTorch*: oneDNN, Tensor Processing Primitives (TPP, which is also used by the Fast BERT feature), and customized linear kernels for weight-only quantization. All of them use specific block formats to utilize hardware resources in a highly efficient way.
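
As a rough illustration of how these optimized Linear backends are engaged (a minimal sketch, not the vLLM integration itself; the toy model and shapes below are placeholders):

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex  # assumes IPEX is installed

# Toy stand-in for a decoder MLP block; a real workload would pass the LLM itself.
model = nn.Sequential(
    nn.Linear(4096, 11008),
    nn.GELU(),
    nn.Linear(11008, 4096),
).eval()

# ipex.optimize replaces eligible modules (e.g. nn.Linear) with optimized
# implementations and prepacks weights into the blocked layout expected by
# the oneDNN/TPP GEMM kernels.
model = ipex.optimize(model, dtype=torch.bfloat16, inplace=True)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model(torch.randn(1, 4096))
```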

Low Precision Data Types

While Generative AI (GenAI) workloads and models are getting more and more popular, the LLMs used in these workloads have more and more parameters. The increasing size of LLMs improves workload accuracy; however, it also leads to significantly heavier computation and places higher requirements on the underlying hardware. Given that, quantization becomes an increasingly important methodology for inference workloads.

Quantization with shorter data types improves memory I/O throughput and reduces the amount of computation on the CPU. Moreover, shorter data types make it possible to keep more data in the CPU cache, thus reducing the number of memory accesses; compared with cache access, memory access is much more time consuming. From the computation perspective, the AVX-512 Vector Neural Network Instructions (VNNI) instruction set, shipped with the 2nd Generation Intel® Xeon® Scalable Processors and newer, as well as the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, shipped with the 4th Generation Intel® Xeon® Scalable Processors, provide instruction-level acceleration for INT8 computation.

Besides the mixed-precision and native INT8 quantization solutions, e.g., post-training static quantization and dynamic quantization in PyTorch, SmoothQuant and weight-only quantization (both INT8 and INT4 weights are supported) are also enabled in Intel® Extension for PyTorch* to get better accuracy and performance compared with the native solutions.
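
For reference, the native dynamic quantization path mentioned above looks roughly like the following minimal PyTorch example (the toy module is a placeholder, not the IPEX SmoothQuant/WOQ path):

```python
import torch
import torch.nn as nn

# Native PyTorch dynamic quantization: Linear weights are converted to INT8
# offline, activations are quantized on the fly at inference time.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).eval()
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = qmodel(torch.randn(1, 4096))
```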

Intel® Extension for PyTorch* speeds up INT8 computation by leveraging oneDNN and oneDNN Graph as the backend. Intel® Extension for PyTorch* static quantization provides a default recipe to automatically decide which operators to quantize. Its backend, oneDNN Graph, brings matrix-multiplication-based fusions for commonly seen operator patterns and other common fusions like quantization + data type casting. These fusions help achieve the best computation cache locality and efficiency, and thus reduce INT8 quantization overhead significantly.

Intel® Extension for PyTorch* also delivers INT4 optimizations via 4-bit weight-only quantization (WOQ). As the name indicates, WOQ quantizes only the weights to 4-bit integers to further improve computation efficiency by saving memory bandwidth. This technique reduces text generation latency, especially from the second token onward. AMX INT8 instructions and fusions are also applied for these performant computations.
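
To make the WOQ idea concrete, below is a purely illustrative sketch of symmetric per-group INT4 quantization with on-the-fly dequantization; the real IPEX kernels fuse the dequantization into the GEMM instead of materializing the full-precision weight as done here, and the group size and shapes are arbitrary.

```python
import torch

def quantize_woq_int4(weight: torch.Tensor, group_size: int = 64):
    """Symmetric per-group 4-bit quantization of a 2-D weight (sketch only)."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    scale = w.abs().amax(dim=-1, keepdim=True) / 7.0          # int4 range is [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequant_matmul(x, q, scale):
    # Dequantize on the fly; optimized kernels fuse this step into the GEMM.
    w = (q.float() * scale).reshape(q.shape[0], -1)
    return x @ w.t()

w = torch.randn(4096, 4096)
q, s = quantize_woq_int4(w)
x = torch.randn(1, 4096)
print((dequant_matmul(x, q, s) - x @ w.t()).abs().max())  # quantization error
```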

Indirect Access KV Cache

The kv_cache is used to reduce computation for the decoder layers, but it also brings memory overheads. For example, when we use beam search, the kv_cache should be reordered according to the latest beam indices, and the current key/value should be concatenated with the kv_cache in the attention layer to get the entire context for scaled dot-product attention. When the sequence is very long, the memory overheads caused by reorder_cache and concat become the performance bottleneck. Indirect Access KV Cache (IAKV) is provided to reduce these overheads. First, IAKV pre-allocates buffers (key and value use different buffers) to store all key/value hidden states and the beam index information; the data format is shown in the left figure below (beam_width=4 in this case), and the token states of the key (value) at every timestep are stored in this pre-allocated buffer. Second, the beam index history, shown in the right figure below, decides which beam should be used at each timestep, and this information generates an offset into the kv_cache buffer, which eliminates the reorder_cache and concat overheads.
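
A much-simplified sketch of the indirect-access idea (single head, key only, and not the actual IPEX kernel): the buffer is written once per timestep, and the beam index history is walked to locate each beam's states without any reordering or concatenation.

```python
import torch

max_seq, beam_width, head_dim = 16, 4, 8
key_buffer = torch.zeros(max_seq, beam_width, head_dim)               # pre-allocated once
beam_idx_history = torch.zeros(max_seq, beam_width, dtype=torch.long)

def append_step(t, new_key, beam_idx):
    # Write the current timestep's key states and the parent-beam indices;
    # nothing already stored in the buffer is moved or copied.
    key_buffer[t] = new_key                  # (beam_width, head_dim)
    beam_idx_history[t] = beam_idx           # (beam_width,)

def gather_keys(t, beam):
    # Walk the beam index history backwards to find, for each timestep,
    # which physical slot holds this beam's key state.
    keys, b = [], beam
    for step in range(t, -1, -1):
        keys.append(key_buffer[step, b])
        b = int(beam_idx_history[step, b])   # parent beam at the previous step
    return torch.stack(keys[::-1])           # (t + 1, head_dim)
```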

Paged Attention

We follow the API of vLLM to enable the paged attention kernel in IPEX and use the layout (num_blocks, block_size, num_heads, head_size) for the key/value cache. The details of these two APIs are as follows:

reshape_and_cache

```python
torch.ops.torch_ipex.reshape_and_cache(key, value, key_cache, value_cache, slot_mapping)
```
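
For clarity, a naive PyTorch reference of what this operator does semantically, assuming the (num_blocks, block_size, num_heads, head_size) layout above and a flat slot_mapping of block_id * block_size + block_offset:

```python
import torch

def reshape_and_cache_ref(key, value, key_cache, value_cache, slot_mapping):
    # key, value:              (num_tokens, num_heads, head_size)
    # key_cache, value_cache:  (num_blocks, block_size, num_heads, head_size)
    # slot_mapping:            (num_tokens,) flat slot index per token
    block_size = key_cache.shape[1]
    block_ids = slot_mapping // block_size
    offsets = slot_mapping % block_size
    key_cache[block_ids, offsets] = key
    value_cache[block_ids, offsets] = value
```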

single_query_cached_kv_attention

```python
torch.ops.torch_ipex.single_query_cached_kv_attention(
    out,
    query,
    key_cache,
    value_cache,
    head_mapping,
    scale,
    block_tables,
    context_lens,
    block_size,
    max_context_len,
    alibi_slopes,
)
```
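
Likewise, a naive (unoptimized) reference of the decode-step semantics, ignoring head_mapping and alibi_slopes for brevity:

```python
import torch

def single_query_cached_kv_attention_ref(query, key_cache, value_cache,
                                         block_tables, context_lens, scale):
    # query:                    (num_seqs, num_heads, head_size)
    # key_cache / value_cache:  (num_blocks, block_size, num_heads, head_size)
    # block_tables:             (num_seqs, max_blocks_per_seq) physical block ids
    # context_lens:             (num_seqs,) number of cached tokens per sequence
    num_seqs, num_heads, head_size = query.shape
    block_size = key_cache.shape[1]
    out = torch.empty_like(query)
    for i in range(num_seqs):
        ctx = int(context_lens[i])
        num_blocks = (ctx + block_size - 1) // block_size
        blocks = block_tables[i, :num_blocks]
        # Gather this sequence's K/V from its (possibly non-contiguous) blocks.
        k = key_cache[blocks].reshape(-1, num_heads, head_size)[:ctx]
        v = value_cache[blocks].reshape(-1, num_heads, head_size)[:ctx]
        scores = torch.einsum("hd,thd->ht", query[i], k) * scale
        probs = scores.softmax(dim=-1)
        out[i] = torch.einsum("ht,thd->hd", probs, v)
    return out
```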

Graph Optimization

Operator fusion is generally used to fuse sub-graphs and reduce the memory footprint. Besides linear post-op fusions, e.g., linear + activation function, many customized operators are also provided in Intel® Extension for PyTorch* for further performance improvement, for example Rotary Position Embedding (RoPE) and Root Mean Square Layer Normalization (RMSNorm).
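
For example, an eager-mode RMSNorm decomposes into several small ops that the customized operator collapses into one fused kernel; a reference formulation for comparison:

```python
import torch

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Eager-mode reference: pow, mean, rsqrt and two multiplies dispatch as
    # separate kernels here, whereas a fused custom op does it in one pass.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```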

Distributed Inference

All of the above optimizations already help you get very good performance with a single instance. To further reduce inference latency and improve throughput, tensor parallelism is also enabled in our solution. You can first use DeepSpeed to auto-shard the model and then apply the above optimizations with the frontend API functions provided by Intel® Extension for PyTorch*.
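
A hedged sketch of that flow (argument names may vary across DeepSpeed versions, and the model name below is only an example):

```python
import torch
import deepspeed
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

# Load the model, let DeepSpeed shard it across ranks, then apply IPEX
# optimizations to the sharded module on each rank.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
).eval()

model = deepspeed.init_inference(
    model,
    mp_size=2,                         # 2-way tensor parallelism (illustrative)
    dtype=torch.bfloat16,
    replace_with_kernel_inject=False,  # keep modules eligible for IPEX optimization
).module

model = ipex.optimize(model, dtype=torch.bfloat16, inplace=True)
```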

Performance

This section shows the performance boost with Intel® Extension for PyTorch* on several popular topologies.

Performance Data for Intel® AI Data Center Products

Find the latest performance data for 4th gen Intel® Xeon® Scalable processors and 3rd gen Intel® Xeon® processors, including detailed hardware and software configurations, in the Intel® Developer Zone article.

We benchmarked LLaMA2 7B, LLaMA2 13B, and GPT-J 6B with the test input token length set to 256 and 1024 respectively. The tests were carried out on AWS M7i and M6i instances. The CPUs of M6i instances are 3rd Gen Intel® Xeon® Processors, which do not have AMX instructions for BF16 computation acceleration, so we use FP32 precision for benchmarking on M6i instances instead of BF16.

[Performance charts for the benchmarked models]

The LLM inference performance on M7i and M6i instances is compared based on the above results. M7i, with 4th Gen Xeon® processors, has a remarkable performance advantage over M6i with 3rd Gen Xeon® processors.

M7i performance boost ratio over M6i for non-quantized (BF16 or FP32) models:

| Model      | Speedup | Throughput |
|------------|---------|------------|
| LLaMA2 7B  | 2.47x   | 2.62x      |
| LLaMA2 13B | 2.57x   | 2.62x      |
| GPT-J 6B   | 2.58x   | 2.85x      |

M7i performance boost ratio over M6i for INT8 quantized models:

| Model      | Speedup | Throughput |
|------------|---------|------------|
| LLaMA2 7B  | 1.27x   | 1.38x      |
| LLaMA2 13B | 1.27x   | 1.27x      |
| GPT-J 6B   | 1.29x   | 1.36x      |

We can also conclude that a larger batch size improves the capacity of the model service at the cost of longer response latency for individual sessions. The following table shows that for the INT8 quantized LLaMA2-7B model on M7i instances, a batch size of 8 increases total throughput by 6.47x compared with batch size 1, while P90 token latency becomes 1.26x longer.

| Batch size | Decoder latency | Total tokens per sec |
|------------|-----------------|----------------------|
| 1          | 39              | 26.32                |
| 8          | 49              | 170.21               |
| Ratio      | 1.26x           | 6.47x                |
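
The ratios in the last row follow directly from the measured numbers:

```python
# Ratios reported above, derived from the measured values.
print(round(49 / 39, 2))         # 1.26 -> decoder latency ratio
print(round(170.21 / 26.32, 2))  # 6.47 -> total throughput ratio
```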

Note: Measured by Intel on 17th Aug 2023; M7i.16xLarge, M6i.16xLarge instances in US-west-2. OS-Ubuntu 22.04-lts, kernel 6.20.0-1009-aws, SW: PyTorch 2.1 and Intel® Extension for PyTorch 2.1/llm_feature_branch.

gserapio commented 8 months ago

@liangan1 I'm also interested in Intel support. Is there a timeline for integrating Intel's extension for PyTorch (IPEX) into vLLM? Is there currently a temporary workaround to build vLLM from source to support inference on Intel GPUs?

hmellor commented 5 months ago

@liangan1 do you intend to add IPEX support to vLLM now that x86 CPUs are supported by #3634?

jgong5 commented 5 months ago

@liangan1 do you intend to add IPEX support to vLLM now that x86 CPUs are supported by #3634?

Hi @hmellor The work is in progress.

jikunshang commented 5 months ago

@liangan1 I'm also interested in Intel support. Is there a timeline for integrating Intel's extension for PyTorch (IPEX) into vLLM? Is there currently a temporary workaround to build vLLM from source to support inference on Intel GPUs?

Hi @gserapio, for Intel GPU support, you can refer to PR #3814 and give it a try.

LetianLee commented 4 months ago

@liangan1 do you intend to add IPEX support to vLLM now that x86 CPUs are supported by #3634?

Hi @hmellor The work is in progress.

@jgong5 Has it been completed yet? Could you please let me know where to find the PR? Many thanks.

liangan1 commented 4 months ago

Please refer to this PR to integrate IPEX into vLLM: https://github.com/vllm-project/vllm/pull/4971