vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: OpenVINO vLLM backend #5377

Open ilya-lavrenov opened 3 weeks ago

ilya-lavrenov commented 3 weeks ago

Motivation.

OpenVINO is an open-source solution for inference of deep learning models, including LLMs. OpenVINO supports both Intel and ARM CPUs, Intel integrated and discrete GPUs, and NPUs, and has a good reputation as a production-ready solution for client and server scenarios. The idea is to create an OpenVINO backend for vLLM that will initially support x86 CPU as the primary device; other devices can be enabled later.

Thanks to the Optimum Intel Hugging Face extension (https://github.com/huggingface/optimum-intel), the OpenVINO vLLM backend can support a wide range of models, including those listed at https://docs.vllm.ai/en/stable/models/supported_models.html

OpenVINO provides better performance than the current vLLM CPU implementation, which will be demonstrated in the integration PR. Also, the OpenVINO implementation of the Paged Attention operation supports modern vLLM features such as chunked prefill and prefix caching.

Proposed Change.

Introduce an OpenVINO vLLM backend, which:

Feedback Period.

No response

CC List.

@WoosukKwon @zhuohan123 @Yard1

Any Other Things.

OpenVINO has a wide list of customers awaiting integration of the OpenVINO vLLM backend into the upstream vLLM repository.

robertgshaw2-neuralmagic commented 3 weeks ago

Super exciting initiative!

Is there any way you guys can support the existing (safetensors-based) weight formats that we have for the PyTorch-based backends at some point in the future?

This would go a very long way in enabling adoption for both development and production workflows if users did not have to deal with vendor-specific model formats. This is especially true for quantized weights, which would otherwise require users to pass through the OV-specific flows. We already have a huge ecosystem of models for:

slyalin commented 3 weeks ago

> This would go a very long way in enabling adoption for both development and production workflows if users did not have to deal with vendor-specific model formats.

Explicit preparation of an OpenVINO model is not a required step in the flow proposed in this RFC (and the corresponding PR). You can still just pass a model directory with the original PyTorch weights stored as safetensors. Internally we do convert these weights into our format, but that is more of an implementation detail that runs automatically, not something the user needs to constantly bother with. It does currently require some extra space to do the conversion, but we are working on improvements.

Depending on the production environment, it can be beneficial to pre-convert the model explicitly with optimum-cli to skip the PyTorch-to-OpenVINO conversion at every vLLM startup and, optionally, to apply weight quantization. In that case the OpenVINO model explicitly appears as an artifact visible to the user.
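
To make the two flows concrete, here is a minimal sketch of how they could look from the user side. It assumes the OpenVINO backend from the accompanying PR is installed and selected at build/run time as described there; the model id and output path are purely illustrative examples, not part of the RFC.

```python
# Flow 1: pass the original Hugging Face model directory (or Hub id) with
# safetensors weights; the PyTorch-to-OpenVINO conversion runs automatically
# at startup.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # example model id
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)

# Flow 2: pre-convert once with optimum-cli to skip the conversion on every
# startup and optionally quantize the weights, e.g.:
#   optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf \
#       --weight-format int8 ./llama-2-7b-ov
# then point vLLM at the exported directory:
llm_ov = LLM(model="./llama-2-7b-ov")
```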

Concerning models with quantized weights: we support GPTQ models (to some extent) out of the box. You can just pass a GPTQ model and it will use the original quantized weights as-is (still converting them to our internal layout, but it is the same int4 content). Our goal is to continue extending this support to other formats, but we are not yet at the point of supporting everything.
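
For the GPTQ case the entry point is the same; a small illustrative sketch (the checkpoint id is just an example):

```python
from vllm import LLM

# A GPTQ checkpoint is passed unchanged; the int4 weights are reused as-is,
# only repacked into OpenVINO's internal layout at load time.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ")
```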

If you are talking about physically reusing the original safetensors files from disk without spending any additional resources on converting them to an OpenVINO model, that currently does not work. We can consider it as an interesting option for future development.

robertgshaw2-neuralmagic commented 3 weeks ago

Sounds good - thanks for the explanation, this makes sense 100%! It's been awesome to see how OV has been adopting the standard model formats.

We have a good plan in place for supporting activation quantization in vLLM (which includes a weight representation in safetensors). I know how useful this is for CPU prefill performance with AMX and VNNI, so I would be happy to discuss it further!