Open · bigPYJ1151 opened this issue 8 months ago
Thanks for your excellent work! Looking forward to support for inference on ARM CPUs, and also for Ray distributed computing.
Could you give me a CPU inference example? I tried:
# start the server
python3 -m vllm.entrypoints.openai.api_server \
--device cpu
# send a request
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "facebook/opt-125m",
"messages": [
{"role": "system", "content": "You are an intelligent British female writer and translator who is good at writing science fiction using multiple languages. You won a Nobel price in literature five years ago."},
{"role": "user", "content": "Please detailedly tell a story about an exciting aerospace expedition for a Chinese boy Lam and his German dog. They are sent to aerospace by mistake and strive to wait for rescue from motherland with no water and food supply for over a month. They are almost caught by aliens disguised as his mother. Moreover, please translate the above story to Chinese, German, French, Portuguese and Japanese respectively."}
], "temperature": 0
}'
But I got an error. Are there any engine arguments that need to be added here?
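(For debugging, a minimal client-side sketch of the same request using Python's requests library, which prints the HTTP status code and the error body returned by the server; the endpoint and model name are taken from the curl example above.)

```python
# Minimal sketch: send the same chat request with Python's requests library and
# print the status code plus the response body, so the server's error is visible.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "facebook/opt-125m",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0,
    },
)
print(resp.status_code)
print(resp.json())
```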
@bigPYJ1151 Are you planning to support AVX/AVX2 to enable a broader range of Intel/x86 CPUs?
Hi @mgiessing, it is not in our plan right now, but we may add it after the basic features are finished.
Could you help with https://github.com/vllm-project/vllm/pull/4415?
I was trying to compile it with the Intel compiler, but I had some issues; I think I almost have it working.
Hi @bigPYJ1151,
I'd like to ask why the initial CPU support defines device specific vector types in https://github.com/vllm-project/vllm/blob/main/csrc/cpu/cpu_types_x86.hpp?
PyTorch contains a vector type, Vectorized, that appears to serve the same purpose while also being architecture-agnostic. Could the custom ops for CPU switch to using this PyTorch type to make the CPU backend architecture-agnostic (i.e., PowerPC, AArch64, etc.)?
Hi @hmellor
Yes, PyTorch contains such vector structures, and it is feasible to use them in the CPU backend. I wasn't aware of them before, so I defined the custom types🤣. vLLM is adopting torch.compile and some custom ops will be generated by the JIT, so the number of custom types will be very limited after we clean them up. Then we can try to replace them with the PyTorch vector types.
That's great to hear! Is it just #7110 that we're waiting for, or are there other PRs?
Yes, after #7110 I think we can do some code refactoring.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
@bigPYJ1151 is this something you'd want to work on?
Progress
#3634
#3824
#4113
#4971
#5452
#5446
#6008
#6125
#5492
#7257
Features
The CPU executor plans to support the following features:
Design
Our target is to seamlessly port vLLM to CPU devices and share most of vLLM's core components (e.g., scheduler, cache management, model definitions, Megatron-style model partitioning, ...).
The CPU executor will depend on PyTorch CPU and leverage optimized kernels and features from intel-extension-for-pytorch.
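For illustration only, a small sketch of what leveraging intel-extension-for-pytorch can look like for a plain PyTorch module; ipex.optimize is the extension's entry point, but this snippet is not vLLM's actual integration.

```python
# Illustrative sketch: apply intel-extension-for-pytorch optimizations when the
# package is available, otherwise fall back to stock PyTorch CPU kernels.
# This is not vLLM's actual integration code.
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
try:
    import intel_extension_for_pytorch as ipex
    model = ipex.optimize(model)
except ImportError:
    pass

with torch.no_grad():
    out = model(torch.randn(1, 16))
print(out.shape)
```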
The main changes to vLLM include:
Torch APIs Adaptation

The CPU device is supported in PyTorch by default, which allows the CPU executor to share the same model definitions with the GPU executor. Thanks to recent code refactors, many hardcoded cuda device flags have been removed, and Torch APIs are now dispatched based on the device flag from DeviceConfig. For the CPU executor, a new cpu device flag is added. Sharing the same model definitions and Torch APIs also allows the CPU executor to easily support new models and features in vLLM (e.g., torch.compile).
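As a plain-PyTorch illustration of this kind of dispatch (the make_device helper below is hypothetical; in vLLM the device string would come from DeviceConfig):

```python
# Hypothetical sketch: create tensors and modules on a configured device
# instead of a hardcoded "cuda"; in vLLM the string would come from DeviceConfig.
import torch

def make_device(device_type: str = "cpu") -> torch.device:
    return torch.device(device_type)

device = make_device("cpu")
x = torch.randn(4, 8, device=device)
layer = torch.nn.Linear(8, 8).to(device)
print(layer(x).device)  # cpu
```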
Custom Ops Adaptation

vLLM implements many efficient CUDA kernels, packaged as the _C library via pybind. These kernels are ported to CPU using C++ and OpenMP, with the same function signatures, so they can replace the CUDA kernels directly. The CPU custom kernel build procedure is integrated into the vLLM CMake build system as a CMake module. Currently, all of the CPU kernels require AVX512 ISA support.
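A quick, Linux-only way to check whether a host meets the AVX512 requirement (a simple sketch reading /proc/cpuinfo; not part of vLLM itself):

```python
# Simple sketch: detect the AVX-512 Foundation flag on Linux by reading /proc/cpuinfo.
# Not part of vLLM; other platforms would need a different check.
def has_avx512f() -> bool:
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512f" in f.read()
    except OSError:
        return False

if __name__ == "__main__":
    print("AVX512F supported:", has_avx512f())
```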
Python APIs Adaptation

New CPUExecutor and CPUWorker classes are added to initialize the environment and the model runner. The CPUModelRunner is derived from the ModelRunner of the GPU code path, because most of the code can be shared. Even though this carries potential risks from changes in the GPU code path, CPUModelRunner can absorb them easily by rewriting configurations or overloading member functions. In particular, unlike the GPU executor, which profiles the available KV cache memory, the cache memory in the CPU executor is specified by the swap_space parameter, because CPU memory management is more complex than GPU memory management (e.g., NUMA).
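For reference, a minimal sketch of how the CPU KV cache size could be specified through swap_space when constructing an engine; the device and swap_space keyword arguments here are assumptions and may differ between vLLM versions.

```python
# Minimal sketch (assumptions: a CPU build of vLLM that accepts the "device" and
# "swap_space" engine arguments; swap_space is measured in GiB of CPU memory).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    device="cpu",     # select the CPU executor
    swap_space=4,     # size of the CPU cache space, in GiB
)
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```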