valer1435 / RepoPilot

Open-source AI-powered tool for smart repository maintenance

How to install VLLM with CPU? #33

Open valer1435 opened 1 week ago

valer1435 commented 1 week ago

How to install VLLM with CPU? @RepoPilotAssistant

RepoPilotAssistant commented 1 week ago

To install VLLM with CPU, you can follow these steps:

  1. Quick Start Using Dockerfile:

    • Build the Docker image:
      $ docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
    • Run the Docker container:
      $ docker run -it \
                --rm \
                --network=host \
                --cpuset-cpus=<cpu-id-list, optional> \
                --cpuset-mems=<memory-node, optional> \
                vllm-cpu-env
  2. Build from Source:

    • Install the recommended compiler (e.g., gcc/g++ >= 12.3.0). For example, on Ubuntu 22.04:
      $ sudo apt-get update  -y
      $ sudo apt-get install -y gcc-12 g++-12 libnuma-dev
      $ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
    • Install Python packages for building the vLLM CPU backend:
      $ pip install --upgrade pip
      $ pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
      $ pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
    • Build and install the oneDNN library from source:
      $ git clone -b rls-v3.5 https://github.com/oneapi-src/oneDNN.git
      $ cmake -B ./oneDNN/build -S ./oneDNN -G Ninja -DONEDNN_LIBRARY_TYPE=STATIC \
          -DONEDNN_BUILD_DOC=OFF \
          -DONEDNN_BUILD_EXAMPLES=OFF \
          -DONEDNN_BUILD_TESTS=OFF \
          -DONEDNN_BUILD_GRAPH=OFF \
          -DONEDNN_ENABLE_WORKLOAD=INFERENCE \
          -DONEDNN_ENABLE_PRIMITIVE=MATMUL
      $ cmake --build ./oneDNN/build --target install --config Release
    • Finally, build and install the vLLM CPU backend:
      $ VLLM_TARGET_DEVICE=cpu python setup.py install

These steps will install and run vLLM on an x86 CPU platform, with support for the FP32 and BF16 data types.
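
As a quick sanity check, you can run a short offline generation against the freshly built CPU backend. This is a minimal sketch; the model name and sampling settings below are illustrative choices, not taken from this thread:

    # smoke_test.py - verify the vLLM CPU build works end to end
    from vllm import LLM, SamplingParams

    # Any small Hugging Face model will do; facebook/opt-125m is a lightweight choice.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, max_tokens=32)

    outputs = llm.generate(["Hello, my name is"], params)
    for out in outputs:
        print(out.outputs[0].text)

If the build succeeded, this prints a short completion without touching a GPU.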

Please follow links below:

  1. Installation with CPU (installation_cpu)
  2. Related runtime environment variables
  3. Build from source
  4. vLLM Engine
valer1435 commented 1 week ago

How is tensor parallelization implemented in the code? @RepoPilotAssistant

RepoPilotAssistant commented 1 week ago

Tensor parallelization in the code involves sending and receiving tensor dictionaries across different ranks in a distributed environment. It handles both CPU and GPU tensors, using different groups for communication. For CPU tensors, it uses the metadata_group, while for GPU tensors, it uses the group. The implementation also supports a send-allgather mechanism where only a slice of the tensor is sent initially, followed by an allgather operation. Additionally, it includes a barrier synchronization method that uses the CPU group to avoid issues with GPU tensors.
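
To make that concrete, here is a heavily simplified sketch of the pattern described above. It is not vLLM's actual implementation; the function and parameter names are illustrative:

    # Sketch: route CPU tensors over a gloo group, GPU tensors over a NCCL group.
    import torch.distributed as dist

    def send_tensor_dict(tensors: dict, dst: int, group, metadata_group):
        # 1. Send metadata (keys, dtypes, shapes, device) over the CPU group so
        #    the receiver can allocate buffers before the bulk data arrives.
        metadata = [(k, v.dtype, v.shape, v.is_cuda) for k, v in tensors.items()]
        dist.send_object_list([metadata], dst=dst, group=metadata_group)

        # 2. Send each tensor over the group that matches its device.
        for v in tensors.values():
            if v.is_cuda:
                dist.send(v, dst=dst, group=group)           # GPU tensor -> device group
            else:
                dist.send(v, dst=dst, group=metadata_group)  # CPU tensor -> gloo group

    def barrier(cpu_group):
        # Synchronize over the CPU group so the barrier never involves GPU tensors.
        dist.barrier(group=cpu_group)

Sending the metadata first lets the receiving rank allocate correctly shaped buffers before the tensors themselves arrive, which is why the CPU and GPU payloads can travel over separate groups.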

Please follow links below:

  1. vllm/distributed/parallel_state.py
  2. vllm/distributed/communication_op.py
valer1435 commented 1 week ago

What will happen if I set tensor parallel size equal to 3? @RepoPilotAssistant