valer1435 / RepoPilot

Open-source AI-powered tool for smart repository maintenance

How to install VLLM with CPU? #33

Open valer1435 opened 1 week ago

valer1435 commented 1 week ago

How to install VLLM with CPU? @RepoPilotAssistant

RepoPilotAssistant commented 1 week ago

To install VLLM with CPU, you can follow these steps:

  1. Quick Start Using Dockerfile:

    • Build the Docker image:
      $ docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
    • Run the Docker container:
      $ docker run -it \
                --rm \
                --network=host \
                --cpuset-cpus=<cpu-id-list, optional> \
                --cpuset-mems=<memory-node, optional> \
                vllm-cpu-env
  2. Build from Source:

    • Install the recommended compiler (e.g., gcc/g++ >= 12.3.0). For example, on Ubuntu 22.04:
      $ sudo apt-get update  -y
      $ sudo apt-get install -y gcc-12 g++-12 libnuma-dev
      $ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
    • Install Python packages for building the vLLM CPU backend:
      $ pip install --upgrade pip
      $ pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
      $ pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
    • Build and install the oneDNN library from source:
      $ git clone -b rls-v3.5 https://github.com/oneapi-src/oneDNN.git
      $ cmake -B ./oneDNN/build -S ./oneDNN -G Ninja -DONEDNN_LIBRARY_TYPE=STATIC \
          -DONEDNN_BUILD_DOC=OFF \
          -DONEDNN_BUILD_EXAMPLES=OFF \
          -DONEDNN_BUILD_TESTS=OFF \
          -DONEDNN_BUILD_GRAPH=OFF \
          -DONEDNN_ENABLE_WORKLOAD=INFERENCE \
          -DONEDNN_ENABLE_PRIMITIVE=MATMUL
      $ cmake --build ./oneDNN/build --target install --config Release
    • Finally, build and install the vLLM CPU backend:
      $ VLLM_TARGET_DEVICE=cpu python setup.py install

These steps will install and run vLLM on an x86 CPU platform, with support for the FP32 and BF16 data types.
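
As a quick sanity check, you can run a short offline generation against the freshly built CPU backend. This is a minimal sketch; the model name and sampling settings below are illustrative choices, not taken from this thread:

    # smoke_test.py - verify the vLLM CPU build works end to end
    from vllm import LLM, SamplingParams

    # Any small Hugging Face model will do; facebook/opt-125m is a lightweight choice.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, max_tokens=32)

    outputs = llm.generate(["Hello, my name is"], params)
    for out in outputs:
        print(out.outputs[0].text)

If the build succeeded, this prints a short completion without touching a GPU.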

Please follow links below:

  1. Installation with CPU (installation_cpu)
  2. Related runtime environment variables
  3. Build from source
  4. vLLM Engine
valer1435 commented 1 week ago

How is tensor parallelization implemented in the code? @RepoPilotAssistant

RepoPilotAssistant commented 1 week ago

Tensor parallelization in the code involves sending and receiving tensor dictionaries across different ranks in a distributed environment. It handles both CPU and GPU tensors, using different groups for communication. For CPU tensors, it uses the metadata_group, while for GPU tensors, it uses the group. The implementation also supports a send-allgather mechanism where only a slice of the tensor is sent initially, followed by an allgather operation. Additionally, it includes a barrier synchronization method that uses the CPU group to avoid issues with GPU tensors.
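
To make that concrete, here is a heavily simplified sketch of the pattern described above. It is not vLLM's actual implementation; the function and parameter names are illustrative:

    # Sketch: route CPU tensors over a gloo group, GPU tensors over a NCCL group.
    import torch.distributed as dist

    def send_tensor_dict(tensors: dict, dst: int, group, metadata_group):
        # 1. Send metadata (keys, dtypes, shapes, device) over the CPU group so
        #    the receiver can allocate buffers before the bulk data arrives.
        metadata = [(k, v.dtype, v.shape, v.is_cuda) for k, v in tensors.items()]
        dist.send_object_list([metadata], dst=dst, group=metadata_group)

        # 2. Send each tensor over the group that matches its device.
        for v in tensors.values():
            if v.is_cuda:
                dist.send(v, dst=dst, group=group)           # GPU tensor -> device group
            else:
                dist.send(v, dst=dst, group=metadata_group)  # CPU tensor -> gloo group

    def barrier(cpu_group):
        # Synchronize over the CPU group so the barrier never involves GPU tensors.
        dist.barrier(group=cpu_group)

Sending the metadata first lets the receiving rank allocate correctly shaped buffers before the tensors themselves arrive, which is why the CPU and GPU payloads can travel over separate groups.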

Please follow links below:

  1. vllm/distributed/parallel_state.py
  2. vllm/distributed/communication_op.py
valer1435 commented 1 week ago

What will happen if I set tensor parallel size equal to 3? @RepoPilotAssistant