vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
Apache License 2.0

[Bug]: VLLM usage on AWS Inferentia instances #5738

Open ashutoshsaboo opened 3 months ago

ashutoshsaboo commented 3 months ago

Your current environment

See below for the detailed setup and run scripts that I use.

🐛 Describe the bug

Hi, I'm trying to deploy llama-8b using vLLM on AWS Inferentia (inf2.8xlarge) instances. After a lot of hacks and tiring attempts, I've managed to get the vLLM server to start correctly. However, when I run inference, even for a "hi" input prompt, the console shows the warning below and the LLM returns nothing in the Gradio UI that I've set up. See the rest of this post for code details. I'd appreciate help with a fix! I'm using SkyPilot to deploy, in case it matters:

(task, pid=33413) INFO 06-21 09:15:21 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
(task, pid=33413) INFO: - "POST /v1/chat/completions HTTP/1.1" 200 OK
(task, pid=33413) INFO 06-21 09:15:27 async_llm_engine.py:582] Received request cmpl-410ee0fe3db44e05a79d0112fb3ec571: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a great ai assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhi<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.8, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[128009, 128001], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2025, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 2294, 16796, 18328, 13, 128009, 128006, 882, 128007, 271, 6151, 128009, 128006, 78191, 128007, 271], lora_request: None.
(task, pid=33413) WARNING 06-21 09:15:27 scheduler.py:683] Input prompt (23 tokens) is too long and exceeds the capacity of block_manager

Here's how I set up the vLLM-specific pieces on the instance:

  . /etc/os-release
  sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
  deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
  EOF
  wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -

  sudo apt-get update -y

  # Install OS headers
  sudo apt-get install linux-headers-$(uname -r) -y

  # Install git
  sudo apt-get install git -y

  # Install Neuron Driver
  sudo apt-get install aws-neuronx-dkms=2.* -y

  # Install Neuron Runtime
  sudo apt-get install aws-neuronx-collectives=2.* -y
  sudo apt-get install aws-neuronx-runtime-lib=2.* -y

  # Install Neuron Tools
  sudo apt-get install aws-neuronx-tools=2.* -y

  # Add PATH
  export PATH=/opt/aws/neuron/bin:$PATH

  # Install Python venv
  sudo apt-get install -y python3.10-venv g++

  # Create Python venv
  python3.10 -m venv aws_neuron_venv_pytorch

  # Activate Python venv
  source aws_neuron_venv_pytorch/bin/activate

  # Install Jupyter notebook kernel
  pip install ipykernel
  python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
  pip install jupyter notebook
  pip install environment_kernels

  # Set pip repository pointing to the Neuron repository
  python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

  # Install wget, awscli
  python -m pip install wget
  python -m pip install awscli

  # Update Neuron Compiler and Framework
  python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx

  # Install vLLM from source
  git clone https://github.com/vllm-project/vllm.git
  cd vllm
  pip install -U -r requirements-neuron.txt
  # Create an empty __init__.py file in the neuron directory
  touch ./vllm/model_executor/models/neuron/__init__.py
  pip install .

  # Install Gradio for web UI
  pip install gradio openai

And here's how I run the server:

  source aws_neuron_venv_pytorch/bin/activate
  echo 'Starting vllm api server...'
  export LD_LIBRARY_PATH="/opt/conda/lib/:$LD_LIBRARY_PATH"
  export PATH=/opt/aws/neuron/bin:$PATH

  # NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
  python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code \
    --max-num-seqs 1 \
    --device neuron \
    --max-model-len 2048 \
    2>&1 | tee api_server.log &

  # Poll the log until Uvicorn reports that the server is listening
  while ! grep -q 'Uvicorn running on' api_server.log; do
    echo 'Waiting for vllm api server to start...'
    sleep 5
  done

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1 \
    --stop-token-ids 128009,128001
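
The wait loop in the run script above polls forever if the server never starts, which can happen here when Neuron compilation fails. Below is a minimal Python sketch of the same readiness check with a timeout. The helper name `wait_for_marker` and the default timeout of 600 seconds are illustrative assumptions, not part of vLLM or SkyPilot.

```python
import time
from pathlib import Path


def wait_for_marker(log_path, marker="Uvicorn running on",
                    timeout_s=600.0, poll_s=5.0):
    """Return True once `marker` appears in the log file, False on timeout.

    Hypothetical helper: the marker string is the one the run script greps
    for; the timeout and poll interval are arbitrary defaults.
    """
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while True:
        # The log may not exist yet if the server has not written anything.
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_for_marker("api_server.log")` before launching the Gradio server would then either confirm that the API server is up or give up after the timeout instead of looping indefinitely.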

A few things came to mind immediately. One was the NEURON_RT_VISIBLE_CORES env var: I tried widening it from 0-1 to, say, 0-3, but then the vLLM server fails and doesn't even boot. This is on an inf2.8xlarge instance. As I understand it, each inf2 accelerator has 8 cores (and the 8xlarge has a single Inferentia accelerator), so ideally this should be 0-7, shouldn't it? Yet even smaller ranges don't work. I also tried increasing max-model-len to 4096, but the vLLM server doesn't boot with that either and fails with:

(task, pid=34615) performing partition vectorization on AG_2[[0, 1032, 0, 0, 0, 0]]{2 nodes (1 sources, 0 stops)}. dags covered: {dag_1036_TC_SRC, dag_1032}
(task, pid=34615) ..Waiting for vllm api server to start...
(task, pid=34615) root = /opt/conda/lib/python3.10/multiprocessing/process.py
(task, pid=34615) root = /opt/conda/lib/python3.10/multiprocessing
(task, pid=34615) root = /opt/conda/lib/python3.10
(task, pid=34615) root = /opt/conda/lib
(task, pid=34615) root = /opt/conda
(task, pid=34615) root = /opt
(task, pid=34615)
(task, pid=34615) 2024-06-21 09:33:40.000866:  38168  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']: 2024-06-21T09:33:40Z [PGT002] Too many instructions after unroll! - Compiling under --optlevel=1 may result in smaller graphs. If you are using a transformer model, try using a smaller context_length_estimate value.
(task, pid=34615)
(task, pid=34615) 2024-06-21 09:33:40.000866:  38168  ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb after 0 retries.
(task, pid=34615) 2024-06-21 09:33:40.000867:  38168  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
(task, pid=34615) Waiting for vllm api server to start...
(task, pid=34615) Compiler status PASS
(task, pid=34615) 2024-06-21 09:36:42.000494:  38167  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
(task, pid=34615) concurrent.futures.process._RemoteTraceback:
(task, pid=34615) """
(task, pid=34615) Traceback (most recent call last):
(task, pid=34615)   File "/opt/conda/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
(task, pid=34615)     r = call_item.fn(*call_item.args, **call_item.kwargs)
(task, pid=34615)   File "/home/ubuntu/sky_workdir/aws_neuron_venv_pytorch/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py", line 163, in call_neuron_compiler
(task, pid=34615)     raise subprocess.CalledProcessError(res.returncode, cmd, stderr=error_info)
(task, pid=34615) subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']' returned non-zero exit status 70.


Increasing --max-num-seqs to more than 1 also prevents the vLLM server from starting. Can someone please help me figure out what I'm missing here and how to fix this error? 🙏 I have tried numerous things, but sadly most of them fail on vLLM's side. 😦

Can someone please help with the above?
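
On the NEURON_RT_VISIBLE_CORES question above: AWS's published specifications list two NeuronCores per Inferentia2 accelerator, which would give an inf2.8xlarge two cores rather than eight, so a range such as 0-3 would name cores that do not exist. Running `neuron-ls` on the instance should confirm the actual count. The sketch below expands such a range and checks it against a core count supplied by the caller. Both helper names are hypothetical, and accepting comma-separated lists is an assumption about the variable's format.

```python
def parse_visible_cores(spec):
    """Expand a value such as "0-3" or "0,2" into a sorted list of core ids."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if last < first:
                raise ValueError(f"descending range: {part!r}")
            cores.update(range(first, last + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def check_visible_cores(spec, cores_on_instance):
    """Raise ValueError if `spec` names a core the instance does not have.

    `cores_on_instance` must come from the caller (for example, read off
    the output of neuron-ls); this sketch does not query the hardware.
    """
    requested = parse_visible_cores(spec)
    missing = [c for c in requested if c >= cores_on_instance]
    if missing:
        raise ValueError(
            f"cores {missing} requested, but only "
            f"{cores_on_instance} NeuronCores are available"
        )
    return requested
```

With a core count of 2, `check_visible_cores("0-1", 2)` passes while `check_visible_cores("0-3", 2)` raises, which would match the behaviour described in the post.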

youkaichao commented 3 months ago

cc @liangfu

ashutoshsaboo commented 3 months ago

@liangfu I'd appreciate it if you could help with the above issue!

mgoin commented 3 months ago

@aws-patlange could you please look into this?

aws-patlange commented 3 months ago

We currently don't support paged attention in the Neuron integration. You need to explicitly set block-size to the same value as max-model-len. See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html.

This will likely need some edits to vLLM so that the value can be passed through one of the API entrypoints it provides.
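
To illustrate why a 23-token prompt can be rejected, here is a rough sketch of the block arithmetic. It assumes vLLM's default block size of 16 tokens and assumes that, without paged attention, the Neuron worker makes only one block available per sequence slot (so a single block with --max-num-seqs 1). The second assumption is inferred from the warning in the report, not confirmed in this thread, and the real scheduler check is more involved.

```python
import math


def blocks_needed(num_tokens, block_size):
    """Number of fixed-size KV-cache blocks a prompt of num_tokens occupies."""
    return math.ceil(num_tokens / block_size)


def prompt_fits(num_tokens, block_size, num_blocks):
    """True if the prompt can be placed in the available blocks."""
    return blocks_needed(num_tokens, block_size) <= num_blocks


# Reported setup: 23-token prompt, assumed block size 16, assumed 1 block.
fits_with_default = prompt_fits(23, block_size=16, num_blocks=1)

# Suggested setup: block size equal to max-model-len (2048), still 1 block.
fits_with_large_block = prompt_fits(23, block_size=2048, num_blocks=1)
```

Under these assumptions the prompt needs two 16-token blocks but only one exists, so it is rejected. With a single 2048-token block, the same prompt fits.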

aws-patlange commented 3 months ago

Please try the following after editing the argument parser that is currently restricting --block-size to only some specific values:

  python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code \
    --max-num-seqs 1 \
    --device neuron \
    --max-model-len 2048 \
    --block-size 2048 \
    2>&1 | tee api_server.log &
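
As a sketch of the parser edit mentioned above: an argparse option declared with `choices` rejects any value outside the list, so 2048 has to be added to the list (or the restriction removed) before the command can work. The parser below is a stand-alone demo, not vLLM's code, and the allowed values 8, 16, and 32 are an assumption for illustration.

```python
import argparse


def build_parser(allowed_block_sizes=None):
    """Build a demo parser; None for allowed_block_sizes accepts any integer."""
    parser = argparse.ArgumentParser(prog="demo-server")
    parser.add_argument(
        "--block-size",
        type=int,
        default=16,
        choices=allowed_block_sizes,
    )
    return parser


# Restricted parser, standing in for the behaviour described in the thread.
strict = build_parser(allowed_block_sizes=[8, 16, 32])
# Relaxed parser with the extra value added to the allowed list.
relaxed = build_parser(allowed_block_sizes=[8, 16, 32, 2048])

try:
    strict.parse_args(["--block-size", "2048"])
    strict_accepts = True
except SystemExit:
    # argparse prints a usage error and exits on an invalid choice.
    strict_accepts = False

relaxed_value = relaxed.parse_args(["--block-size", "2048"]).block_size
```

The same kind of change would be made wherever vLLM declares its --block-size argument. The exact file is not named in this thread.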

minhtcai commented 3 months ago

@aws-patlange Hi, I used your command but I'm getting: TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker. Any pointers? Thanks!
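
For context on that TypeError: Python raises it when a class inherits an abstract method and never overrides it. The sketch below reproduces the mechanism with stand-in classes. `WorkerBase`, `IncompleteWorker`, and `CompleteWorker` are hypothetical names, not vLLM's actual classes. One possible cause, offered only as a guess, is a checkout where the base worker interface gained `execute_worker` before the Neuron worker implemented it.

```python
from abc import ABC, abstractmethod


class WorkerBase(ABC):
    """Stand-in for a base class that declares a required method."""

    @abstractmethod
    def execute_worker(self, request):
        ...


class IncompleteWorker(WorkerBase):
    """Subclass that never overrides execute_worker."""


class CompleteWorker(WorkerBase):
    """Subclass that provides the required method."""

    def execute_worker(self, request):
        return f"handled {request}"


def try_instantiate(cls):
    """Return (instance, None) on success or (None, error message) on failure."""
    try:
        return cls(), None
    except TypeError as exc:
        return None, str(exc)
```

`try_instantiate(IncompleteWorker)` returns an error message that names `execute_worker`, while `try_instantiate(CompleteWorker)` succeeds. Checking whether the installed Neuron worker defines that method would be a reasonable first step.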