Open ashutoshsaboo opened 3 months ago
cc @liangfu
@liangfu would appreciate it if you could help with the above issue!
@aws-patlange could you please look into this?
We currently don't support paged attention in the Neuron integration. You need to explicitly set `--block-size` to the `--max-model-len` value. See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html.
This will likely need some edits here to be able to pass it to one of the API entrypoints provided in vLLM. The argument parser currently restricts `--block-size` to a few specific values, so that restriction needs to be lifted first (a rough sketch of the change is below):
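This sketch targets `vllm/engine/arg_utils.py`; the exact file location, argument definition, and `choices` list vary across vLLM versions, so treat the specifics below as assumptions to verify against your checkout:

```python
# Sketch: relax the --block-size restriction. The file path and the exact
# choices list are assumptions; adjust to what your vLLM version contains.

# Before: the parser only accepts a fixed set of block sizes, so a value
# like 2048 is rejected before it ever reaches the engine.
parser.add_argument('--block-size',
                    type=int,
                    default=EngineArgs.block_size,
                    choices=[8, 16, 32, 128],
                    help='Token block size.')

# After: drop the choices restriction so --block-size can be set equal to
# --max-model-len, as the Neuron backend requires.
parser.add_argument('--block-size',
                    type=int,
                    default=EngineArgs.block_size,
                    help='Token block size.')
```

With that change in place, start the server: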
```bash
python -u -m vllm.entrypoints.openai.api_server \
  --port 8081 \
  --model $MODEL_NAME \
  --trust-remote-code \
  --max-num-seqs 1 \
  --device neuron \
  --max-model-len 2048 \
  --block-size 2048 \
  2>&1 | tee api_server.log &
```
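Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (port and model name taken from the command above) might look like:

```bash
# Send a single completion request to the server started above.
curl http://localhost:8081/v1/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MODEL_NAME\", \"prompt\": \"hi\", \"max_tokens\": 32}"
```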
@aws-patlange Hi, I used your command but I'm getting: `TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker`. Any pointers? Thanks!
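That error suggests the installed vLLM version declares `execute_worker` as an abstract method on the worker base class without the Neuron worker overriding it. A minimal sketch of a local workaround, assuming the class lives in `vllm/worker/neuron_worker.py` and that a no-op body is safe because the Neuron backend performs no cache swap/copy operations (both assumptions to verify against your version):

```python
# Hypothetical patch: add a concrete execute_worker to the existing
# NeuronWorker class in vllm/worker/neuron_worker.py. The module path,
# import, and signature are assumptions based on the error message.
from vllm.worker.worker_base import WorkerInput


class NeuronWorker:  # in the real file this subclasses the worker base class
    def execute_worker(self, worker_input: WorkerInput) -> None:
        # The Neuron backend does not use paged attention, so there are no
        # KV-cache block swap or copy operations to perform per step.
        pass
```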
Your current environment
🐛 Describe the bug
Hi, I'm trying to deploy llama-8b using vLLM on an AWS Inferentia (inf2.8xlarge) instance. After lots of hacks and tiring attempts, I've managed to get the vLLM server to spawn correctly. However, when I try model inference for even a simple "hi" input prompt, it gives this error as a warning on the console, and the LLM returns nothing in the Gradio UI I've set up. See the thread for code-related details. Would appreciate help from someone with a fix for the below! I'm using SkyPilot to deploy, in case that matters:
Here's how I set up the vLLM-specific things on the instance:
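The reporter's actual setup commands were omitted from this excerpt. Purely as a hypothetical reconstruction based on vLLM's Neuron installation docs (package names and the `VLLM_TARGET_DEVICE` variable are assumptions to verify against your vLLM version):

```bash
# Install the AWS Neuron SDK pieces from the Neuron pip repository.
python -m pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com \
    neuronx-cc torch-neuronx transformers-neuronx

# Build vLLM from source with the Neuron backend enabled.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements-neuron.txt
VLLM_TARGET_DEVICE=neuron pip install .
```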
And here's how I run the server:
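Again as a hypothetical reconstruction (the original command was omitted here; the flags are inferred from the rest of the thread):

```bash
# Pin the server to the first two NeuronCores and launch the OpenAI API server.
NEURON_RT_VISIBLE_CORES=0-1 python -u -m vllm.entrypoints.openai.api_server \
  --port 8081 \
  --model $MODEL_NAME \
  --trust-remote-code \
  --max-num-seqs 1 \
  --device neuron \
  --max-model-len 2048 \
  2>&1 | tee api_server.log &
```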
A few things immediately came to mind. One was the `NEURON_RT_VISIBLE_CORES` env var: I tried widening it from 0-1 to, say, 0-3, but then the vLLM server fails and doesn't even boot up. This is on an inf2.8xlarge instance. My understanding is that each inf2 accelerator has 8 cores (and the 8xlarge has a single Inferentia accelerator), so shouldn't this ideally go up to 0-7? Yet even smaller ranges than that don't work. I also tried increasing `--max-model-len` to 4096, but with that the vLLM server doesn't boot up either, and it fails.
Increasing `--max-num-seqs` to >1 also makes the vLLM server fail to start. Can someone please help with what I could be missing here and how to fix this error? 🙏 I have tried numerous things, but sadly most of them fail on vLLM's side. 😦
Can someone please help with the above!