vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: AssertionError: Speculative decoding not yet supported for RayGPU backend. #4358

Open cocoza4 opened 2 months ago

cocoza4 commented 2 months ago

🚀 The feature, motivation and pitch

Hi,

Do you have any workaround for the "Speculative decoding not yet supported for RayGPU backend." error, or any idea of when the RayGPU backend will support speculative decoding?

I run the vLLM server with the following command:

python3 -u -m vllm.entrypoints.openai.api_server \
       --host 0.0.0.0 \
       --model casperhansen/mixtral-instruct-awq \
       --tensor-parallel-size 4 \
       --enforce-eager \
       --quantization awq \
       --gpu-memory-utilization 0.96 \
       --kv-cache-dtype fp8 \
       --speculative-model mistralai/Mistral-7B-Instruct-v0.2 \
       --num-speculative-tokens 3 \
       --use-v2-block-manager \
       --num-lookahead-slots 5

However, this fails with AssertionError: Speculative decoding not yet supported for RayGPU backend.
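
For reference, dropping --tensor-parallel-size so that the single-GPU executor is used instead of Ray does not appear to hit this assertion. This is only a sketch, assuming the error is raised solely by the Ray executor used for tensor parallelism and that the model fits on a single GPU:

# Single-GPU sketch (no --tensor-parallel-size, so the Ray GPU backend is not used);
# only viable if casperhansen/mixtral-instruct-awq fits in one GPU's memory.
python3 -u -m vllm.entrypoints.openai.api_server \
       --host 0.0.0.0 \
       --model casperhansen/mixtral-instruct-awq \
       --enforce-eager \
       --quantization awq \
       --speculative-model mistralai/Mistral-7B-Instruct-v0.2 \
       --num-speculative-tokens 3 \
       --use-v2-block-manager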

Alternatives

No response

Additional context

No response

psych0v0yager commented 2 months ago

I am having the same issue

python -m vllm.entrypoints.openai.api_server \
       --model /home/llama3_70B_awq \
       --port 8000 \
       --tensor-parallel-size 2 \
       --gpu-memory-utilization 0.95 \
       --kv-cache-dtype fp8 \
       --max-num-seqs 32 \
       --speculative-model /home/llama3_8B_gptq \
       --num-speculative-tokens 3 \
       --use-v2-block-manager

jamestwhedbee commented 1 month ago

running into this as well

bkchang commented 1 month ago

Running into this as well

YuCheng-Qi commented 1 month ago

Running into this as well

MRKINKI commented 1 month ago

Running into this as well

bkchang commented 1 month ago

This issue should have been resolved by #4840
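
For anyone still seeing the assertion, a quick check (assuming the change referenced above has shipped in a released version) is to upgrade vLLM and re-run the original server command:

# Pull the latest released vLLM, then re-run the original command to verify
# that the Ray backend no longer rejects speculative decoding.
pip install --upgrade vllm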