Thanks for creating the issue. Two questions:
@cadedaniel Thanks for the quick response!

1. Disabled --enable-prefix-caching, and it eventually ran into the same error.
2. Disabled --enable-prefix-caching, and enabled --enforce-eager. This didn't error on the set of queries I ran.

Thanks for trying those out so fast :)
OK, the issue is very likely caused by CUDA graphs + batch expansion. This should be fixed, but since spec decode performance isn't good right now, the fix won't be prioritized until that improves.
FYI @LiuXiaoxuanPKU another issue with batch expansion + cuda graph
Do you recommend just using --enforce-eager until this is fixed?
yep.
If you are blocked by this issue, the fix shouldn't be very hard. I think we simply need to configure the cuda graph max size to include the expanded batch size.
The code that is breaking is:
```python
if use_captured_graph:
    # The shape of graph_block_tables is
    # [max batch size, max context len // block size].
    input_block_tables = self.graph_block_tables[:batch_size]
    for i, block_table in enumerate(block_tables):
        if block_table:
            input_block_tables[i, :len(block_table)] = block_table
```
The issue is that len(block_table) > input_block_tables.shape[1], and that second dimension corresponds to max context len // block size. Am I misunderstanding how this is a batch-size issue and not a context-len issue?
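To make the shape mismatch concrete, here is a minimal sketch of that assignment failing on its own. The sizes are assumptions for illustration (block_size=16, a capture limit of 8192 tokens, a 16384-token context), not numbers taken from this issue's logs:

```python
import numpy as np

# Assumed illustrative sizes, not taken from this issue's logs.
block_size = 16
max_seq_len_to_capture = 8192   # assumed capture limit the graph tables are sized from
max_batch_size = 256

# Mirrors the shape comment above: [max batch size, max context len // block size]
graph_block_tables = np.zeros(
    (max_batch_size, max_seq_len_to_capture // block_size), dtype=np.int32)

# A 16384-token context needs 16384 // 16 = 1024 blocks, but the table
# only has 8192 // 16 = 512 columns.
block_table = list(range(16384 // block_size))

input_block_tables = graph_block_tables[:1]
# Raises: ValueError: could not broadcast input array from shape (1024,) into shape (512,)
input_block_tables[0, :len(block_table)] = block_table
```

If the table really is sized from the capture limit rather than from max-model-len, that would also explain why raising max_seq_len_to_capture (as suggested later in this thread) makes the error go away.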
good point. Wonder why this is specific to spec decode then.
Does the sequence length plus proposal length go over the max model length?
That was our suspicion as well, so we set speculative-max-model-len shorter than max-model-len - num-speculative-tokens, but that doesn't seem to stop the issue (see the sanity check after the flag list below).
```
--max-model-len 16384 \
--speculative-max-model-len 16000 \
--speculative-model [ngram] \
--num-speculative-tokens 128 \
--ngram-prompt-lookup-max 32 \
--ngram-prompt-lookup-min 16 \
```
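As a quick sanity check of that length accounting with the values above (plain arithmetic only, not vLLM's actual validation logic):

```python
max_model_len = 16384
num_speculative_tokens = 128
speculative_max_model_len = 16000

# 16000 < 16384 - 128 = 16256, so the speculative length budget itself is
# respected; the failure has to come from somewhere else.
assert speculative_max_model_len < max_model_len - num_speculative_tokens
```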
@Adhyyan1252 could you see if you also get this error with vLLM 0.4.3?
@njhill we ran into this error with 0.4.3 originally before we tried upgrading to 0.5.0.
Try adding the param --max-seq-len-to-capture, set equal to max_model_len.
Still the same issue.
I found a workaround: set max_seq_len_to_capture to max_model_len.
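For anyone else hitting this, a sketch of applying that workaround through the Python API. The model name and the use_v2_block_manager setting are assumptions on my part; the lengths and speculative settings mirror the flags earlier in the thread, and the server-side equivalent is the --max-seq-len-to-capture flag mentioned above:

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",  # assumed HF repo name for the Mixtral 8x22B model
    tensor_parallel_size=8,
    max_model_len=16384,
    max_seq_len_to_capture=16384,   # workaround: match max_model_len so the graph block tables are wide enough
    speculative_model="[ngram]",
    num_speculative_tokens=128,
    ngram_prompt_lookup_max=32,
    ngram_prompt_lookup_min=16,
    use_v2_block_manager=True,      # assumption: spec decode required the v2 block manager in this vLLM version
)
```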
I think https://github.com/vllm-project/vllm/pull/8340 should solve this.
This fixed it for me; looks like it's related to eager mode / CUDA graphs:
`max_model_len=max_tokens, max_seq_len_to_capture=max_tokens`
Your current environment
🐛 Describe the bug
I am running into an issue with the vLLM server in speculative decoding mode. The server is launched with this command on an 8xH100 machine for a Mixtral 8x22B model.
After running several queries, the server runs into an error and does not recover. This takes some time, presumably because the bug only appears once the KV cache is populated.
It seems to be an off-by-one error coming from the speculative decoding code.
Let me know if more information is needed.