System Info

Running 2 x H100 NVL GPUs.

Who can help?

@kaiyux @byshiue @ncomly-nvidia

Information

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)
Reproduction
1) Try to use speculative decoding
2) Observe this error
I am attempting to use speculative decoding with inflight_fused_batching, as required.
I have set up TensorRT-LLM correctly and sent the request. I set num_draft_tokens to 1000 just to get it working.
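For context, this is roughly how I send the request (a minimal sketch assuming the tensorrt_llm_bls model over gRPC; the model and input names match my local deployment and may differ in yours):

```python
# Minimal sketch of the request (assumed model/input names:
# "tensorrt_llm_bls", "text_input", "max_tokens", "num_draft_tokens").
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def as_input(name: str, data: np.ndarray) -> grpcclient.InferInput:
    # Wrap a numpy array as a Triton input tensor.
    tensor = grpcclient.InferInput(name, list(data.shape), np_to_triton_dtype(data.dtype))
    tensor.set_data_from_numpy(data)
    return tensor

client = grpcclient.InferenceServerClient(url="localhost:8001")
inputs = [
    as_input("text_input", np.array([["Once upon a time"]], dtype=object)),
    as_input("max_tokens", np.array([[64]], dtype=np.int32)),
    # The value the assertion below complains about:
    as_input("num_draft_tokens", np.array([[1000]], dtype=np.int32)),
]
result = client.infer("tensorrt_llm_bls", inputs)
print(result.as_numpy("text_output"))
```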
Now I get this error:
c_triton_request_single
tensorrtllm_backend-1 | raise pb_utils.TritonModelException(responses.error().message())
tensorrtllm_backend-1 | c_python_backend_utils.TritonModelException: Encountered error for requestId 424238336: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Number of draft tokens (56) is larger than maximum number of draft tokens (0) (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:536)
This is strange: my config should allow draft tokens, yet the engine reports a maximum of 0.
I am sure my TensorRT-LLM config is correct, and decoupled mode is off.
The draft tokens are generated, but the target model never runs another forward pass. Why?
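For what it's worth, here is an illustrative reconstruction of the check that seems to be tripping, based only on the assertion message from trtGptModelInflightBatching.cpp:536 (the function and parameter names are hypothetical, not the real TensorRT-LLM API):

```python
# Hypothetical reconstruction of the failing check; names are illustrative.
def validate_draft_tokens(num_draft_tokens: int, max_draft_tokens: int) -> None:
    # The batch manager rejects any request that carries more draft tokens
    # than the engine was built to accept. An engine with no speculative-
    # decoding support would report a maximum of 0, so every draft-carrying
    # request would fail this check.
    if num_draft_tokens > max_draft_tokens:
        raise AssertionError(
            f"Number of draft tokens ({num_draft_tokens}) is larger than "
            f"maximum number of draft tokens ({max_draft_tokens})"
        )

validate_draft_tokens(56, 0)  # reproduces the message from my log
```

In my case the reported maximum is 0, which suggests the target engine itself does not advertise any draft-token capacity, regardless of what the request asks for.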
Expected behavior
Speculative decoding works: the target engine accepts the draft tokens and the request completes.
actual behavior
Speculative decoding does not work; the request fails with the assertion above.
additional notes
I would also like speculative decoding to work with streaming. Is this planned, @ncomly-nvidia?