triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Speculative decoding: Assertion failed: Number of draft tokens (56) is larger than maximum number of draft tokens (0) #473

Closed avianion closed 1 month ago

avianion commented 1 month ago

System Info

Running 2 x H100 NVL GPUs.

Who can help?

@kaiyux @byshiue @ncomly-nvidia

Reproduction

1. Try to use speculative decoding.
2. Observe the error below.

I am attempting to use speculative decoding with inflight_fused_batching, as is required.

I have set up TensorRT-LLM correctly and have sent the request. I set num_draft_tokens to 1000 just to get it working.
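For reference, the request I send is roughly equivalent to the sketch below. The model name (tensorrt_llm_bls) and tensor names (text_input, max_tokens, num_draft_tokens, text_output) follow my understanding of the BLS example in this repo, so treat them as assumptions rather than exact values from my setup.

```python
# Rough sketch of the request; model and tensor names are assumptions based on
# the tensorrt_llm_bls example and may not match the actual setup exactly.
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

client = grpcclient.InferenceServerClient("localhost:8001")

def make_input(name, data):
    t = grpcclient.InferInput(name, list(data.shape), np_to_triton_dtype(data.dtype))
    t.set_data_from_numpy(data)
    return t

inputs = [
    make_input("text_input", np.array([["Hello, how are you?"]], dtype=object)),
    make_input("max_tokens", np.array([[128]], dtype=np.int32)),
    # Draft-token budget for speculative decoding; set to 1000 just to get it working.
    make_input("num_draft_tokens", np.array([[1000]], dtype=np.int32)),
]

result = client.infer("tensorrt_llm_bls", inputs)
print(result.as_numpy("text_output"))
```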

Now I get this error:

c_triton_request_single
tensorrtllm_backend-1 |     raise pb_utils.TritonModelException(responses.error().message())
tensorrtllm_backend-1 | c_python_backend_utils.TritonModelException: Encountered error for requestId 424238336: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Number of draft tokens (56) is larger than maximum number of draft tokens (0) (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:536)
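My reading of the message is that the engine-side limit is zero, i.e. the target engine may have been built without any draft-token budget. A quick way to inspect that would be something along these lines; the config.json location (wherever gpt_model_path points) and the build_config / max_draft_len field names are my assumptions and may differ between TensorRT-LLM versions:

```python
# Hypothetical check of the target engine's build-time draft-token limit.
# The path and the field names ("build_config", "max_draft_len") are
# assumptions and may differ between TensorRT-LLM versions.
import json

engine_dir = "/path/to/target_engine"  # the directory gpt_model_path points at

with open(f"{engine_dir}/config.json") as f:
    engine_cfg = json.load(f)

print(engine_cfg.get("build_config", {}).get("max_draft_len", 0))
```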

This is strange, because I am sure my TensorRT-LLM config is correct. Decoupled mode is also off.

The draft tokens are generated, but another forward pass never happens. Why?
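To make the question concrete, this is the draft-then-verify flow I expect to happen internally. It is my own pseudocode-style illustration of speculative decoding, not code from this repo, and the second step is the one that fails with the assertion above.

```python
# Illustrative sketch of the speculative decoding flow I expect. The callables
# stand in for the draft and target models; they are not real APIs.
from typing import Callable, List

def speculative_step(
    draft_generate: Callable[[List[int], int], List[int]],
    target_verify: Callable[[List[int], List[int]], List[int]],
    prompt_ids: List[int],
    num_draft_tokens: int,
) -> List[int]:
    # 1. The draft model proposes up to num_draft_tokens tokens.
    draft_ids = draft_generate(prompt_ids, num_draft_tokens)
    # 2. The target model verifies the draft in a single forward pass
    #    (the pass that, for me, fails with the assertion above).
    return target_verify(prompt_ids, draft_ids)
```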

Expected behavior

Speculative decoding works.

Actual behavior

Speculative decoding doesn't work.

Additional notes

I would also like speculative decoding to work with streaming. Is this planned, @ncomly-nvidia?