triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Speculative decoding: Assertion failed: Number of draft tokens (56) is larger than maximum number of draft tokens (0) #473

Closed avianion closed 1 month ago

avianion commented 1 month ago

System Info

Running 2 x H100 NVL GPUs.

Who can help?

@kaiyux @byshiue @ncomly-nvidia

Reproduction

1. Try to use speculative decoding.
2. Observe the error below.

I am attempting to use speculative decoding with inflight_fused_batching, as is required.

I have set up TensorRT-LLM correctly and have sent the request. I set num_draft_tokens to 1000 just to get it working.
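For reference, the request I send is roughly equivalent to the sketch below. The model name (tensorrt_llm_bls) and tensor names (text_input, max_tokens, num_draft_tokens, text_output) follow my understanding of the BLS example in this repo, so treat them as assumptions rather than exact values from my setup.

```python
# Rough sketch of the request; model and tensor names are assumptions based on
# the tensorrt_llm_bls example and may not match the actual setup exactly.
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

client = grpcclient.InferenceServerClient("localhost:8001")

def make_input(name, data):
    t = grpcclient.InferInput(name, list(data.shape), np_to_triton_dtype(data.dtype))
    t.set_data_from_numpy(data)
    return t

inputs = [
    make_input("text_input", np.array([["Hello, how are you?"]], dtype=object)),
    make_input("max_tokens", np.array([[128]], dtype=np.int32)),
    # Draft-token budget for speculative decoding; set to 1000 just to get it working.
    make_input("num_draft_tokens", np.array([[1000]], dtype=np.int32)),
]

result = client.infer("tensorrt_llm_bls", inputs)
print(result.as_numpy("text_output"))
```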

Now I get this error:

c_triton_request_single
tensorrtllm_backend-1 |     raise pb_utils.TritonModelException(responses.error().message())
tensorrtllm_backend-1 | c_python_backend_utils.TritonModelException: Encountered error for requestId 424238336: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Number of draft tokens (56) is larger than maximum number of draft tokens (0) (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:536)
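My reading of the message is that the engine-side limit is zero, i.e. the target engine may have been built without any draft-token budget. A quick way to inspect that would be something along these lines; the config.json location (wherever gpt_model_path points) and the build_config / max_draft_len field names are my assumptions and may differ between TensorRT-LLM versions:

```python
# Hypothetical check of the target engine's build-time draft-token limit.
# The path and the field names ("build_config", "max_draft_len") are
# assumptions and may differ between TensorRT-LLM versions.
import json

engine_dir = "/path/to/target_engine"  # the directory gpt_model_path points at

with open(f"{engine_dir}/config.json") as f:
    engine_cfg = json.load(f)

print(engine_cfg.get("build_config", {}).get("max_draft_len", 0))
```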

This is strange, because I am sure my TensorRT-LLM config is correct. Decoupled mode is also off.

The draft tokens are generated, but another forward pass never happens. Why?
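To make the question concrete, this is the draft-then-verify flow I expect to happen internally. It is my own pseudocode-style illustration of speculative decoding, not code from this repo, and the second step is the one that fails with the assertion above.

```python
# Illustrative sketch of the speculative decoding flow I expect. The callables
# stand in for the draft and target models; they are not real APIs.
from typing import Callable, List

def speculative_step(
    draft_generate: Callable[[List[int], int], List[int]],
    target_verify: Callable[[List[int], List[int]], List[int]],
    prompt_ids: List[int],
    num_draft_tokens: int,
) -> List[int]:
    # 1. The draft model proposes up to num_draft_tokens tokens.
    draft_ids = draft_generate(prompt_ids, num_draft_tokens)
    # 2. The target model verifies the draft in a single forward pass
    #    (the pass that, for me, fails with the assertion above).
    return target_verify(prompt_ids, draft_ids)
```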

Expected behavior

Speculative decoding works.

Actual behavior

Speculative decoding doesn't work.

Additional notes

I would also like speculative decoding to work with streaming. Is this planned, @ncomly-nvidia?