Closed XiaobingSuper closed 2 months ago
When using the BLS model instead of the ensemble, you should set the number of model instances to the maximum batch size supported by the TRT engine to allow concurrent request execution. This can be done by modifying the count value in the instance_group section of the BLS model config.pbtxt.
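As a rough sketch of that change, assuming the engine was built with a maximum batch size of 64 (an illustrative value, not one taken from this thread), the instance_group section of the tensorrt_llm_bls config.pbtxt might look like the following. The KIND_CPU setting is also an assumption, based on how Python-backend models are typically configured; keep whatever kind your existing config already uses.

```
# tensorrt_llm_bls/config.pbtxt (fragment)
# Assumption: the TRT engine supports a maximum batch size of 64.
# Set count to match your own engine's maximum batch size.
instance_group [
  {
    count: 64        # one BLS instance per concurrently executing request
    kind: KIND_CPU   # assumed; keep the kind already present in your config
  }
]
```

With count left at 1, the BLS model would handle one request at a time, so the engine would never see more than one request in flight.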
The documentation will be improved to include this information.
@pcastonguay, thanks, it works now.
I created this issue to ask about inflight decoding in tensorrt_llm_bls mode. I find that inflight decoding works well when running in ensemble mode, but in tensorrt_llm_bls mode (speculative decoding is not used), inflight decoding does not work well and performance is worse. Does this make sense? I hope that inflight decoding can also work well in tensorrt_llm_bls mode.