binhtranmcs opened 1 month ago
@binhtranmcs You can set `${bls_instance_count}` to `max_batch_size`; requests should then be processed in parallel.
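For context, this is roughly what the suggestion amounts to in the `tensorrt_llm_bls` model's `config.pbtxt` (a sketch; the concrete value of 4 is illustrative, assuming the `${bls_instance_count}` template variable fills the `instance_group` count):

```
# tensorrt_llm_bls/config.pbtxt (illustrative fragment)
instance_group [
  {
    count: 4        # filled from ${bls_instance_count}, e.g. set to max_batch_size
    kind: KIND_CPU
  }
]
```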
@activezhao Considering the link you provided for deploying a TensorRT-LLM model, do we also need to modify the `count` field of the `instance_group` block for the `preprocessing`, `tensorrt_llm`, and `postprocessing` models? If not, can you explain why?
Another question related to `instance_group`: should we keep the `KIND_CPU` value for the `kind` field in the `config.pbtxt` files of `preprocessing`, `tensorrt_llm`, `tensorrt_llm_bls`, and `postprocessing`? If we deploy a TensorRT-LLM engine, it makes sense to me to change `KIND_CPU` to `KIND_GPU` in the `tensorrt_llm` `config.pbtxt` file, but I am not sure this is right.
Thanks in advance for your time.
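For reference, the block the question is about looks roughly like this (a sketch of the shipped `tensorrt_llm` `config.pbtxt`; the count value is illustrative, and whether `KIND_GPU` is correct here is exactly what is being asked):

```
# tensorrt_llm/config.pbtxt (illustrative fragment)
instance_group [
  {
    count: 1
    kind: KIND_CPU   # the question: should this become KIND_GPU for an engine on GPU?
  }
]
```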
I think there is a bug in the implementation of the BLS backend. The `return` is inside the for loop, so the backend only handles one request per execution and ignores the rest. The for loop also means that later requests must wait until the earlier ones have finished, which is very inefficient. Please have a look!
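To make the reported bug concrete, here is a minimal, library-free Python sketch of the pattern: an `execute()`-style function that returns inside its request loop drops everything after the first request, while the fixed version collects one response per request before returning. The `process` helper is a placeholder for the real BLS inference call; all names here are illustrative, not the actual backend code.

```python
def process(request):
    """Placeholder for the real per-request BLS inference call."""
    return request * 2


def execute_buggy(requests):
    """Mimics the reported bug: `return` sits inside the for loop."""
    for request in requests:
        response = process(request)
        return [response]  # BUG: returns after the first request; the rest are ignored


def execute_fixed(requests):
    """Collect a response for every request, then return them all at once."""
    responses = []
    for request in requests:
        responses.append(process(request))
    return responses  # one response per request


print(execute_buggy([1, 2, 3]))  # [2] -- requests 2 and 3 are silently dropped
print(execute_fixed([1, 2, 3]))  # [2, 4, 6]
```

Note that even the fixed version still processes requests sequentially; the parallelism concern raised above would additionally require multiple model instances (or asynchronous request handling), which is what raising the instance count addresses.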