triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Bug in tensorrt_llm_bls #535

Open binhtranmcs opened 1 month ago

binhtranmcs commented 1 month ago

I think there is a bug here in the implementation of the BLS backend: the return is inside the for loop, so the backend only handles one request per execution and ignores the rest.

    def execute(self, requests):

        responses = []

        for request in requests:
            if self.decoupled:
                response_sender = request.get_response_sender()
            try:

                req = self.decoder.convert_triton_request(request)
                req.validate()
                speculative_decode = (req.num_draft_tokens is not None
                                      and req.num_draft_tokens[0][0] > 0)
                if speculative_decode and (self.draft_llm_model_name is None
                                           or self.draft_llm_model_name == ""):
                    raise Exception(
                        "cannot perform speculative decoding without draft model"
                    )
                res_gen = self.decoder.decode(
                    req, speculative_decoding=speculative_decode)

                for res in res_gen:
                    triton_response = self.decoder.create_triton_response(res)
                    if self.decoupled:
                        response_sender.send(triton_response)
                    else:
                        responses.append(triton_response)

                if self.decoupled:
                    response_sender.send(
                        flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

            except Exception:
                self.logger.log_error(traceback.format_exc())
                # If encountering an error, send a response with err msg
                error_response = pb_utils.InferenceResponse(
                    output_tensors=[],
                    error=pb_utils.TritonError(traceback.format_exc()))

                if self.decoupled:
                    response_sender.send(error_response)
                    response_sender.send(
                        flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
                else:
                    responses.append(error_response)

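            # NOTE: the return below is still inside the `for request in requests`
            # loop, so only the first request in the batch ever gets handled.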
            self.decoder.reset_decoder()
            if self.decoupled:
                return None
            else:
                assert len(responses) == len(requests)
                return responses

Also, the for loop means that each request has to wait for the previous ones to finish before it is processed, which is not efficient at all. Please have a look!
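For reference, a minimal sketch of the structure I would expect, with the return moved outside the per-request loop (the per-request body is elided; this is only an illustration, not the upstream fix):

    def execute(self, requests):
        responses = []

        for request in requests:
            # ... handle this request exactly as in the current body:
            # convert, validate, decode, and either send results through the
            # response sender (decoupled) or append them to `responses` ...
            self.decoder.reset_decoder()

        # Return once, after every request in the batch has been handled.
        if self.decoupled:
            return None
        else:
            assert len(responses) == len(requests)
            return responses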

activezhao commented 1 month ago

@binhtranmcs You can set ${bls_instance_count} to max_batch_size; requests should then be processed in parallel.

bls_instance_count
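In the tensorrt_llm_bls config.pbtxt this maps to the instance_group block, roughly like this (the count shown is just an example; the template normally fills it in from ${bls_instance_count}):

    instance_group [
      {
        count: 8          # illustration only; set this to your max_batch_size
        kind: KIND_CPU    # the value shipped in the example config
      }
    ]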

MatteoPagliani commented 1 month ago

@activezhao Regarding the link you provided about deploying a TensorRT-LLM model: do we also need to modify the count field of the instance_group block for the preprocessing, tensorrt_llm and postprocessing models? If not, can you explain why?

Another question related to instance_group: should we keep the KIND_CPU value for the kind field in the config.pbtxt files of preprocessing, tensorrt_llm, tensorrt_llm_bls and postprocessing? If we deploy a TensorRT-LLM engine, it makes sense to me to change KIND_CPU to KIND_GPU in the tensorrt_llm config.pbtxt file, but I am not sure this is right.

Thanks in advance for your time.