triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

How to make the server call tensorrt_llm/examples/run.py? #256

Open shil3754 opened 10 months ago

shil3754 commented 10 months ago

I've followed the instructions at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/baichuan.md to run Baichuan2-7B-Chat. But for exactly the same engine, the outputs are always different depending on whether I send a request with

    curl -X POST localhost:8000/v2/models/ensemble/generate

or run

    python /tensorrtllm_backend/tensorrt_llm/examples/run.py

Somehow the latter often gives much better responses.

I've looked at the model.py in all_models/inflight_batcher_llm/tensorrt_llm_bls/1, and in particular at this part:

    trtllm_request = pb_utils.InferenceRequest(
        model_name="tensorrt_llm",
        inputs=trtllm_input_tensors,
        requested_output_names=list(
            self.trtllm_output_to_postproc_input_map.keys()))

    # Execute trtllm
    trtllm_responses = trtllm_request.exec(
        decoupled=self.decoupled)

But it is not clear to me what function is actually called inside trtllm_request.exec().

How can I modify the code so that, when a request is sent, run.py (or a custom function that wraps the LLM) is called instead?
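
For concreteness, something along these lines is what I have in mind: a rough sketch of a custom Python-backend model.py that drives the same ModelRunner that run.py uses, instead of forwarding to the tensorrt_llm model via exec(). The paths, tensor names, and sampling values below are just my placeholders, not anything taken from the repo.

    import numpy as np
    import triton_python_backend_utils as pb_utils
    from transformers import AutoTokenizer
    from tensorrt_llm.runtime import ModelRunner


    class TritonPythonModel:

        def initialize(self, args):
            # Placeholder paths; they would point at the real engine and
            # tokenizer directories for Baichuan2-7B-Chat.
            engine_dir = "/engines/baichuan2_7b_chat"
            tokenizer_dir = "/models/Baichuan2-7B-Chat"
            self.tokenizer = AutoTokenizer.from_pretrained(
                tokenizer_dir, trust_remote_code=True)
            # The same runner that examples/run.py builds internally.
            self.runner = ModelRunner.from_dir(engine_dir=engine_dir)

        def execute(self, requests):
            responses = []
            for request in requests:
                # Assumes a single TYPE_STRING input named "text_input"
                # declared in this model's config.pbtxt.
                text = pb_utils.get_input_tensor_by_name(
                    request, "text_input").as_numpy()[0].decode("utf-8")
                input_ids = self.tokenizer.encode(text, return_tensors="pt")
                # Generation call mirroring run.py; the sampling values are
                # placeholders and should match whatever run.py is given.
                output_ids = self.runner.generate(
                    batch_input_ids=[input_ids[0]],
                    max_new_tokens=256,
                    end_id=self.tokenizer.eos_token_id,
                    pad_id=self.tokenizer.eos_token_id,
                    temperature=1.0,
                    top_k=1,
                    top_p=0.0)
                # output_ids is [batch, beams, seq]; drop the prompt tokens.
                new_tokens = output_ids[0][0][input_ids.shape[-1]:]
                text_out = self.tokenizer.decode(
                    new_tokens, skip_special_tokens=True)
                out_tensor = pb_utils.Tensor(
                    "text_output",
                    np.array([text_out.encode("utf-8")], dtype=object))
                responses.append(
                    pb_utils.InferenceResponse(output_tensors=[out_tensor]))
            return responses

What I am not sure about is whether a wrapper like this gives up the in-flight batching that the C++ tensorrt_llm backend provides, which probably matters for throughput.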

byshiue commented 10 months ago

Please make sure you use the correct tokenizer and run with the same parameters/inputs.
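
For example, something along these lines should send the same prompt and the same sampling parameters to the ensemble that you pass to run.py. The field names follow the ensemble config shipped with the backend and the values are only placeholders, so please double check them against your setup:

    import requests

    payload = {
        "text_input": "What is machine learning?",
        "max_tokens": 64,
        "bad_words": "",
        "stop_words": "",
        # These should match what run.py is launched with,
        # e.g. --temperature 1.0 --top_k 1 --top_p 0.0
        "temperature": 1.0,
        "top_k": 1,
        "top_p": 0.0,
    }
    r = requests.post(
        "http://localhost:8000/v2/models/ensemble/generate", json=payload)
    print(r.json()["text_output"])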

You could add some debug messages to the preprocessing model.py to make sure you are using the correct inputs.
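
As a rough sketch (the exact tensor/variable names depend on your copy of the file), a small helper like this could be called from execute() in the preprocessing model.py to log the prompt and the token ids, so you can compare them with what run.py's tokenizer produces:

    import numpy as np
    import triton_python_backend_utils as pb_utils

    def log_preprocessor_debug(query, input_id):
        """Log the raw prompt array and the token ids the preprocessor produced."""
        msg = f"prompt={query!r} input_ids={np.asarray(input_id).tolist()}"
        if hasattr(pb_utils, "Logger"):
            pb_utils.Logger.log_info(msg)  # available in newer Triton releases
        else:
            print(msg, flush=True)  # falls back to the server's stdout/stderr log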

shil3754 commented 10 months ago

I've checked that the correct tokenizer is used, and I've tried to make the parameters consistent, but run.py still gives better responses.

As for model.py, the problem is that there is no such file in all_models/inflight_batcher_llm/tensorrt_llm/1. Is there an example showing how a model.py works for a Baichuan2 model?

byshiue commented 10 months ago

It is at inflight_batcher_llm/preprocessing/1/model.py

stifles commented 6 months ago

Have you solved this problem? @shil3754