Open shil3754 opened 10 months ago
Please make sure you use the correct tokenizer and run with the same parameters/inputs.
You could add some debug messages in the preprocessing model.py to make sure you are using the correct inputs.
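For illustration, a minimal sketch of such a debug hook in the preprocessing model.py (all_models/inflight_batcher_llm/preprocessing/1/model.py) might look like the snippet below. This is not the real file: the tensor names ("QUERY", "INPUT_ID"), the tokenizer path, and the response handling are assumptions, so check your own config.pbtxt and initialize() before copying anything.

```python
# Sketch of a Triton Python-backend preprocessing model with a debug log,
# so the tokenized ids can be compared with what run.py feeds the engine.
# Tensor names and tokenizer path below are assumptions for illustration.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Assumed tokenizer location; the real model.py reads it from model parameters.
        self.tokenizer = AutoTokenizer.from_pretrained(
            "/models/Baichuan2-7B-Chat", trust_remote_code=True)

    def execute(self, requests):
        responses = []
        for request in requests:
            query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()
            text = query[0][0].decode("utf-8")
            input_ids = self.tokenizer.encode(text)
            # Debug message: these ids should match the ids run.py sends to the engine.
            pb_utils.Logger.log_info(f"[preprocessing] text={text!r} input_ids={input_ids}")
            out = pb_utils.Tensor("INPUT_ID", np.array([input_ids], dtype=np.int32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```

If the ids logged here differ from what run.py produces for the same prompt, the tokenizer or prompt formatting is the culprit; if they match, the difference is more likely in the sampling parameters.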
I've checked that the correct tokenizer is used, and I've tried to make the parameters consistent, but still, run.py gives better responses.
As for model.py, the problem is that there is no such file in all_models/inflight_batcher_llm/tensorrt_llm/1.
Is there an example to show how a model.py works in a Baichuan2 model?
It is at inflight_batcher_llm/preprocessing/1/model.py
Have you solved this problem? @shil3754
I've followed the instructions at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/baichuan.md to run Baichuan2-7B-Chat. But for exactly the same engine, the outputs are always different depending on whether I run
curl -X POST localhost:8000/v2/models/ensemble/generate
or
python /tensorrtllm_backend/tensorrt_llm/examples/run.py
Somehow the latter often gives much better responses.
I've looked at the model.py in all_models/inflight_batcher_llm/tensorrt_llm_bls/1, and in particular this part:
But it is not clear to me what function is actually called inside trtllm_request.exec(). How do I modify the code so that when a request is sent, run.py or a custom function that wraps the LLM is called?
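If that model.py follows the usual BLS pattern, trtllm_request is a pb_utils.InferenceRequest targeting the tensorrt_llm model, and exec() is Triton's BLS call into that model, i.e. the TensorRT-LLM backend running your compiled engine. No Python run.py is involved on the server side, so the quality gap is most likely down to request parameters: run.py has its own defaults for top_k/top_p/temperature/num_beams, while the generate endpoint only uses the fields you send plus the defaults in the model configs.

Below is a sketch (not from the repo) of sending a /generate request with the sampling parameters spelled out, so they can be matched one-for-one against the flags passed to run.py. The field names (text_input, max_tokens, top_k, ...) are assumptions based on the default ensemble config; the generate endpoint maps JSON fields to input tensor names, so check your own config.pbtxt.

```python
# Sketch: query the ensemble's generate endpoint with explicit sampling parameters
# so they can be aligned with run.py's command-line flags. Field names are assumed
# from the default ensemble config; adjust to match your config.pbtxt.
import requests

payload = {
    "text_input": "What is the capital of France?",
    "max_tokens": 128,
    "top_k": 1,            # match run.py's --top_k
    "top_p": 0.0,          # match run.py's --top_p
    "temperature": 1.0,    # match run.py's --temperature
    "beam_width": 1,       # match run.py's --num_beams
    "stream": False,
}

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json=payload,
)
print(resp.json().get("text_output"))
```

If you really do want a Python function in the path (for example a wrapper around tensorrt_llm's runner, similar to what run.py does), the place to change is the BLS model.py itself: replace the InferenceRequest/exec() call with your own code. But since exec() already dispatches to the same engine, aligning the parameters is usually enough.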