Lzhang-hub opened this issue 8 months ago
For `run.py`, do you mean running the tensorrt_llm Python runtime directly, without using the tensorrt_llm Triton backend?
Yes, the `run.py` in the TensorRT-LLM repo: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/examples/llama/run.py
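For reference, an invocation looks roughly like this (a sketch, not my exact command; the prompt and `--max_output_len` value are illustrative, and `--engine_dir` points at the engine built in the steps below):

```bash
# Run the engine directly through the TensorRT-LLM Python runtime,
# bypassing the Triton backend entirely.
python run.py --engine_dir ./codellama_7b-fp16 \
    --tokenizer_dir codellama/CodeLlama-7b-hf \
    --max_output_len 20 \
    --input_text "write a quick sort"
```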
Can you share end-to-end reproduction steps on the tensorrt_llm side? We have lots of issues and limited resources, so clear reproduction steps help us find the problem.
1. Build the engine:

```bash
python ../build.py --model_dir codellama/CodeLlama-7b-hf --dtype float16 \
    --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 \
    --output_dir ./codellama_7b-fp16 --rotary_base 1000000 --vocab_size 32016 \
    --use_weight_only --use_inflight_batching --paged_kv_cache
```
2. Set up the model repository:

```bash
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
cp ./codellama_7b-fp16/* triton_model_repo/tensorrt_llm/1
```
Modified `tokenizer_dir` and `tokenizer_type` in triton_model_repo/preprocessing/config.pbtxt and triton_model_repo/postprocessing/config.pbtxt; set `decoupled` to `True`, `gpt_model_type` to `inflight_fused_batching`, and `gpt_model_path` to triton_model_repo/tensorrt_llm/1 (see the sketch below).
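The edits in the tensorrt_llm model's config.pbtxt look roughly like this (a sketch based on the all_models/inflight_batcher_llm template; only the changed fields are shown):

```
# Decoupled transaction policy: the backend may send zero or more
# responses per request (required for streaming).
model_transaction_policy {
  decoupled: True
}
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "triton_model_repo/tensorrt_llm/1" }
}
```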
3. Launch tritonserver:

```bash
tritonserver --model-repository=triton_model_repo
```
4. Send a request to the server over HTTP:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "write a quick sort", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```
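Since `decoupled` is set to `True`, the streaming endpoint may also be relevant (an assumption on my part, based on Triton's generate extension; the `stream` field is the one defined by the inflight_batcher_llm models):

```bash
# Streaming variant of the request above; responses arrive as
# server-sent events instead of a single JSON body.
curl -X POST localhost:8000/v2/models/ensemble/generate_stream \
    -d '{"text_input": "write a quick sort", "max_tokens": 20, "bad_words": "", "stop_words": "", "stream": true}'
```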
What `tokenizer_type` do you use?
llama
Thanks for reporting this @Lzhang-hub. Investigating it. Will provide an update once I have a fix.
I launched tritonserver following the README with codellama-7b-hf, sent a request over HTTP, and get this result:
I launched `run.py` and get this result: