jayakommuru opened this issue 3 months ago
@byshiue @schetlur-nv can you help with this? I am not able to deploy the basic t5-small model following the instructions given in https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md
System Info
GPU: L4
GPU memory: 24 GB
TensorRT-LLM version: v0.10.0
Container used: tritonserver:24.06-trtllm-python-py3
Who can help?
@byshiue @schetlur-nv
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
docker run -v /home/jaya_kommuru/:/home/jaya_kommuru/ -it --gpus=all --net=host --ipc=host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout tags/v0.10.0
cd examples/enc_dec/
git clone https://huggingface.co/google-t5/t5-small /tmp/hf_models/t5-small
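Side note: the t5-small repository stores its weight files in Git LFS, so if the clone above only brings down pointer files, re-fetching them with git-lfs should help (a hedged suggestion, not part of the documented steps):

git lfs install
cd /tmp/hf_models/t5-small && git lfs pull && cd -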
export MODEL_NAME=t5-small
export MODEL_TYPE=t5 # or bart
export HF_MODEL_PATH=/tmp/hf_models/${MODEL_NAME}
export UNIFIED_CKPT_PATH=/tmp/ckpt/${MODEL_NAME}
export ENGINE_PATH=/tmp/engines/${MODEL_NAME}
python convert_checkpoint.py --model_type ${MODEL_TYPE} --model_dir ${HF_MODEL_PATH} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16
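If the conversion succeeds, the unified checkpoint should contain separate encoder and decoder directories (the layout assumed here follows the paths used in the build commands below); a quick sanity check:

ls ${UNIFIED_CKPT_PATH}/tp1/pp1
# expected: decoder  encoder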
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH}/tp1/pp1/encoder --output_dir ${ENGINE_PATH}/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --max_batch_size 64 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --context_fmha disable
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH}/tp1/pp1/decoder --output_dir ${ENGINE_PATH}/decoder --moe_plugin disable --enable_xqa disable --max_batch_size 64 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --context_fmha disable --max_input_len 1 --max_encoder_input_len 2048
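For context, --max_input_len 1 on the decoder build appears intentional for encoder-decoder models: decoding starts from a single start token, and the text prompt is instead bounded by --max_encoder_input_len. Assuming trtllm-build writes a config.json into each engine directory (field names assumed from the build flags above), the built limits can be double-checked with:

grep -o '"max_input_len": *[0-9]*' ${ENGINE_PATH}/decoder/config.json
grep -o '"max_encoder_input_len": *[0-9]*' ${ENGINE_PATH}/decoder/config.json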
cd ../../../
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout tags/v0.10.0
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH}/decoder,encoder_engine_dir:${ENGINE_PATH}/encoder,max_tokens_in_paged_kv_cache:4096,max_attention_window_size:4096,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,enable_chunked_context:False,max_queue_size:0
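As a hedged sanity check, the filled config can be grepped to confirm the substitution landed, in particular that both the decoder and encoder engine paths point where intended:

grep -n "/tmp/engines" all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt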
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_PATH},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_MODEL_PATH},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
pip install SentencePiece
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=all_models/inflight_batcher_llm/
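Once the server is up, Triton's standard readiness endpoint can be used to confirm all models loaded before sending generate requests:

curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# 200 means the server and all models are ready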
curl -X POST localhost:8000/v2/models/ensemble/generate -d "{\"text_input\": \"Summarize the following news article: (CNN)Following last year's successful U.K. tour, Prince and 3rdEyeGirl are bringing the Hit & Run Tour to the U.S. for the first time. The first -- and so far only -- scheduled show will take place in Louisville, Kentucky, the hometown of 3rdEyeGirl drummer Hannah Welton. Slated for March 14, tickets will go on sale Monday, March 9 at 10 a.m. local time. Prince crowns dual rock charts . A venue has yet to be announced. When the Hit & Run worked its way through the U.K. in 2014, concert venues were revealed via Twitter prior to each show. Portions of the ticket sales will be donated to various Louisville charities. See the original story at Billboard.com. ©2015 Billboard. All Rights Reserved.\", \"max_tokens\": 1024, \"bad_words\": \"\", \"stop_words\": \"\"}"
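If the same error (shown below) also appears with a much shorter prompt, that would point at the decoder engine's max_input_len limit rather than the article length; a minimal request for testing this (prompt text is arbitrary):

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "translate English to German: hello world", "max_tokens": 32, "bad_words": "", "stop_words": ""}'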
Expected behavior
The curl request should return a generated summary of the article.
actual behavior
Instead, it fails with the following error:
{"error":"in ensemble 'ensemble', Executor failed process requestId 4 due to the following error: Encountered an error when fetching new request: Prompt length (200) exceeds maximum input length (1). (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/include/tensorrt_llm/batch_manager/llmRequest.h:201)\n1 0x7f15ca33587f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6c387f) [0x7f15ca33587f]\n2 0x7f15cc2dcae2 tensorrt_llm::executor::Executor::Impl::executionLoop() + 722\n3 0x7f16bd1d8253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f16bd1d8253]\n4 0x7f16bcf67ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f16bcf67ac3]\n5 0x7f16bcff8a04 clone + 68"}
additional notes
I have been referring to this example: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md