triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

The result when using inflight_batcher_llm_client to send LoRA weights is not the same as when using TensorRT-LLM run.py #413

Open stifles opened 5 months ago

stifles commented 5 months ago

Case 1: use TensorRT-LLM run.py

python3 /tensorrtllm_backend/tensorrt_llm/examples/run.py \
    --engine_dir "/data512/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/" \
    --max_output_len 2048 \
    --tokenizer_dir "/tensorrtllm_backend/tokenizer" \
    --input_text "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the intention of the following user questions? \Can you help me write a summary<|im_end|>\n<|im_start|>assistant\n" \
    --lora_dir "/tensorrtllm_backend/lora_intent" \
    --lora_task_uids 0 \
    --no_add_special_tokens \
    --use_py_session \
    --streaming

Output [Text 0 Beam 0]: "Writing"

Case 2: use inflight_batcher_llm_client

python3 /tensorrtllm_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --request-output-len 2048 \
    --text "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is the intention of the following user questions? \Can you help me write a summary<|im_end|>\n<|im_start|>assistant\n" \
    --tokenizer-dir /tensorrtllm_backend/tokenizer \
    --lora-path "/tensorrtllm_backend/lora_intent" \
    --streaming

Output [Text 0 Beam 0]: "Summary"

The expected answer is "Writing".

byshiue commented 5 months ago

Could you print the input ids of both cases?

stifles commented 5 months ago

Could you print the input ids of both cases?

Case 1 (run.py): [151644, 8948, 1699, 2610, 525, 264, 10950, 17847, 13, 151645, 1699, 151644, 872, 1699, 3838, 374, 279, 14602, 315, 279, 2701, 1196, 4755, 30, 1124, 6713, 498, 1492, 752, 3270, 264, 12126, 151645, 1699, 151644, 77091, 1699]

Case 2 (inflight_batcher_llm_client): [151644, 8948, 1699, 2610, 525, 264, 10950, 17847, 13, 151645, 1699, 151644, 872, 1699, 3838, 374, 279, 14602, 315, 279, 2701, 1196, 4755, 30, 1124, 6713, 498, 1492, 752, 3270, 264, 12126, 151645, 1699, 151644, 77091, 1699]
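A quick sanity check, pasting the two reported id lists verbatim, confirms they are element-for-element identical. Since both paths feed the engine the same tokens, the divergence must happen after tokenization (for example in how the LoRA weights are loaded or applied, or in differing sampling parameters), not in preprocessing:

```python
# Input ids reported for each case, copied verbatim from the comment above.
ids_run_py = [151644, 8948, 1699, 2610, 525, 264, 10950, 17847, 13, 151645,
              1699, 151644, 872, 1699, 3838, 374, 279, 14602, 315, 279,
              2701, 1196, 4755, 30, 1124, 6713, 498, 1492, 752, 3270,
              264, 12126, 151645, 1699, 151644, 77091, 1699]
ids_triton = [151644, 8948, 1699, 2610, 525, 264, 10950, 17847, 13, 151645,
              1699, 151644, 872, 1699, 3838, 374, 279, 14602, 315, 279,
              2701, 1196, 4755, 30, 1124, 6713, 498, 1492, 752, 3270,
              264, 12126, 151645, 1699, 151644, 77091, 1699]

# The lists match exactly, so tokenization is identical in both clients.
assert ids_run_py == ids_triton
print("input ids identical:", len(ids_run_py), "tokens")
```

This rules out the tokenizer as the source of the different outputs and points the investigation at the LoRA weight path or the generation settings of the two clients.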

byshiue commented 4 months ago

I have no idea yet. Could you share the end-to-end steps to reproduce (how you converted the checkpoint and how you built the engine)?