triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

decoding_mode top_k_top_p does not take effect for llama2; output is not the same as huggingface #461

Open yjjiang11 opened 4 months ago

yjjiang11 commented 4 months ago

System Info

CPU Architecture: AMD EPYC 7V13 64-Core Processor
CPU/Host memory size: 440 GB
GPU properties:
  GPU name: NVIDIA A800 80GB x2
  GPU mem size: 80 GB x 2
Libraries:
  TensorRT-LLM branch or tag: main
  TensorRT-LLM commit: https://github.com/triton-inference-server/tensorrtllm_backend/commit/ae52bce3ed8ecea468a16483e0dacd3d156ae4fe
  Versions of TensorRT, CUDA: 10.0.1, 12.4
  Container used: built from the tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend
  NVIDIA driver version: 535.161.07
  OS: Ubuntu 22.04.4 LTS
  Docker image version: custom, built from the main branch

Who can help?

No response


Reproduction

  1. Build the TRT-LLM container by running DOCKER_BUILDKIT=1 TORCH_CUDA_ARCH_LIST= docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend . (note the trailing . for the build context)

  2. Launch the container with this command sudo docker run -it --net host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/keith:/home triton_trt_llm:latest /bin/bash

  3. Build trt engines following the guide https://github.com/NVIDIA/TensorRT-LLM/tree/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/examples/llama
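
A minimal sketch of that build flow (an assumption based on the convert_checkpoint.py + trtllm-build workflow in the linked llama example; exact flags vary by TensorRT-LLM version, and tp_size=4 here matches the --world_size=4 used below):

# Convert the HF checkpoint for 4-way tensor parallelism, then build engines
python3 examples/llama/convert_checkpoint.py --model_dir /path/to/llama2-70b --output_dir /tmp/llama2_70b_ckpt --dtype float16 --tp_size 4
trtllm-build --checkpoint_dir /tmp/llama2_70b_ckpt --output_dir /path/to/llama2-70b/engines --gemm_plugin float16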

  4. Launch triton server

Set decoding_mode: top_k_top_p in the tensorrt_llm model's config.pbtxt (see the snippet after the launch command), then start the server:

python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/path/to/llama2-70b/repo --log
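
For reference, a minimal sketch of the relevant config.pbtxt stanza, assuming the standard parameters block of the tensorrt_llm model config:

parameters: {
  key: "decoding_mode"
  value: {
    string_value: "top_k_top_p"
  }
}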

  5. Query the server twice

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'

Expected behavior

{ "model_name": "ensemble", "model_version": "1", "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。" }

Actual behavior

The two responses are identical: { "model_name": "ensemble", "model_version": "1", "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。" }

Additional notes

Even when I set the decoding mode to top_p top_k, it still has no effect.

zixuxu000 commented 4 months ago

It may be because the first token among the top_k = 100 candidates has a very high probability, so it is very likely to be selected during sampling. You could try the beam_width parameter, which should control the number of beams for beam search.
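
For example, a hedged sketch (assuming beam_width is exposed as an optional field of the default ensemble, and that the engine was built with max_beam_width > 1):

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "beam_width": 2}'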

yjjiang11 commented 4 months ago

If beam_width is used, doesn't that mean the sampling algorithm is no longer top_k_top_p?

byshiue commented 3 months ago

When you pass

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'

twice, it uses the same random seed and always generates the same results. You should pass a different random_seed to get different results.
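
For example (a sketch assuming the random_seed field exposed by the default ensemble config; the seed value here is arbitrary, and varying it per request should vary the sampled output):

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1, "random_seed": 42}'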