triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

decoding_mode top_k_top_p does not take effect for llama2; output is not the same as huggingface #461

Open yjjiang11 opened 4 months ago

yjjiang11 commented 4 months ago

System Info

CPU Architecture: AMD EPYC 7V13 64-Core Processor
CPU/Host memory size: 440 GB
GPU properties:
  GPU name: NVIDIA A800 80GB x2
  GPU mem size: 80 GB x 2
Libraries:
  TensorRT-LLM branch or tag: main
  TensorRT-LLM commit: https://github.com/triton-inference-server/tensorrtllm_backend/commit/ae52bce3ed8ecea468a16483e0dacd3d156ae4fe
  Versions of TensorRT, CUDA: 10.0.1, 12.4
  Container used: built from the tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend
  NVIDIA driver version: 535.161.07
  OS: Ubuntu 22.04.4 LTS
  Docker image version: custom, built from the main branch

Who can help?

No response


Reproduction

  1. Build the TRT-LLM container by running DOCKER_BUILDKIT=1 TORCH_CUDA_ARCH_LIST= docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend . (note the trailing . for the build context)

  2. Launch the container with this command sudo docker run -it --net host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/keith:/home triton_trt_llm:latest /bin/bash

  3. Build trt engines following the guide https://github.com/NVIDIA/TensorRT-LLM/tree/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/examples/llama
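
A minimal sketch of that build flow (an assumption based on the convert_checkpoint.py + trtllm-build workflow in the linked llama example; exact flags vary by TensorRT-LLM version, and tp_size=4 here matches the --world_size=4 used below):

# Convert the HF checkpoint for 4-way tensor parallelism, then build engines
python3 examples/llama/convert_checkpoint.py --model_dir /path/to/llama2-70b --output_dir /tmp/llama2_70b_ckpt --dtype float16 --tp_size 4
trtllm-build --checkpoint_dir /tmp/llama2_70b_ckpt --output_dir /path/to/llama2-70b/engines --gemm_plugin float16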

  4. Launch triton server

Set decoding_mode: top_k_top_p in the tensorrt_llm model's config.pbtxt (see the snippet after the launch command), then start the server:

python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/path/to/llama2-70b/repo --log
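
For reference, a minimal sketch of the relevant config.pbtxt stanza, assuming the standard parameters block of the tensorrt_llm model config:

parameters: {
  key: "decoding_mode"
  value: {
    string_value: "top_k_top_p"
  }
}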

  5. Query the server twice

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'

Expected behavior

{ "model_name": "ensemble", "model_version": "1", "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。" }

Actual behavior

The two responses are identical: { "model_name": "ensemble", "model_version": "1", "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。" }

Additional notes

Even when I set the decoding mode to top_p top_k, it still has no effect.

zixuxu000 commented 4 months ago

It may be because the first token among the top_k = 100 candidates has a very high probability, so it is very likely to be selected during sampling. You could try the beam_width parameter, which should control the number of beams for beam search.
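
For example, a hedged sketch (assuming beam_width is exposed as an optional field of the default ensemble, and that the engine was built with max_beam_width > 1):

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "beam_width": 2}'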

yjjiang11 commented 4 months ago

If beam_width is used, doesn't that mean the sampling algorithm is no longer top_k_top_p?

byshiue commented 3 months ago

When you pass

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'

twice, it uses the same random seed and always generates the same results. You should pass a different random_seed to get different results.
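
For example (a sketch assuming the random_seed field exposed by the default ensemble config; the seed value here is arbitrary, and varying it per request should vary the sampled output):

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1, "random_seed": 42}'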