Open yjjiang11 opened 4 months ago
This is probably because the first token among the top_k = 100 candidates has a very high probability, so it is almost always selected during sampling. You could try the beam_width parameter, which should set the number of beams for beam search.
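To illustrate the point above with a toy sketch (plain Python, not TensorRT-LLM code): when one token in the top-k set carries most of the probability mass, sampling returns it almost every time, so even top_k = 100 can look deterministic.

```python
import random

# Toy next-token distribution over a top-k set of 100 candidates:
# one dominant token with probability 0.95, 99 tiny alternatives.
weights = [0.95] + [0.05 / 99] * 99

rng = random.Random(0)
samples = [rng.choices(range(100), weights=weights, k=1)[0] for _ in range(1000)]

# The dominant token wins the overwhelming majority of draws.
frac_dominant = samples.count(0) / len(samples)
print(f"token 0 sampled in {frac_dominant:.0%} of 1000 draws")
```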
But if I use beam_width, doesn't that mean the sampling algorithm is no longer top_k_top_p?
When you pass
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'
twice, it uses the same random seed and always generates the same result. You should pass a different random_seed to get different results.
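The seed behavior can be mimicked with a toy sampler (plain Python standing in for the backend's RNG; per the comment above, the fix on the Triton side is to pass a different random_seed with each request):

```python
import random

weights = [0.6, 0.3, 0.1]  # toy top-k token distribution

def sample_tokens(seed, n=20):
    """Draw n token ids using an RNG seeded like a per-request random_seed."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=n)

# Two requests with the same seed replay the exact same token sequence,
# which is why two identical curl calls return identical text.
assert sample_tokens(seed=1234) == sample_tokens(seed=1234)

# A different seed lets the draws diverge.
print(sample_tokens(seed=1234))
print(sample_tokens(seed=5678))
```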
System Info
CPU Architecture: AMD EPYC 7V13 64-Core Processor
CPU/Host memory size: 440
GPU name: NVIDIA A800 80GB x2
GPU mem size: 80 GB x 2
TensorRT-LLM branch or tag: main
TensorRT-LLM commit: https://github.com/triton-inference-server/tensorrtllm_backend/commit/ae52bce3ed8ecea468a16483e0dacd3d156ae4fe
Versions of TensorRT, CUDA: (10.0.1, 12.4)
Container used: built from tensorrtllm_backend main branch using dockerfile/Dockerfile.trt_llm_backend
NVIDIA driver version: 535.161.07
OS: Ubuntu 22.04.4 LTS
Docker image version: custom built from main branch
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Build the TensorRT-LLM backend container by running

DOCKER_BUILDKIT=1 TORCH_CUDA_ARCH_LIST= docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
Launch the container with this command
sudo docker run -it --net host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/keith:/home triton_trt_llm:latest /bin/bash
Build trt engines following the guide https://github.com/NVIDIA/TensorRT-LLM/tree/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d/examples/llama
Launch the Triton server with decoding_mode set to top_k_top_p:

python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/path/to/llama2-70b/repo --log

Then send the same generate request twice:
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "背一首诗", "max_tokens": 20, "bad_words": "", "stop_words": "</s>", "top_k": 100, "top_p": 1}'
Expected behavior

Repeated requests with sampling enabled should return different outputs, e.g.:

{ "model_name": "ensemble", "model_version": "1", "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。" }
Actual behavior

The two responses are identical:

{ "model_name": "ensemble", "model_version": "1", "sequence_end": false, "sequence_id": 0, "sequence_start": false, "text_output": "背诵一首诗\n\n《登高》 杜甫\n\n风急天高猿啸哀,渚清沙白鸟飞回。无边落木萧萧下,不尽长江滚滚来。万里悲秋常作客,百年多病独登台。艰难苦恨繁霜鬓,潦倒新停浊酒杯。" }
Additional notes
Setting decoding_mode to top_p or top_k instead also has no effect; the responses are still identical.