triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

xverse-65b error #347

Closed lwbmowgli closed 7 months ago

lwbmowgli commented 8 months ago

I successfully built xverse-65b using the llama example and deployed it with Triton, but an error occurs during inference. What is the cause, and how should I fix it?

[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: Tensor 'past_key_value_0' has invalid shape (1, 2, 8, 1536, 128) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:150)
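For reference, this shape appears to follow the layout of the non-paged KV cache used by the gpt_attention plugin, [batch, 2 (K and V), KV heads per rank, max sequence length, head dim], so (1, 2, 8, 1536, 128) would be consistent with batch size 1, 64 attention heads split across tp_size 8, and a build-time cache length of 1536. A common trigger for this assertion is a mismatch between the sequence-length limits the engine was built with and the limits the runtime (e.g. the Triton model configuration) assumes. A minimal sanity check, assuming XVERSE-65B uses 64 heads with head dim 128 (assumed values, not taken from this thread):

```bash
# Hedged sanity check: do the assumed build-time limits reproduce the shape
# in the error? All values below are assumptions; read the real ones from the
# model's config.json and the build command actually used.
NUM_HEADS=64          # assumed XVERSE-65B attention head count
HEAD_DIM=128          # assumed head dimension
TP_SIZE=8
MAX_INPUT_LEN=1024    # assumed; 1024 + 512 = 1536 matches the reported shape
MAX_OUTPUT_LEN=512

echo "KV heads per rank: $((NUM_HEADS / TP_SIZE))"             # expect 8
echo "cache seq length:  $((MAX_INPUT_LEN + MAX_OUTPUT_LEN))"  # expect 1536
echo "head dim:          ${HEAD_DIM}"                          # expect 128
```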

lwbmowgli commented 8 months ago

Here is my build command:

    python build_xverse.py --model_dir xverse-65b \
        --use_gpt_attention_plugin float16 \
        --use_weight_only \
        --weight_only_precision int4 \
        --max_batch_size 1 \
        --output_dir XVERSE-65B \
        --world_size 8 \
        --tp_size 8
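Note that this command leaves --max_input_len and --max_output_len unset, so the engine is built with the script's defaults, and the KV cache is sized to their sum. If the serving-side configuration assumes different limits, the shape assertion above can fire. A hedged variant that pins the limits explicitly (1024/512 are illustrative values, chosen only because they sum to the 1536 in the error):

```bash
# Same build, with the sequence-length limits made explicit so the engine and
# the Triton config can be kept in sync. 1024/512 are illustrative values.
python build_xverse.py --model_dir xverse-65b \
    --use_gpt_attention_plugin float16 \
    --use_weight_only \
    --weight_only_precision int4 \
    --max_batch_size 1 \
    --max_input_len 1024 \
    --max_output_len 512 \
    --output_dir XVERSE-65B \
    --world_size 8 \
    --tp_size 8
```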

StudyingShao commented 7 months ago

@lwbmowgli what is build_xverse.py here? Are you using main branch now? Please give more details here.

lwbmowgli commented 7 months ago

Since this model has the same architecture as llama, I directly used https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.1/examples/llama/build.py, so build_xverse.py is simply that script. The 7b and 13b versions both serve normally, but the 65b version reports the error above. The image I am using is 23.12; should I switch to 24.01 and test?
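One thing worth ruling out before switching images: the engine must be built with the same TensorRT-LLM version that the serving container runs, otherwise engines can fail to load or misbehave at runtime. A quick, generic check to run in both the build environment and the Triton container:

```bash
# Print the TensorRT-LLM version in the current environment; the build and
# serving environments should report the same version.
python -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```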

StudyingShao commented 7 months ago

I see. I will try it with the latest main branch and get back to you later.

StudyingShao commented 7 months ago

BTW, I see that --use_weight_only and --weight_only_precision int4 are passed here. Do you want int4 weight-only per-channel quantization? If so, a more fine-grained quantization method is recommended: please try AWQ with the latest main branch. It gives better accuracy and a performance boost at the bs = 1 you are using. @lwbmowgli
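For reference, the int4 AWQ path in the v0.7.1-era llama example is a two-step flow: quantize (with calibration) first, then build from the quantized checkpoint. The sketch below follows that layout, but every script path, flag, and file name in it is an assumption: releases differ, and the main branch has since reorganized the quantization workflow, so verify against --help in your checkout.

```bash
# Sketch of a two-step int4 AWQ build, modeled on the v0.7.1 llama example.
# All paths, flags, and file names are assumptions to verify locally.

# Step 1: produce an AWQ-quantized checkpoint (calibration happens here).
python examples/quantization/quantize.py --model_dir xverse-65b \
    --dtype float16 \
    --qformat int4_awq \
    --export_path xverse-65b-awq

# Step 2: build the engine from the quantized checkpoint. The .npz name is
# illustrative; use whatever file step 1 actually exported.
python build_xverse.py --model_dir xverse-65b \
    --quant_ckpt_path xverse-65b-awq/llama_tp8_rank0.npz \
    --use_gpt_attention_plugin float16 \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --max_batch_size 1 \
    --output_dir XVERSE-65B-AWQ \
    --world_size 8 \
    --tp_size 8
```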

lwbmowgli commented 7 months ago

Hi, is there any news on this? @StudyingShao