triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Building the Qwen2-72B model into INT4-AWQ TensorRT engines fails #566

Open wangpeilin opened 3 months ago

wangpeilin commented 3 months ago

System Info

Who can help?

@Tracin @kaiyux

Information

Tasks

Reproduction

1. Start the Triton container:

   docker run -itd --name xxx --gpus=all -p8000:8000 -p8001:8001 -p8002:8002 \
     -v /share/datasets:/share/datasets \
     nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

2. Clone the code (version 0.11.0):

   git clone https://github.com/NVIDIA/TensorRT-LLM.git
   git clone https://github.com/triton-inference-server/tensorrtllm_backend.git

3. Quantize the checkpoint (a shape-check sketch follows this list):

   cd TensorRT-LLM/examples
   python3 ./quantization/quantize.py \
     --model_dir /path/Qwen_Qwen2-72B-Instruct/ \
     --output_dir /path/Qwen_Qwen2-72B-Instruct_int4_awq_4gpu \
     --dtype bfloat16 \
     --qformat int4_awq \
     --awq_block_size 128 \
     --calib_size 32 \
     --tp_size 4

4. Build the engines:

   trtllm-build \
     --checkpoint_dir /path/Qwen_Qwen2-72B-Instruct_int4_awq_4gpu/ \
     --output_dir triton_model_repo/Qwen_Qwen2-72B-Instruct_int4_awq/tensorrt_llm/1/ \
     --gemm_plugin auto
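For anyone reproducing this, here is a minimal pre-flight sketch (a hypothetical helper, not part of quantize.py) that reads the Hugging Face config and tests whether the per-TP-rank MLP width is divisible by the chosen AWQ block size. The divisibility rule is inferred from the errors reported below, not from TensorRT-LLM documentation:

```python
# Hypothetical pre-flight shape check; the per-rank divisibility rule is
# inferred from the errors reported in this issue.
import json

def check_awq_block_size(hf_config_path: str, tp_size: int, awq_block_size: int) -> bool:
    with open(hf_config_path) as f:
        cfg = json.load(f)
    per_rank = cfg["intermediate_size"] // tp_size  # MLP width split across TP ranks
    ok = per_rank % awq_block_size == 0
    print(f"intermediate_size={cfg['intermediate_size']}, per-rank={per_rank}, "
          f"divisible by block size {awq_block_size}: {ok}")
    return ok

# Example (path elided as in the steps above):
# check_awq_block_size("/path/Qwen_Qwen2-72B-Instruct/config.json", tp_size=4, awq_block_size=128)
```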

Expected behavior

Successfully convert the model to a quantized checkpoint and build the TensorRT engines.

actual behavior

When I set tp_size=4 and awq_block_size=128 or 64, quantize.py reports "Weight shape is not divisible for block size for block quantization." (error1). When I set tp_size=4 and awq_block_size=32 or 16, quantize.py in step 3 succeeds, but trtllm-build fails with error2 below.

error1

[screenshot of the quantize.py output showing: "Weight shape is not divisible for block size for block quantization."]

error2

RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 4096 and num_col_bytes = 3696. (/workspace/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:279)
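For what it's worth, the numbers in error2 line up with Qwen2-72B's shapes. The sketch below redoes the byte arithmetic, assuming hidden_size = 8192 and intermediate_size = 29568 (taken from the model's Hugging Face config, not from the error log itself):

```python
# Rough arithmetic behind error2 (assumed shapes: hidden_size = 8192,
# intermediate_size = 29568 from Qwen2-72B's Hugging Face config).
hidden_size = 8192
intermediate_size = 29568
tp_size = 4
bits = 4  # int4_awq packs two values per byte

cols_per_rank = intermediate_size // tp_size   # 7392 int4 values per TP rank
num_col_bytes = cols_per_rank * bits // 8      # 3696 bytes, matches the error
num_row_bytes = hidden_size * bits // 8        # 4096 bytes, matches the error

print(num_row_bytes % 32)  # 0  -> rows are fine
print(num_col_bytes % 32)  # 16 -> not a multiple of 32, trips the CUTLASS assert
```

If this arithmetic is right, even the block sizes that pass quantize.py (32 and 16) still leave a per-rank dimension whose packed byte count is not 32-aligned, which is what the CUTLASS preprocessor rejects.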

additional notes

This issue seems to be due to the weight shape of the Qwen2-72B model. I built INT4-AWQ quantizations of Qwen1.5-72B and Llama-3-70B successfully.
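A quick comparison supports that guess. Assuming the intermediate_size values from each model's Hugging Face config (29568 for Qwen2-72B, 24576 for Qwen1.5-72B, 28672 for Llama-3-70B), only Qwen2-72B ends up with a per-rank MLP width that is not divisible by the default AWQ block size at tp_size=4:

```python
# Assumed intermediate_size values, taken from each model's Hugging Face config.
models = {"Qwen2-72B": 29568, "Qwen1.5-72B": 24576, "Llama-3-70B": 28672}
tp_size, block_size = 4, 128

for name, inter in models.items():
    per_rank = inter // tp_size
    ok = per_rank % block_size == 0
    print(f"{name}: {per_rank} columns per rank, divisible by {block_size}: {ok}")
# Qwen2-72B: 7392 -> False; Qwen1.5-72B: 6144 -> True; Llama-3-70B: 7168 -> True
```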

wangpeilin commented 6 days ago

Hi @kaiyux @Tracin, is there any resolution for this issue now?