trtllm-build \
--checkpoint_dir /path/Qwen_Qwen2-72B-Instruct_int4_awq_4gpu/ \
--output_dir triton_model_repo/Qwen_Qwen2-72B-Instruct_int4_awq/tensorrt_llm/1/ \
--gemm_plugin auto
Expected behavior
Successful conversion of the model to a quantized checkpoint and TensorRT engines.
Actual behavior
When I set tp_size=4 and awq_block_size=128 or 64, quantize.py fails with error1: "Weight shape is not divisible for block size for block quantization."
When I set tp_size=4 and awq_block_size=32 or 16, step 3 (quantize.py) succeeds, but trtllm-build fails with error2.
error1
Weight shape is not divisible for block size for block quantization.
error2
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Number of bytes for rows and cols must be a multiple of 32. However, num_rows_bytes = 4096 and num_col_bytes = 3696. (/workspace/tensorrt_llm/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:279)
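For context, the numbers in error2 are consistent with Qwen2-72B's FFN width, assuming intermediate_size = 29568 (an assumption; please verify in the model's config.json): sharding across tp_size=4 gives 7392 columns per GPU, and packing INT4 weights two per byte gives 3696 bytes, which is not a multiple of 32. A minimal sketch of that arithmetic:

```python
# Sketch of the arithmetic behind error2, assuming Qwen2-72B's
# intermediate_size = 29568 (check the model's config.json).
intermediate_size = 29568
tp_size = 4

cols_per_gpu = intermediate_size // tp_size  # 7392 columns per TP shard
num_col_bytes = cols_per_gpu // 2            # INT4 packs two weights per byte -> 3696

print(num_col_bytes, num_col_bytes % 32)     # 3696 is not a multiple of 32
```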
Additional notes
This issue seems to be caused by the weight shapes of the Qwen2-72B model; I built quantized engines for Qwen1.5-72B and Llama-3-70B successfully.
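A quick way to see which awq_block_size values can even pass quantization is to check whether the sharded weight dimension divides evenly by the block size. This hedged sketch (again assuming intermediate_size = 29568 for Qwen2-72B) reproduces the pattern above, where 128 and 64 fail at quantize time while 32 and 16 pass:

```python
# Hedged sketch: block-size divisibility check for the sharded FFN dimension.
# intermediate_size = 29568 is an assumption taken from Qwen2-72B's config.
intermediate_size = 29568
tp_size = 4
cols_per_gpu = intermediate_size // tp_size  # 7392

for block_size in (128, 64, 32, 16):
    ok = cols_per_gpu % block_size == 0
    print(f"awq_block_size={block_size}: {'divisible' if ok else 'NOT divisible'}")
# 128 and 64 are not divisible (quantize.py error1); 32 and 16 are divisible,
# but the packed INT4 width (3696 bytes) still trips the multiple-of-32
# assertion in trtllm-build (error2).
```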
Who can help?
@Tracin @kaiyux