triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

The crash occurred when attempting to quantize the LLaMA model with W4A(fp)8_AWQ. #415

Closed pandengyao closed 2 months ago

pandengyao commented 2 months ago

System Info

Who can help?

@Tracin

Information

Tasks

Reproduction

python ../quantization/quantize.py --model_dir MODEL_PATH \
    --dtype float16 \
    --qformat w4a8_awq \
    --awq_block_size 128 \
    --output_dir ./quantized_int4-awq \
    --calib_size 32

trtllm-build --checkpoint_dir ./quantized_int4-awq \
    --output_dir OUTPUT_PATH \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --paged_kv_cache enable \
    --max_prompt_embedding_table_size 2048 \
    --max_batch_size 32 \
    --remove_input_padding enable \
    --use_paged_context_fmha enable \
    --multi_block_mode enable \
    --gather_all_token_logits \
    --max_input_len 2048

Expected behavior

The engine builds successfully.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[04/18/2024-02:48:51] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set lookup_plugin to None.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set lora_plugin to None.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set moe_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set context_fmha to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set remove_input_padding to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set multi_block_mode to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set enable_xqa to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set multiple_profiles to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set paged_state to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set streamingllm to False.
[04/18/2024-02:48:51] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/18/2024-02:48:51] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[04/18/2024-02:48:51] [TRT-LLM] [W] Fail to infer cluster key, use A100-SXM-80GB as fallback.
[04/18/2024-02:49:03] [TRT] [I] [MemUsageChange] Init CUDA: CPU +12, GPU +0, now: CPU 3194, GPU 469 (MiB)
[04/18/2024-02:49:04] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1809, GPU +316, now: CPU 5139, GPU 785 (MiB)
[04/18/2024-02:49:04] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to float16.
[04/18/2024-02:49:04] [TRT-LLM] [I] Set nccl_plugin to None.
[04/18/2024-02:49:04] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/18/2024-02:49:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/SELECT_2_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/18/2024-02:49:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: W4A(fp)8 kernel is unsupported on pre-Hopper architectures! (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/weightOnlyGroupwiseQuantMatmulPlugin/weightOnlyGroupwiseQuantMatmulPlugin.cpp:152)
1  0x7014245f8b30 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x53b30) [0x7014245f8b30]
2  0x701424713f7a tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::WeightOnlyGroupwiseQuantMatmulPlugin(nvinfer1::DataType, int, int, std::shared_ptr const&) + 250
3  0x701424714352 tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPluginCreator::createPlugin(char const*, nvinfer1::PluginFieldCollection const*) + 434
4  0x7014e676060a /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x16060a) [0x7014e676060a]
5  0x7014e6643443 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x43443) [0x7014e6643443]
6  0x594282dfc10e /usr/bin/python3(+0x15a10e) [0x594282dfc10e]
7  0x594282df2a7b _PyObject_MakeTpCall + 603
8  0x594282e0aacb /usr/bin/python3(+0x168acb) [0x594282e0aacb]
9  0x594282deacfa _PyEval_EvalFrameDefault + 24906
10 0x594282dfc9fc _PyFunction_Vectorcall + 124
11 0x594282de526d _PyEval_EvalFrameDefault + 1725
12 0x594282e0a93e /usr/bin/python3(+0x16893e) [0x594282e0a93e]
13 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
14 0x594282df1c14 _PyObject_FastCallDictTstate + 196
15 0x594282e0786c _PyObject_Call_Prepend + 92
16 0x594282f22700 /usr/bin/python3(+0x280700) [0x594282f22700]
17 0x594282df2a7b _PyObject_MakeTpCall + 603
18 0x594282deb629 _PyEval_EvalFrameDefault + 27257
19 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
20 0x594282e0b492 PyObject_Call + 290
21 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
22 0x594282dfc9fc _PyFunction_Vectorcall + 124
23 0x594282df1cbd _PyObject_FastCallDictTstate + 365
24 0x594282e0786c _PyObject_Call_Prepend + 92
25 0x594282f22700 /usr/bin/python3(+0x280700) [0x594282f22700]
26 0x594282df2a7b _PyObject_MakeTpCall + 603
27 0x594282dec150 _PyEval_EvalFrameDefault + 30112
28 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
29 0x594282e0b492 PyObject_Call + 290
30 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
31 0x594282dfc9fc _PyFunction_Vectorcall + 124
32 0x594282df1cbd _PyObject_FastCallDictTstate + 365
33 0x594282e0786c _PyObject_Call_Prepend + 92
34 0x594282f22700 /usr/bin/python3(+0x280700) [0x594282f22700]
35 0x594282e0b42b PyObject_Call + 187
36 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
37 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
38 0x594282de653c _PyEval_EvalFrameDefault + 6540
39 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
40 0x594282e0b492 PyObject_Call + 290
41 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
42 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
43 0x594282e0b492 PyObject_Call + 290
44 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
45 0x594282dfc9fc _PyFunction_Vectorcall + 124
46 0x594282df1cbd _PyObject_FastCallDictTstate + 365
47 0x594282e0786c _PyObject_Call_Prepend + 92
48 0x594282f22700 /usr/bin/python3(+0x280700) [0x594282f22700]
49 0x594282e0b42b PyObject_Call + 187
50 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
51 0x594282dfc9fc _PyFunction_Vectorcall + 124
52 0x594282de526d _PyEval_EvalFrameDefault + 1725
53 0x594282dfc9fc _PyFunction_Vectorcall + 124
54 0x594282e0b492 PyObject_Call + 290
55 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
56 0x594282dfc9fc _PyFunction_Vectorcall + 124
57 0x594282e0b492 PyObject_Call + 290
58 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
59 0x594282dfc9fc _PyFunction_Vectorcall + 124
60 0x594282e0b492 PyObject_Call + 290
61 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
62 0x594282dfc9fc _PyFunction_Vectorcall + 124
63 0x594282de526d _PyEval_EvalFrameDefault + 1725
64 0x594282de19c6 /usr/bin/python3(+0x13f9c6) [0x594282de19c6]
65 0x594282ed7256 PyEval_EvalCode + 134
66 0x594282f02108 /usr/bin/python3(+0x260108) [0x594282f02108]
67 0x594282efb9cb /usr/bin/python3(+0x2599cb) [0x594282efb9cb]
68 0x594282f01e55 /usr/bin/python3(+0x25fe55) [0x594282f01e55]
69 0x594282f01338 _PyRun_SimpleFileObject + 424
70 0x594282f00f83 _PyRun_AnyFileObject + 67
71 0x594282ef3a5e Py_RunMain + 702
72 0x594282eca02d Py_BytesMain + 45
73 0x7016556c8d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7016556c8d90]
74 0x7016556c8e40 __libc_start_main + 128
75 0x594282ec9f25 _start + 37
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 440, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 332, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 284, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 691, in build
    model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 630, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 178, in forward
    hidden_states = self.layers.forward(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 296, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 107, in forward
    attention_output = self.attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 637, in forward
    qkv = self.qkv(hidden_states, qkv_lora_params)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/layers.py", line 507, in forward
    x = weight_only_groupwise_quant_matmul(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/functional.py", line 185, in weight_only_groupwise_quant_matmul
    layer = default_trtnet().add_plugin_v2(plug_inputs, matmul_plug)
TypeError: add_plugin_v2(): incompatible function arguments. The following argument types are supported:

  1. (self: tensorrt.tensorrt.INetworkDefinition, inputs: List[tensorrt.tensorrt.ITensor], plugin: tensorrt.tensorrt.IPluginV2) -> tensorrt.tensorrt.IPluginV2Layer

Invoked with: <tensorrt.tensorrt.INetworkDefinition object at 0x701648142230>, [<tensorrt.tensorrt.ITensor object at 0x701648134bb0>, <tensorrt.tensorrt.ITensor object at 0x701648135db0>, <tensorrt.tensorrt.ITensor object at 0x7016481350f0>, <tensorrt.tensorrt.ITensor object at 0x701648134330>, <tensorrt.tensorrt.ITensor object at 0x701648135070>], None

Additional notes

The "python ../quantization/quantize.py" step succeeds, but the "trtllm-build --checkpoint_dir" step fails.
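As a side note, one way to confirm that the quantization step really wrote a W4A8-AWQ checkpoint before investigating trtllm-build is to inspect the checkpoint's config.json. This is only a sketch: it assumes the standard TensorRT-LLM 0.9 checkpoint layout (a config.json with a "quantization" section next to the rank weight files) and the ./quantized_int4-awq output directory from the reproduction above.

# Sanity-check sketch: print the "quantization" section of the generated checkpoint.
# Assumes quantize.py wrote config.json into ./quantized_int4-awq (the --output_dir above).
python3 -c "import json; print(json.load(open('./quantized_int4-awq/config.json')).get('quantization'))"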

Tracin commented 2 months ago

Just as the log says: W4A(fp)8 kernel is unsupported on pre-Hopper architectures!
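For reference, a quick way to check which compute capability (SM version) the build GPU reports is a one-liner like the following; this is just a sketch and assumes PyTorch is available in the environment (it ships in the TensorRT-LLM containers).

# Prints e.g. (8, 9) for Ada (SM89) or (9, 0) for Hopper (SM90) on GPU 0.
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"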

pandengyao commented 2 months ago

The crash occurred when attempting to quantize the LLaMA model with W4A(fp)8_AWQ on my NVIDIA RTX 4090, which features the Ada SM89 architecture.

Tracin commented 2 months ago

Exactly. To be precise, W4A(FP)8 is available on SM90 and later; support on SM89 is on the way.
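Until then, a possible fallback on Ada GPUs is to quantize with plain INT4-AWQ (INT4 weights, FP16 activations) instead of W4A8. A minimal sketch reusing the placeholders from the reproduction above; int4_awq is a separate qformat accepted by quantize.py, and whether it fits your accuracy/latency targets is workload-dependent:

# Hedged workaround sketch: same AWQ flow, but without the FP8 activation path that requires SM90.
python ../quantization/quantize.py --model_dir MODEL_PATH \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./quantized_int4-awq \
    --calib_size 32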

pandengyao commented 2 months ago

Approximately when will SM89 (RTX 4090) be supported?