Just as the log says, the W4A(fp)8 kernel is unsupported on pre-Hopper architectures!
The crash occurred when attempting to quantize the LLaMA model with W4A(fp)8_AWQ on my NVIDIA RTX 4090, which uses the Ada architecture (SM89).
Exactly. To be precise, W4A(FP)8 is available on SM90 and later, and support on SM89 is on the way.
Approximately when will SM89 (RTX 4090) be supported?
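In the meantime, a minimal pre-flight check can encode the SM90 requirement before kicking off calibration. This is a sketch assuming PyTorch is installed; `supports_w4a8_fp8` is an illustrative helper, not a TensorRT-LLM API:

```python
# Sketch: refuse w4a8_awq on pre-Hopper GPUs up front.
# Assumes PyTorch; SM90 (Hopper) reports compute capability (9, 0),
# while the RTX 4090 (Ada) reports (8, 9).
import torch

def supports_w4a8_fp8(device: int = 0) -> bool:
    major, minor = torch.cuda.get_device_capability(device)
    return (major, minor) >= (9, 0)

if not supports_w4a8_fp8():
    print("Pre-Hopper GPU detected (e.g. SM89/RTX 4090): "
          "w4a8_awq is unsupported; consider int4_awq instead.")
```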
System Info
- CPU/Host memory size: 64G
- GPU properties: NVIDIA RTX 4090 (Ada, SM89)
- Libraries
- TensorRT-LLM branch or tag: v0.9.0
- TensorRT-LLM commit: 250d9c293d5edbc2a45c20775b3150b1eb68b364
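For reports like this, the GPU properties field can be filled with a quick query; this assumes a driver recent enough to expose the `compute_cap` query field:

```
nvidia-smi --query-gpu=name,compute_cap --format=csv
```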
Who can help?
@Tracin
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
```
python ../quantization/quantize.py --model_dir MODEL_PATH \
    --dtype float16 \
    --qformat w4a8_awq \
    --awq_block_size 128 \
    --output_dir ./quantized_int4-awq \
    --calib_size 32
```
```
trtllm-build --checkpoint_dir ./quantized_int4-awq \
    --output_dir OUTPUT_PATH \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --paged_kv_cache enable \
    --max_prompt_embedding_table_size 2048 \
    --max_batch_size 32 \
    --remove_input_padding enable \
    --use_paged_context_fmha enable \
    --multi_block_mode enable \
    --gather_all_token_logits \
    --max_input_len 2048
```
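Until SM89 support lands, a workaround sketch is to run the same pipeline with plain W4A16 AWQ (`--qformat int4_awq`), which quantize.py in v0.9.0 does support on Ada. Only `--qformat` changes from the command above; `MODEL_PATH` remains a placeholder:

```
# Same pipeline, but with W4A16 AWQ instead of W4A(fp)8 AWQ.
python ../quantization/quantize.py --model_dir MODEL_PATH \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./quantized_int4-awq \
    --calib_size 32
```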
Expected behavior
The engine builds successfully.
actual behavior
```
[TensorRT-LLM] TensorRT-LLM version: 0.9.0
[04/18/2024-02:48:51] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set gemm_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set lookup_plugin to None.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set lora_plugin to None.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set moe_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set mamba_conv1d_plugin to float16.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set context_fmha to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set remove_input_padding to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set multi_block_mode to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set enable_xqa to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set multiple_profiles to False.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set paged_state to True.
[04/18/2024-02:48:51] [TRT-LLM] [I] Set streamingllm to False.
[04/18/2024-02:48:51] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/18/2024-02:48:51] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[04/18/2024-02:48:51] [TRT-LLM] [W] Fail to infer cluster key, use A100-SXM-80GB as fallback.
[04/18/2024-02:49:03] [TRT] [I] [MemUsageChange] Init CUDA: CPU +12, GPU +0, now: CPU 3194, GPU 469 (MiB)
[04/18/2024-02:49:04] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1809, GPU +316, now: CPU 5139, GPU 785 (MiB)
[04/18/2024-02:49:04] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to float16.
[04/18/2024-02:49:04] [TRT-LLM] [I] Set nccl_plugin to None.
[04/18/2024-02:49:04] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/18/2024-02:49:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/SELECT_2_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[04/18/2024-02:49:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: W4A(fp)8 kernel is unsupported on pre-Hopper architectures! (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/weightOnlyGroupwiseQuantMatmulPlugin/weightOnlyGroupwiseQuantMatmulPlugin.cpp:152)
1 0x7014245f8b30 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x53b30) [0x7014245f8b30]
2 0x701424713f7a tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPlugin::WeightOnlyGroupwiseQuantMatmulPlugin(nvinfer1::DataType, int, int, std::shared_ptr const&) + 250
3 0x701424714352 tensorrt_llm::plugins::WeightOnlyGroupwiseQuantMatmulPluginCreator::createPlugin(char const*, nvinfer1::PluginFieldCollection const*) + 434
4 0x7014e676060a /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x16060a) [0x7014e676060a]
5 0x7014e6643443 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x43443) [0x7014e6643443]
6 0x594282dfc10e /usr/bin/python3(+0x15a10e) [0x594282dfc10e]
7 0x594282df2a7b _PyObject_MakeTpCall + 603
8 0x594282e0aacb /usr/bin/python3(+0x168acb) [0x594282e0aacb]
9 0x594282deacfa _PyEval_EvalFrameDefault + 24906
10 0x594282dfc9fc _PyFunction_Vectorcall + 124
11 0x594282de526d _PyEval_EvalFrameDefault + 1725
12 0x594282e0a93e /usr/bin/python3(+0x16893e) [0x594282e0a93e]
13 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
14 0x594282df1c14 _PyObject_FastCallDictTstate + 196
15 0x594282e0786c _PyObject_Call_Prepend + 92
16 0x594282f22700 /usr/bin/python3(+0x280700) [0x594282f22700]
17 0x594282df2a7b _PyObject_MakeTpCall + 603
18 0x594282deb629 _PyEval_EvalFrameDefault + 27257
19 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
20 0x594282e0b492 PyObject_Call + 290
21 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
22 0x594282dfc9fc _PyFunction_Vectorcall + 124
23 0x594282df1cbd _PyObject_FastCallDictTstate + 365
24 0x594282e0786c _PyObject_Call_Prepend + 92
25 0x594282f22700 /usr/bin/python3(+0x280700) [0x594282f22700]
26 0x594282df2a7b _PyObject_MakeTpCall + 603
27 0x594282dec150 _PyEval_EvalFrameDefault + 30112
28 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
29 0x594282e0b492 PyObject_Call + 290
30 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
31 0x594282dfc9fc _PyFunction_Vectorcall + 124
32 0x594282df1cbd _PyObject_FastCallDictTstate + 365
33 0x594282e0786c _PyObject_Call_Prepend + 92
34 0x594282f22700 /usr/bin/python3(+0x280700) [0x594282f22700]
35 0x594282e0b42b PyObject_Call + 187
36 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
37 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
38 0x594282de653c _PyEval_EvalFrameDefault + 6540
39 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
40 0x594282e0b492 PyObject_Call + 290
41 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
42 0x594282e0a7f1 /usr/bin/python3(+0x1687f1) [0x594282e0a7f1]
43 0x594282e0b492 PyObject_Call + 290
44 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
45 0x594282dfc9fc _PyFunction_Vectorcall + 124
46 0x594282df1cbd _PyObject_FastCallDictTstate + 365
47 0x594282e0786c _PyObject_Call_Prepend + 92
48 0x594282f22700 /usr/bin/python3(+0x280700) [0x594282f22700]
49 0x594282e0b42b PyObject_Call + 187
50 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
51 0x594282dfc9fc _PyFunction_Vectorcall + 124
52 0x594282de526d _PyEval_EvalFrameDefault + 1725
53 0x594282dfc9fc _PyFunction_Vectorcall + 124
54 0x594282e0b492 PyObject_Call + 290
55 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
56 0x594282dfc9fc _PyFunction_Vectorcall + 124
57 0x594282e0b492 PyObject_Call + 290
58 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
59 0x594282dfc9fc _PyFunction_Vectorcall + 124
60 0x594282e0b492 PyObject_Call + 290
61 0x594282de75d7 _PyEval_EvalFrameDefault + 10791
62 0x594282dfc9fc _PyFunction_Vectorcall + 124
63 0x594282de526d _PyEval_EvalFrameDefault + 1725
64 0x594282de19c6 /usr/bin/python3(+0x13f9c6) [0x594282de19c6]
65 0x594282ed7256 PyEval_EvalCode + 134
66 0x594282f02108 /usr/bin/python3(+0x260108) [0x594282f02108]
67 0x594282efb9cb /usr/bin/python3(+0x2599cb) [0x594282efb9cb]
68 0x594282f01e55 /usr/bin/python3(+0x25fe55) [0x594282f01e55]
69 0x594282f01338 _PyRun_SimpleFileObject + 424
70 0x594282f00f83 _PyRun_AnyFileObject + 67
71 0x594282ef3a5e Py_RunMain + 702
72 0x594282eca02d Py_BytesMain + 45
73 0x7016556c8d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7016556c8d90]
74 0x7016556c8e40 __libc_start_main + 128
75 0x594282ec9f25 _start + 37
```
```
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 440, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 332, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 291, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 284, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 691, in build
    model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 630, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 178, in forward
    hidden_states = self.layers.forward(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 296, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 107, in forward
    attention_output = self.attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 637, in forward
    qkv = self.qkv(hidden_states, qkv_lora_params)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/layers.py", line 507, in forward
    x = weight_only_groupwise_quant_matmul(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/functional.py", line 185, in weight_only_groupwise_quant_matmul
    layer = default_trtnet().add_plugin_v2(plug_inputs, matmul_plug)
TypeError: add_plugin_v2(): incompatible function arguments. The following argument types are supported:
Invoked with: <tensorrt.tensorrt.INetworkDefinition object at 0x701648142230>, [<tensorrt.tensorrt.ITensor object at 0x701648134bb0>, <tensorrt.tensorrt.ITensor object at 0x701648135db0>, <tensorrt.tensorrt.ITensor object at 0x7016481350f0>, <tensorrt.tensorrt.ITensor object at 0x701648134330>, <tensorrt.tensorrt.ITensor object at 0x701648135070>], None
```
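The TypeError is a downstream symptom: the C++ plugin constructor throws the TllmException logged above, `createPlugin` then returns a null pointer, and the Python binding hands `None` to `add_plugin_v2()`, which rejects it. A hypothetical guard, not the actual TensorRT-LLM code, sketching how the call site in `quantization/functional.py` could fail fast instead:

```python
# Hypothetical sketch (names taken from the traceback, not TRT-LLM's API):
# raise a clear error when plugin creation returns None, rather than
# letting add_plugin_v2() fail with a confusing TypeError.
def add_plugin_checked(network, plug_inputs, plugin,
                       name="WeightOnlyGroupwiseQuantMatmul"):
    if plugin is None:
        raise RuntimeError(
            f"{name} plugin creation failed; the W4A(fp)8 kernel needs "
            "SM90 (Hopper) or newer, so pick another qformat on this GPU.")
    return network.add_plugin_v2(plug_inputs, plugin)
```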
additional notes
The `python ../quantization/quantize.py` step succeeds, but the `trtllm-build --checkpoint_dir ...` step fails. That split is consistent with the trace: the SM check sits in the WeightOnlyGroupwiseQuantMatmulPlugin constructor, so it only fires when trtllm-build instantiates the TensorRT plugin, not during quantization.