huggingface / optimum-nvidia


Docker container fails on RTX A6000 #122

Open RomanKoshkin opened 5 months ago

RomanKoshkin commented 5 months ago

EDIT: The pre-built Docker image (mentioned in the README.md) fails. I later built the container from source, and that works.
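For context, this is roughly the call that triggers the failure. A simplified sketch only: my real call site is t2tt_agent_llama3_nv_optimum.py line 85 and passes additional kwargs, and the "text-generation" task string is my assumption based on the README example; the entry point and model name match the traceback below.

    # Run inside the pre-built optimum-nvidia container; assumes HF
    # credentials for the gated meta-llama repo are already configured.
    from optimum.nvidia.pipelines import pipeline

    # Constructing the pipeline kicks off checkpoint conversion and
    # trtllm-build, which is where the bfloat16 assertion below fires,
    # so generation is never reached.
    pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct")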

Here's the full error dump:

Fetching 1 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3075.00it/s]
2024-04-26 08:00:02 | INFO | root | No engine file found in /home/roman/flash/huggingface_cache/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298, converting and building engines
2024-04-26 08:00:02 | INFO | root | Defined logits dtype to: float32
2024-04-26 08:00:02 | INFO | root | Defined engine inference profile: InferenceProfile(max_batch_size=1, max_input_len=128, max_output_len=8064)
2024-04-26 08:00:02 | INFO | root | Defined engine generation profile: GenerationProfile(num_beams=1, max_draft_length=-1)
2024-04-26 08:00:02 | INFO | root | Defined plugins config: PluginConfig(bert_attention_plugin='disable', gpt_attention_plugin='bfloat16', gemm_plugin='bfloat16', smooth_quant_gemm_plugin=None, identity_plugin=None, layernorm_quantization_plugin='disable', rmsnorm_quantization_plugin='disable', nccl_plugin='disable', lookup_plugin=None, lora_plugin=None, weight_only_groupwise_quant_matmul_plugin=None, weight_only_quant_matmul_plugin=None, quantize_per_token_plugin=False, quantize_tensor_plugin=False, moe_plugin='disable', context_fmha=None, context_fmha_fp32_acc=None, paged_kv_cache='enable', remove_input_padding=True, use_custom_all_reduce=True, multi_block_mode=None, enable_xqa='enable', attention_qk_half_accumulation=None, tokens_per_block=None, use_paged_context_fmha=None, use_context_fmha_for_generation=None, dense_context_fmha=False, pos_shift=False, multiple_profiles=False)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.08s/it]
2024-04-26 08:02:38 | INFO | root | trtllm-build parameters: ['--checkpoint_dir', '/home/roman/flash/huggingface_cache/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/engines', '--output_dir', '/home/roman/flash/huggingface_cache/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/engines', '--model_config', '/home/roman/flash/huggingface_cache/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/engines/config.json', '--builder_opt', '3', '--logits_dtype', 'float32', '--max_beam_width', '1', '--max_batch_size', '1', '--max_input_len', '128', '--max_output_len', '8064', '--use_custom_all_reduce', 'enable', '--paged_kv_cache', 'enable', '--moe_plugin', 'disable', '--gpt_attention_plugin', 'bfloat16', '--bert_attention_plugin', 'disable', '--gemm_plugin', 'bfloat16', '--remove_input_padding', 'enable', '--enable_xqa', 'enable']
2024-04-26 08:02:55 | WARNING | root | trtllm-build stdout:
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024031900
[04/26/2024-08:02:50] [TRT-LLM] [I] Set bert_attention_plugin to None.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set lookup_plugin to None.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set lora_plugin to None.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set moe_plugin to None.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set context_fmha to True.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set remove_input_padding to True.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set multi_block_mode to False.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set enable_xqa to True.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set tokens_per_block to 128.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set dense_context_fmha to False.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set pos_shift to False.
[04/26/2024-08:02:50] [TRT-LLM] [I] Set multiple_profiles to False.
[04/26/2024-08:02:50] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[04/26/2024-08:02:50] [TRT-LLM] [W] Fail to infer cluster key, use A100-SXM-80GB as fallback.
[04/26/2024-08:02:53] [TRT] [I] [MemUsageChange] Init CUDA: CPU +9, GPU +0, now: CPU 286, GPU 256 (MiB)
[04/26/2024-08:02:54] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +244, GPU +40, now: CPU 666, GPU 296 (MiB)
[04/26/2024-08:02:54] [TRT-LLM] [I] Set nccl_plugin to None.
[04/26/2024-08:02:54] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[04/26/2024-08:02:54] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[04/26/2024-08:02:54] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[TensorRT-LLM][WARNING] Fall back to unfused MHA because of unsupported head size 128 in sm_61.
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: Unsupported data type, pre SM 80 GPUs do not support bfloat16 (/src/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:446)
1 0x7f2a71333137 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x4f137) [0x7f2a71333137]
2 0x7f2a713334f1 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x4f4f1) [0x7f2a713334f1]
3 0x7f2a71436207 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(int, int, int, int, int, float, tensorrt_llm::kernels::PositionEmbeddingType, int, float, tensorrt_llm::kernels::RotaryScalingType, float, int, int, int, bool, tensorrt_llm::kernels::ContextFMHAType, bool, bool, int, bool, tensorrt_llm::kernels::AttentionMaskType, bool, int, nvinfer1::DataType, int, bool, bool, int, bool, bool, bool, bool, bool) + 231
4 0x7f2a71436d15 tensorrt_llm::plugins::GPTAttentionPluginCreator::createPlugin(char const*, nvinfer1::PluginFieldCollection const*) + 2757
5 0x7f2bbbb6060a /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x16060a) [0x7f2bbbb6060a]
6 0x7f2bbba43443 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x43443) [0x7f2bbba43443]
7 0x56064710610e /usr/bin/python(+0x15a10e) [0x56064710610e]
8 0x5606470fca7b _PyObject_MakeTpCall + 603
9 0x560647114acb /usr/bin/python(+0x168acb) [0x560647114acb]
10 0x5606470f4cfa _PyEval_EvalFrameDefault + 24906
11 0x5606471069fc _PyFunction_Vectorcall + 124
12 0x560647115492 PyObject_Call + 290
13 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
14 0x5606471069fc _PyFunction_Vectorcall + 124
15 0x560647115492 PyObject_Call + 290
16 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
17 0x5606471147f1 /usr/bin/python(+0x1687f1) [0x5606471147f1]
18 0x560647115492 PyObject_Call + 290
19 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
20 0x5606471069fc _PyFunction_Vectorcall + 124
21 0x5606470fbcbd _PyObject_FastCallDictTstate + 365
22 0x56064711186c _PyObject_Call_Prepend + 92
23 0x56064722c700 /usr/bin/python(+0x280700) [0x56064722c700]
24 0x5606470fca7b _PyObject_MakeTpCall + 603
25 0x5606470f6150 _PyEval_EvalFrameDefault + 30112
26 0x5606471147f1 /usr/bin/python(+0x1687f1) [0x5606471147f1]
27 0x560647115492 PyObject_Call + 290
28 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
29 0x5606471069fc _PyFunction_Vectorcall + 124
30 0x5606470fbcbd _PyObject_FastCallDictTstate + 365
31 0x56064711186c _PyObject_Call_Prepend + 92
32 0x56064722c700 /usr/bin/python(+0x280700) [0x56064722c700]
33 0x56064711542b PyObject_Call + 187
34 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
35 0x5606471147f1 /usr/bin/python(+0x1687f1) [0x5606471147f1]
36 0x5606470f053c _PyEval_EvalFrameDefault + 6540
37 0x5606471147f1 /usr/bin/python(+0x1687f1) [0x5606471147f1]
38 0x560647115492 PyObject_Call + 290
39 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
40 0x5606471147f1 /usr/bin/python(+0x1687f1) [0x5606471147f1]
41 0x560647115492 PyObject_Call + 290
42 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
43 0x5606471069fc _PyFunction_Vectorcall + 124
44 0x5606470fbcbd _PyObject_FastCallDictTstate + 365
45 0x56064711186c _PyObject_Call_Prepend + 92
46 0x56064722c700 /usr/bin/python(+0x280700) [0x56064722c700]
47 0x56064711542b PyObject_Call + 187
48 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
49 0x5606471069fc _PyFunction_Vectorcall + 124
50 0x5606470ef26d _PyEval_EvalFrameDefault + 1725
51 0x5606471069fc _PyFunction_Vectorcall + 124
52 0x560647115492 PyObject_Call + 290
53 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
54 0x5606471069fc _PyFunction_Vectorcall + 124
55 0x560647115492 PyObject_Call + 290
56 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
57 0x5606471069fc _PyFunction_Vectorcall + 124
58 0x560647115492 PyObject_Call + 290
59 0x5606470f15d7 _PyEval_EvalFrameDefault + 10791
60 0x5606471069fc _PyFunction_Vectorcall + 124
61 0x5606470ef26d _PyEval_EvalFrameDefault + 1725
62 0x5606470eb9c6 /usr/bin/python(+0x13f9c6) [0x5606470eb9c6]
63 0x5606471e1256 PyEval_EvalCode + 134
64 0x56064720c108 /usr/bin/python(+0x260108) [0x56064720c108]
65 0x5606472059cb /usr/bin/python(+0x2599cb) [0x5606472059cb]
66 0x56064720be55 /usr/bin/python(+0x25fe55) [0x56064720be55]
67 0x56064720b338 _PyRun_SimpleFileObject + 424
68 0x56064720af83 _PyRun_AnyFileObject + 67
69 0x5606471fda5e Py_RunMain + 702
70 0x5606471d402d Py_BytesMain + 45
71 0x7f2cde0c2d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f2cde0c2d90]
72 0x7f2cde0c2e40 __libc_start_main + 128
73 0x5606471d3f25 _start + 37
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 467, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 367, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 326, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 319, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 657, in build
    model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 601, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 195, in forward
    hidden_states = self.layers.forward(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 281, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 111, in forward
    attention_output = self.attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 770, in forward
    context, past_key_value = gpt_attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/graph_rewriting.py", line 561, in wrapper
    outs = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 3926, in gpt_attention
    layer = default_trtnet().add_plugin_v2(plug_inputs, attn_plug)
TypeError: add_plugin_v2(): incompatible function arguments. The following argument types are supported:

  1. (self: tensorrt.tensorrt.INetworkDefinition, inputs: List[tensorrt.tensorrt.ITensor], plugin: tensorrt.tensorrt.IPluginV2) -> tensorrt.tensorrt.IPluginV2Layer

Invoked with: <tensorrt.tensorrt.INetworkDefinition object at 0x7f2cb9053170>, [<tensorrt.tensorrt.ITensor object at 0x7f2caf4444f0>, <tensorrt.tensorrt.ITensor object at 0x7f2cb8d18870>, <tensorrt.tensorrt.ITensor object at 0x7f2cb8ce7b70>, <tensorrt.tensorrt.ITensor object at 0x7f2cb8d10230>, <tensorrt.tensorrt.ITensor object at 0x7f2caf402af0>, <tensorrt.tensorrt.ITensor object at 0x7f2cb8d05ff0>, <tensorrt.tensorrt.ITensor object at 0x7f2caf402cb0>, <tensorrt.tensorrt.ITensor object at 0x7f2cb8ce5b70>, <tensorrt.tensorrt.ITensor object at 0x7f2cb90716b0>, <tensorrt.tensorrt.ITensor object at 0x7f2cb8d18d70>, <tensorrt.tensorrt.ITensor object at 0x7f2cb8d10930>], None

2024-04-26 08:02:55 | WARNING | root | trtllm-build stderr: None
Traceback (most recent call last):
  File "/usr/local/bin/simuleval", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/simuleval/cli.py", line 49, in main
    system, args = build_system_args()
  File "/usr/local/lib/python3.10/dist-packages/simuleval/utils/agent.py", line 139, in build_system_args
    system_class = get_agent_class(config_dict)
  File "/usr/local/lib/python3.10/dist-packages/simuleval/utils/agent.py", line 47, in get_agent_class
    import_file(agent_file)
  File "/usr/local/lib/python3.10/dist-packages/simuleval/utils/agent.py", line 29, in import_file
    spec.loader.exec_module(agent_modules)
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/workspace/gremlin/evaluation/t2tt_agent_llama3_nv_optimum.py", line 85, in <module>
    pipeline = pipeline(
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/pipelines/__init__.py", line 119, in pipeline
    model = model_factory.from_pretrained(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/hub_mixin.py", line 420, in from_pretrained
    instance = cls._from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/models/auto.py", line 68, in _from_pretrained
    model = model_clazz.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/hub_mixin.py", line 420, in from_pretrained
    instance = cls._from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/hub.py", line 366, in _from_pretrained
    engines_folders, relative_paths_engines_folders = cls.convert_and_build(
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/hub.py", line 290, in convert_and_build
    engine_builder.build(engine_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/builder/local.py", line 141, in build
    raise ValueError(
ValueError: Compilation failed (1), please open up an issue at https://github.com/huggingface/optimum-nvidia