While following triton/whisper/README.md to build TensorRT-LLM, I ran into the issue below. Please help me check it, thank you!

The convert_checkpoint.py process is killed unexpectedly while converting the encoder checkpoints. It exits with a bare "Killed" and no Python traceback, which usually points to the Linux OOM killer. Since no checkpoint files get written, the subsequent decoder build then fails:
Converting encoder checkpoints...
Killed
FileNotFoundError: [Errno 2] No such file or directory: 'tllm_checkpoint/decoder/config.json'
python3 convert_checkpoint.py \
--output_dir $checkpoint_dir
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling transformers.utils.move_cache().
0it [00:00, ?it/s]
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
0.11.0.dev2024052800
Loaded model from assets/large-v3.pt
Converting encoder checkpoints...
Killed
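A bare "Killed" with no Python traceback is the classic signature of the Linux OOM killer reaping the process: checkpoint conversion for large-v3 loads the full model into host memory. A quick way to check, assuming you have access to kernel logs (inside a container you may need to run dmesg on the host instead):

```shell
# How much memory does this machine actually have available?
grep -E 'MemTotal|MemAvailable|SwapTotal' /proc/meminfo
# Any recent OOM-killer activity? (dmesg may need privileges in a container)
{ dmesg 2>/dev/null || true; } | grep -i 'out of memory' || echo "no OOM events found in dmesg"
```

If MemAvailable is small, adding swap or running the conversion on a machine with more RAM typically avoids the kill.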
root@ip-172-31-63-1:/workspace/TensorRT-LLM/examples/whisper# trtllm-build --checkpoint_dir ${checkpoint_dir}/encoder \
--output_dir ${output_dir}/encoder \
--paged_kv_cache disable \
--moe_plugin disable \
--enable_xqa disable \
--use_custom_all_reduce disable \
--max_batch_size ${MAX_BATCH_SIZE} \
--gemm_plugin disable \
--bert_attention_plugin ${INFERENCE_PRECISION} \
--remove_input_padding disable
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[06/02/2024-06:13:26] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set gemm_plugin to None.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set nccl_plugin to auto.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set lookup_plugin to None.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set lora_plugin to None.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set moe_plugin to None.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set context_fmha to True.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set paged_kv_cache to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set remove_input_padding to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set multi_block_mode to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set enable_xqa to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set multiple_profiles to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set paged_state to True.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set streamingllm to False.
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Cannot find tllm_checkpoint/encoder/rank0.safetensors. Use dummy model weights.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set dtype to float16.
[06/02/2024-06:13:26] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 134, GPU 15276 (MiB)
[06/02/2024-06:13:36] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1772, GPU +314, now: CPU 2042, GPU 15590 (MiB)
[06/02/2024-06:13:36] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/02/2024-06:13:36] [TRT-LLM] [W] allreduce algorithm is selected automatically during execution now. use_custom_all_reduce will be deprecated in future releases.
[06/02/2024-06:13:36] [TRT-LLM] [I] Set nccl_plugin to None.
[06/02/2024-06:13:36] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/SHUFFLE_0_output_0 and WhisperEncoder/conv1/SHUFFLE_1_output_0: first input has type Float but second input has type Half.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/conv1/SHUFFLE_1_output_0 and WhisperEncoder/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/SHUFFLE_2_output_0 and WhisperEncoder/ELEMENTWISE_POW_0_output_0: first input has type Float but second input has type Half.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/conv1/SHUFFLE_1_output_0 and WhisperEncoder/ELEMENTWISE_PROD_1_output_0: first input has type Half but second input has type Float.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/SHUFFLE_3_output_0 and WhisperEncoder/ELEMENTWISE_SUM_0_output_0: first input has type Float but second input has type Half.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/SHUFFLE_10_output_0 and WhisperEncoder/SHUFFLE_11_output_0: first input has type Float but second input has type Half.
[06/02/2024-06:14:06] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[06/02/2024-06:14:07] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[06/02/2024-06:14:19] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[06/02/2024-06:14:19] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[06/02/2024-06:14:20] [TRT] [I] Total Host Persistent Memory: 21696
[06/02/2024-06:14:20] [TRT] [I] Total Device Persistent Memory: 0
[06/02/2024-06:14:20] [TRT] [I] Total Scratch Memory: 184320000
[06/02/2024-06:14:20] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 167 steps to complete.
[06/02/2024-06:14:20] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 2.57238ms to assign 8 blocks to 167 nodes requiring 399362048 bytes.
[06/02/2024-06:14:20] [TRT] [I] Total Activation Memory: 399360000
[06/02/2024-06:14:20] [TRT] [I] Total Weights Memory: 1274045696
[06/02/2024-06:14:20] [TRT] [I] Engine generation completed in 13.4439 seconds.
[06/02/2024-06:14:20] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 1215 MiB
Killed
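Note that engine generation itself completed ("Engine generation completed in 13.4439 seconds") before this second kill, so the process was likely terminated while serializing the engine to disk, again consistent with memory pressure. A quick check of whether anything actually landed on disk (the output_dir value below is an assumption; substitute whatever your script exports):

```shell
output_dir=${output_dir:-whisper_large_v3}   # assumed value; use your own
ls -lh "${output_dir}/encoder" 2>/dev/null || echo "no encoder engine written"
```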
root@ip-172-31-63-1:/workspace/TensorRT-LLM/examples/whisper# trtllm-build --checkpoint_dir ${checkpoint_dir}/decoder \
--output_dir ${output_dir}/decoder \
--paged_kv_cache disable \
--moe_plugin disable \
--enable_xqa disable \
--use_custom_all_reduce disable \
--max_beam_width ${MAX_BEAM_WIDTH} \
--max_batch_size ${MAX_BATCH_SIZE} \
--max_output_len 100 \
--max_input_len 14 \
--max_encoder_input_len 1500 \
--gemm_plugin ${INFERENCE_PRECISION} \
--bert_attention_plugin ${INFERENCE_PRECISION} \
--gpt_attention_plugin ${INFERENCE_PRECISION} \
--remove_input_padding disable
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[06/02/2024-06:15:48] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set nccl_plugin to auto.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set lookup_plugin to None.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set lora_plugin to None.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set moe_plugin to None.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set context_fmha to True.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set paged_kv_cache to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set remove_input_padding to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set multi_block_mode to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set enable_xqa to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set multiple_profiles to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set paged_state to True.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set streamingllm to False.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 499, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 361, in parallel_build
    model_config = PretrainedConfig.from_json_file(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 257, in from_json_file
    with open(config_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'tllm_checkpoint/decoder/config.json'
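This decoder failure follows directly from the earlier kill: convert_checkpoint.py died before writing tllm_checkpoint/decoder/config.json, so trtllm-build has nothing to load. A minimal guard sketch to verify both converted checkpoints exist before building (the directory layout below matches this run's log output, but treat the exact filenames as an assumption):

```python
from pathlib import Path

def checkpoint_ready(checkpoint_dir: str, component: str) -> bool:
    """Return True if convert_checkpoint.py finished writing this component."""
    base = Path(checkpoint_dir) / component
    # The build log shows trtllm-build expects config.json plus per-rank
    # weight shards such as rank0.safetensors in each component directory.
    return (base / "config.json").is_file() and any(base.glob("rank*.safetensors"))

# Example: skip trtllm-build if the conversion step was killed early.
for component in ("encoder", "decoder"):
    if not checkpoint_ready("tllm_checkpoint", component):
        print(f"{component}: checkpoint incomplete, re-run convert_checkpoint.py")
```

Without such a check, trtllm-build silently substitutes dummy weights for the encoder (the "Use dummy model weights" warning above) and hard-fails only on the missing decoder config.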