k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

trtllm-build --checkpoint_dir ${checkpoint_dir}/decoder failed #605

Closed: evanxqs closed this issue 1 month ago

evanxqs commented 1 month ago

I followed triton/whisper/README.md to build the TensorRT-LLM engines and hit the issue below. Please help me check it. Thank you!

It looks like the process is killed unexpectedly while converting the encoder checkpoints:

```
Converting encoder checkpoints... Killed
```

The decoder build then fails because the converted checkpoint was never written:

```
FileNotFoundError: [Errno 2] No such file or directory: 'tllm_checkpoint/decoder/config.json'
```
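A bare `Killed` with no Python traceback almost always means the Linux kernel's OOM killer terminated the process. A quick way to confirm this on the host (a sketch using standard Linux tooling; these commands are not from the original report):

```bash
# Look for OOM-killer activity in the kernel log around the time of the failure.
dmesg -T | grep -i -E "out of memory|oom-kill|killed process"

# On hosts with systemd, the kernel log can also be queried with journalctl.
journalctl -k | grep -i oom
```

If `convert_checkpoint.py` shows up there, the build host simply ran out of RAM during conversion.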

Checkpoint conversion:

```
python3 convert_checkpoint.py \
    --output_dir $checkpoint_dir
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling transformers.utils.move_cache().
0it [00:00, ?it/s]
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
0.11.0.dev2024052800
Loaded model from assets/large-v3.pt
Converting encoder checkpoints...
Killed
```

Encoder build:

```
root@ip-172-31-63-1:/workspace/TensorRT-LLM/examples/whisper# trtllm-build --checkpoint_dir ${checkpoint_dir}/encoder \
    --output_dir ${output_dir}/encoder \
    --paged_kv_cache disable \
    --moe_plugin disable \
    --enable_xqa disable \
    --use_custom_all_reduce disable \
    --max_batch_size ${MAX_BATCH_SIZE} \
    --gemm_plugin disable \
    --bert_attention_plugin ${INFERENCE_PRECISION} \
    --remove_input_padding disable
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[06/02/2024-06:13:26] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set gemm_plugin to None.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set nccl_plugin to auto.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set lookup_plugin to None.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set lora_plugin to None.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set moe_plugin to None.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set context_fmha to True.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set paged_kv_cache to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set remove_input_padding to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set multi_block_mode to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set enable_xqa to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set multiple_profiles to False.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set paged_state to True.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set streamingllm to False.
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
[06/02/2024-06:13:26] [TRT-LLM] [W] Cannot find tllm_checkpoint/encoder/rank0.safetensors. Use dummy model weights.
[06/02/2024-06:13:26] [TRT-LLM] [I] Set dtype to float16.
[06/02/2024-06:13:26] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 134, GPU 15276 (MiB)
[06/02/2024-06:13:36] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1772, GPU +314, now: CPU 2042, GPU 15590 (MiB)
[06/02/2024-06:13:36] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/02/2024-06:13:36] [TRT-LLM] [W] allreduce algorithm is selected automatically during execution now. use_custom_all_reduce will be deprecated in future releases.
[06/02/2024-06:13:36] [TRT-LLM] [I] Set nccl_plugin to None.
[06/02/2024-06:13:36] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/SHUFFLE_0_output_0 and WhisperEncoder/conv1/SHUFFLE_1_output_0: first input has type Float but second input has type Half.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/conv1/SHUFFLE_1_output_0 and WhisperEncoder/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/SHUFFLE_2_output_0 and WhisperEncoder/ELEMENTWISE_POW_0_output_0: first input has type Float but second input has type Half.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/conv1/SHUFFLE_1_output_0 and WhisperEncoder/ELEMENTWISE_PROD_1_output_0: first input has type Half but second input has type Float.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/SHUFFLE_3_output_0 and WhisperEncoder/ELEMENTWISE_SUM_0_output_0: first input has type Float but second input has type Half.
[06/02/2024-06:13:36] [TRT] [W] IElementWiseLayer with inputs WhisperEncoder/SHUFFLE_10_output_0 and WhisperEncoder/SHUFFLE_11_output_0: first input has type Float but second input has type Half.
[06/02/2024-06:14:06] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[06/02/2024-06:14:07] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[06/02/2024-06:14:19] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[06/02/2024-06:14:19] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[06/02/2024-06:14:20] [TRT] [I] Total Host Persistent Memory: 21696
[06/02/2024-06:14:20] [TRT] [I] Total Device Persistent Memory: 0
[06/02/2024-06:14:20] [TRT] [I] Total Scratch Memory: 184320000
[06/02/2024-06:14:20] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 167 steps to complete.
[06/02/2024-06:14:20] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 2.57238ms to assign 8 blocks to 167 nodes requiring 399362048 bytes.
[06/02/2024-06:14:20] [TRT] [I] Total Activation Memory: 399360000
[06/02/2024-06:14:20] [TRT] [I] Total Weights Memory: 1274045696
[06/02/2024-06:14:20] [TRT] [I] Engine generation completed in 13.4439 seconds.
[06/02/2024-06:14:20] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 1215 MiB
Killed
```

Decoder build:

```
root@ip-172-31-63-1:/workspace/TensorRT-LLM/examples/whisper# trtllm-build --checkpoint_dir ${checkpoint_dir}/decoder \
    --output_dir ${output_dir}/decoder \
    --paged_kv_cache disable \
    --moe_plugin disable \
    --enable_xqa disable \
    --use_custom_all_reduce disable \
    --max_beam_width ${MAX_BEAM_WIDTH} \
    --max_batch_size ${MAX_BATCH_SIZE} \
    --max_output_len 100 \
    --max_input_len 14 \
    --max_encoder_input_len 1500 \
    --gemm_plugin ${INFERENCE_PRECISION} \
    --bert_attention_plugin ${INFERENCE_PRECISION} \
    --gpt_attention_plugin ${INFERENCE_PRECISION} \
    --remove_input_padding disable
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
[06/02/2024-06:15:48] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set nccl_plugin to auto.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set lookup_plugin to None.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set lora_plugin to None.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set moe_plugin to None.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set context_fmha to True.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set paged_kv_cache to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set remove_input_padding to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set multi_block_mode to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set enable_xqa to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set multiple_profiles to False.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set paged_state to True.
[06/02/2024-06:15:48] [TRT-LLM] [I] Set streamingllm to False.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 499, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 361, in parallel_build
    model_config = PretrainedConfig.from_json_file(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 257, in from_json_file
    with open(config_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'tllm_checkpoint/decoder/config.json'
```
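The decoder failure is a knock-on effect, not a separate bug: `convert_checkpoint.py` was killed before it wrote `tllm_checkpoint/decoder/config.json`, and the `Cannot find tllm_checkpoint/encoder/rank0.safetensors. Use dummy model weights.` warning shows the encoder weights were never written either. A minimal pre-build guard (a sketch that assumes the `tllm_checkpoint` layout seen in this log, and that the decoder uses the same `config.json` plus `rank0.safetensors` naming as the encoder):

```bash
checkpoint_dir=tllm_checkpoint  # as used in the log above

# trtllm-build needs a complete converted checkpoint for each half of the
# model; fail fast with a clear message instead of crashing mid-build.
for part in encoder decoder; do
  for f in config.json rank0.safetensors; do
    if [ ! -f "${checkpoint_dir}/${part}/${f}" ]; then
      echo "Missing ${checkpoint_dir}/${part}/${f}: re-run convert_checkpoint.py" >&2
      exit 1
    fi
  done
done
```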

evanxqs commented 1 month ago

Fixed. It was caused by low memory; it's better to keep at least 10 GB of memory free.
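For anyone else who lands here: both `Killed` messages came from the OOM killer, so the fix is simply more free RAM. A pre-flight check along these lines (a sketch; the 10 GB threshold comes from the fix above, not from a documented requirement) fails fast before the long conversion starts:

```bash
# Require roughly 10 GB of available RAM before launching the conversion.
required_kb=$((10 * 1024 * 1024))
available_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)

if [ "${available_kb}" -lt "${required_kb}" ]; then
  echo "Only $((available_kb / 1024)) MiB available; free memory or add swap first." >&2
  exit 1
fi

python3 convert_checkpoint.py --output_dir "${checkpoint_dir}"  # assumes checkpoint_dir is set as in the README
```

On small instances, adding a swap file (`fallocate`, `mkswap`, `swapon`) is often enough to get the conversion through, at the cost of speed.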