I have an RTX 4060 (8 GB) in my laptop, with 16 GB RAM and an Intel i7-12700H CPU. When I run build-llama.sh or build-mistral.sh, the process gets killed automatically with the output below, and I noticed that my CPU hits 100% utilization while either script is running. I'm attaching a screenshot of the same. Kindly help me with this.
(trtllm) vishwajeet@vishwa:~/Desktop/MYGPT/trt-llm-rag-linux$ bash build-mistral.sh
You are using a model of type mistral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
[03/22/2024-20:50:37] [TRT-LLM] [I] Serially build TensorRT engines.
[03/22/2024-20:50:39] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2991, GPU +0, now: CPU 4121, GPU 1039 (MiB)
[03/22/2024-20:50:41] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1798, GPU +314, now: CPU 6055, GPU 1353 (MiB)
[03/22/2024-20:50:41] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[03/22/2024-20:50:41] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[03/22/2024-20:50:41] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 7.1113 (GiB) Device 1.3216 (GiB)
build-mistral.sh: line 1: 6084 Killed python build.py --model_dir './model/mistral/mistral7b_hf' --quant_ckpt_path './model/mistral/mistral7b_int4_quant_weights/mistral_tp1_rank0.npz' --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --output_dir './model/mistral/mistral7b_int4_engine' --world_size 1 --tp_size 1 --parallel_build --max_input_len 1024 --max_batch_size 1 --max_output_len 1024
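I suspect the plain "Killed" message means the kernel OOM killer is terminating the build (the log already shows ~7.1 GiB of host memory allocated before the engine build starts, and I only have 16 GB RAM). As a rough sketch of what I can check on my side: dmesg/journalctl are standard kernel-log commands (not specific to this repo), and CUDA_MODULE_LOADING=LAZY is the documented CUDA setting the lazy-loading warning above refers to, which may reduce memory use.

# Check the kernel log for an OOM-killer entry right after the build dies
sudo dmesg | grep -i -E "killed process|out of memory"
# or, on systemd-based distros:
journalctl -k | grep -i -E "killed process|out of memory"

# Enable CUDA lazy loading (per the TRT warning) before retrying the build
export CUDA_MODULE_LOADING=LAZY
bash build-mistral.sh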