kelkarn opened this issue 2 months ago
I even tried quantizing the model weights to int4, but I still get this error:
python3 ../llama/convert_checkpoint.py --model_dir ./Mixtral-8x7B-v0.1 \
--output_dir ./mixtral-ckpt-1 \
--dtype float16 \
--pp_size 2 \
--use_weight_only \
--weight_only_precision=int4 \
--workers 2 \
--int8_kv_cache
trtllm-build \
--checkpoint_dir ./mixtral-ckpt-1 \
--output_dir ./mixtral-engine-1 \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--remove_input_padding enable \
--max_input_len 32768 \
--max_output_len 1024 \
--workers 2 \
--max_batch_size 1
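As a sanity check on these build settings: with int4 weight-only quantization and pp_size=2, the per-rank weight footprint should land near the engine sizes reported in the log below. A rough back-of-envelope sketch in Python (the ~46.7B parameter count is the commonly cited total for Mixtral-8x7B, an assumption rather than anything from this log):

# Rough per-rank engine size for Mixtral-8x7B with int4 weight-only
# quantization split across 2 pipeline-parallel ranks. The 46.7B total
# parameter count is the published figure for the model (all 8 experts),
# not a value measured here.
total_params = 46.7e9
bytes_per_param = 0.5          # int4 weight-only => 4 bits per weight
pp_size = 2                    # matches --pp_size 2 above

per_rank_gib = total_params * bytes_per_param / pp_size / 2**30
print(f"~{per_rank_gib:.1f} GiB of weights per rank")
# Prints ~10.9 GiB, consistent with the 11334 MiB engines in the log.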
And the error I see is:
root@fea09f8d121f:/tensorrtllm_backend# python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/models --tensorrt_llm_model_name=mixtral
root@fea09f8d121f:/tensorrtllm_backend# I0430 06:54:55.802643 163 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fc39a000000' with size 268435456
I0430 06:54:55.808485 163 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0430 06:54:55.808494 163 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0430 06:54:55.808623 164 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f8796000000' with size 268435456
I0430 06:54:55.818606 164 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0430 06:54:55.818614 164 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0430 06:54:56.046786 163 model_lifecycle.cc:469] loading: mixtral:1
I0430 06:54:56.046890 164 model_lifecycle.cc:469] loading: mixtral:1
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 33792
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 11334 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 33792
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 11334 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 11391, GPU 12328 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 11392, GPU 12338 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11391, GPU 12328 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 11393, GPU 12338 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +11332, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +11332, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11570, GPU 20578 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 11570, GPU 20586 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11580, GPU 20598 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +12, now: CPU 11581, GPU 20610 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11570, GPU 20578 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 11570, GPU 20586 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11581, GPU 20598 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +12, now: CPU 11581, GPU 20610 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] Allocate 56358862848 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 1719936 total tokens in paged KV cache, and 264 blocks per sequence
[TensorRT-LLM][INFO] Allocate 56358862848 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 1719936 total tokens in paged KV cache, and 264 blocks per sequence
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 33792
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 11334 MiB
[TensorRT-LLM][ERROR] 1: [defaultAllocator.cpp::allocate::20] Error Code 1: Cuda Runtime (out of memory)
[TensorRT-LLM][WARNING] Requested amount of GPU memory (11883102464 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[TensorRT-LLM][ERROR] 2: [safeDeserialize.cpp::load::269] Error Code 2: OutOfMemory (no further information)
E0430 06:55:10.730665 163 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1 0x7fc2f82614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fc2f82850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fc2f82850a0]
3 0x7fc2fa14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4 0x7fc2fa125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5 0x7fc2fa11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6 0x7fc3dc1cab62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fc3dc1cab62]
7 0x7fc3dc1cb3f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fc3dc1cb3f2]
8 0x7fc3dc1bdfd5 TRITONBACKEND_ModelInstanceInitialize + 101
9 0x7fc3eff32296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fc3eff32296]
10 0x7fc3eff334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fc3eff334d6]
11 0x7fc3eff16045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fc3eff16045]
12 0x7fc3eff16686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fc3eff16686]
13 0x7fc3eff22efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fc3eff22efd]
14 0x7fc3ef586ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fc3ef586ee8]
15 0x7fc3eff0cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fc3eff0cf0b]
16 0x7fc3eff1dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fc3eff1dc65]
17 0x7fc3eff2231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fc3eff2231e]
18 0x7fc3f00140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fc3f00140c8]
19 0x7fc3f00179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fc3f00179ac]
20 0x7fc3f016b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fc3f016b6c2]
21 0x7fc3ef7f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc3ef7f2253]
22 0x7fc3ef581ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc3ef581ac3]
23 0x7fc3ef612a04 clone + 68
Is Triton somehow not able to use the full GPU? Since my machine has 2 GPUs with 80 GB of memory each, I would assume that the quantized model (details below) is small enough to fit across the two GPUs:
root@fea09f8d121f:/tensorrtllm_backend# ls -al models/mixtral/1
total 23213444
drwx------ 2 1001 users 4096 Apr 30 06:41 .
drwx------ 3 1001 users 4096 Apr 30 02:44 ..
-rw------- 1 1001 users 3817 Apr 30 06:41 config.json
-rw------- 1 1001 users 11885240924 Apr 30 06:41 rank0.engine
-rw------- 1 1001 users 11885297932 Apr 30 06:42 rank1.engine
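A rough GPU-0 memory budget built from the numbers in the log makes the failure mode clearer (a sketch, not a precise accounting: the 80 GiB total is the A100's per-GPU capacity, and all byte counts are copied verbatim from the log above):

# GPU-0 memory budget assembled from the log above.
GiB = 2**30
total         = 80 * GiB
engine        = 11334 * 2**20   # "Loaded engine size: 11334 MiB"
kv_cache      = 56358862848     # "Allocate 56358862848 bytes for k/v cache."
second_engine = 11883102464     # "Requested amount of GPU memory (11883102464 bytes)"

headroom = total - engine - kv_cache
print(f"headroom after engine + KV cache: {headroom / GiB:.1f} GiB")       # ~16.4 GiB
print(f"second engine load requests:      {second_engine / GiB:.1f} GiB")  # ~11.1 GiB
# The kv_cache_free_gpu_mem_fraction default of 0.9 hands nearly all the
# remaining memory to the paged KV cache, so when the log shows the engine
# being deserialized a second time, CUDA contexts, cuBLAS/cuDNN workspaces
# and activation buffers leave too little memory and the allocation fails.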
And here is the config.json:
root@fea09f8d121f:/tensorrtllm_backend# cat models/mixtral/1/config.json
{
    "version": "0.8.0",
    "pretrained_config": {
        "architecture": "MixtralForCausalLM",
        "dtype": "float16",
        "logits_dtype": "float32",
        "vocab_size": 32000,
        "max_position_embeddings": 32768,
        "hidden_size": 4096,
        "num_hidden_layers": 32,
        "num_attention_heads": 32,
        "num_key_value_heads": 8,
        "head_size": 128,
        "hidden_act": "swiglu",
        "intermediate_size": 14336,
        "norm_epsilon": 1e-05,
        "position_embedding_type": "rope_gpt_neox",
        "use_prompt_tuning": false,
        "use_parallel_embedding": false,
        "embedding_sharding_dim": 0,
        "share_embedding_table": false,
        "mapping": {
            "world_size": 2,
            "tp_size": 1,
            "pp_size": 2
        },
        "kv_dtype": "int8",
        "max_lora_rank": 64,
        "rotary_base": 1000000.0,
        "rotary_scaling": null,
        "moe_num_experts": 8,
        "moe_top_k": 2,
        "moe_tp_mode": 2,
        "moe_normalization_mode": 1,
        "enable_pos_shift": false,
        "dense_context_fmha": false,
        "lora_target_modules": null,
        "hf_modules_to_trtllm_modules": {
            "q_proj": "attn_q",
            "k_proj": "attn_k",
            "v_proj": "attn_v",
            "o_proj": "attn_dense",
            "gate_proj": "mlp_h_to_4h",
            "down_proj": "mlp_4h_to_h",
            "up_proj": "mlp_gate"
        },
        "trtllm_modules_to_hf_modules": {
            "attn_q": "q_proj",
            "attn_k": "k_proj",
            "attn_v": "v_proj",
            "attn_dense": "o_proj",
            "mlp_h_to_4h": "gate_proj",
            "mlp_4h_to_h": "down_proj",
            "mlp_gate": "up_proj"
        },
        "disable_weight_only_quant_plugin": false,
        "mlp_bias": false,
        "attn_bias": false,
        "quantization": {
            "quant_algo": "W4A16",
            "kv_cache_quant_algo": "INT8",
            "group_size": 128,
            "has_zero_point": false,
            "pre_quant_scale": false,
            "exclude_modules": [
                "lm_head",
                "router"
            ],
            "sq_use_plugin": false
        }
    },
    "build_config": {
        "max_input_len": 32768,
        "max_output_len": 1024,
        "max_batch_size": 1,
        "max_beam_width": 1,
        "max_num_tokens": 32768,
        "max_prompt_embedding_table_size": 0,
        "gather_context_logits": false,
        "gather_generation_logits": false,
        "strongly_typed": false,
        "builder_opt": null,
        "profiling_verbosity": "layer_names_only",
        "enable_debug_output": false,
        "plugin_config": {
            "bert_attention_plugin": "float16",
            "gpt_attention_plugin": "float16",
            "gemm_plugin": "float16",
            "smooth_quant_gemm_plugin": null,
            "identity_plugin": null,
            "layernorm_quantization_plugin": null,
            "rmsnorm_quantization_plugin": null,
            "nccl_plugin": "float16",
            "lookup_plugin": null,
            "lora_plugin": null,
            "weight_only_groupwise_quant_matmul_plugin": null,
            "weight_only_quant_matmul_plugin": "float16",
            "quantize_per_token_plugin": false,
            "quantize_tensor_plugin": false,
            "context_fmha": true,
            "context_fmha_fp32_acc": false,
            "paged_kv_cache": true,
            "remove_input_padding": true,
            "use_custom_all_reduce": true,
            "multi_block_mode": false,
            "enable_xqa": true,
            "attention_qk_half_accumulation": false,
            "tokens_per_block": 128,
            "use_paged_context_fmha": false,
            "use_context_fmha_for_generation": false
        }
    }
}
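The 56,358,862,848-byte KV cache allocation in the log can be reconstructed exactly from this config, which shows where most of each GPU's memory goes (a quick check using only values from config.json and the log; each rank holds half the layers because pp_size is 2):

# Reconstruct the logged KV cache allocation from config.json values.
layers_per_rank = 32 // 2   # num_hidden_layers / pp_size
kv_heads        = 8         # num_key_value_heads
head_size       = 128
kv_bytes        = 1         # kv_dtype is int8
tokens          = 1719936   # "Using 1719936 total tokens in paged KV cache"

bytes_per_token = 2 * layers_per_rank * kv_heads * head_size * kv_bytes  # K and V
print(bytes_per_token * tokens)  # 56358862848 -- matches the log exactly (~52.5 GiB)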
I cannot reproduce the issue on the latest main branch. Could you try the latest main branch? I used the same 160 GB memory environment (with 2 GPUs), and the reproduction steps are:
export HF_LLAMA_MODEL=Mixtral-8x7B-v0.1/
export UNIFIED_CKPT_PATH=/tmp/tllm_checkpoint_mixtral_2gpu
export ENGINE_PATH=/tmp/mixtral-engine-1
python3 ./examples/llama/convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
--output_dir ${UNIFIED_CKPT_PATH} \
--dtype float16 \
--tp_size 2
python3 -m tensorrt_llm.commands.build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
--output_dir ${ENGINE_PATH} \
--gemm_plugin float16
cp all_models/inflight_batcher_llm/ mixtral_ifb -r
python3 tools/fill_template.py -i mixtral_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i mixtral_ifb/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=mixtral_ifb/ --log
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
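For reference, the same generate request can also be sent from Python instead of curl (a minimal sketch, assuming the requests package is available and Triton is listening on localhost:8000):

# Python equivalent of the curl request above.
import requests

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": "",
    "pad_id": 2,
    "end_id": 2,
}
resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
resp.raise_for_status()
print(resp.json()["text_output"])  # "text_output" is the ensemble's output field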
System Info
Environment
CPU architecture: x86_64
CPU/Host memory size: 440 GiB
GPU properties
GPU name: A100
GPU memory size: 160 GB
I am using the Azure offering of this GPU: Standard NC48ads A100 v4 (48 vCPUs, 440 GiB memory)
Libraries
TensorRT-LLM branch or tag: v0.8.0
Container used: 24.02-trtllm-python-py3 (following the support matrix)
NVIDIA driver version: 535.161.07
OS: Ubuntu 22.04 (Jammy)
Who can help?
@byshiue @schetlur-nv
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Run the following from the /tensorrtllm_backend volume-mounted folder:
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/models/mixtral56b --tensorrt_llm_model_name=mixtral56b --log
Expected behavior
I expect the Triton server to start successfully, show the Mixtral model in the READY state, and listen on ports 8000 and 8001 for HTTP and gRPC requests respectively.
Actual behavior
I get a CUDA out-of-memory error on the command line, as shown in the log above.
Additional notes
I followed the process documented here (using v0.8.0 of TRT-LLM) for the --tp_size=2 case: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.8.0/examples/mixtral/README.md