triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

OOM Running Gemma 7B triton runtime engine #456

Closed: workuser12345 closed this issue 4 months ago

workuser12345 commented 4 months ago

Reproduction

  1. I built the TensorRT-LLM backend container by running:
    DOCKER_BUILDKIT=1 TORCH_CUDA_ARCH_LIST= docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
  2. I launched the container with this command:
    sudo docker run -it --net host --shm-size=20g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/keith:/home triton_trt_llm:latest /bin/bash
  3. Inside the container, I followed the guide here up to the trtllm-build step to generate the engine (a rough sketch of that command is included after these steps).
  4. I set up the Triton model repository and launched Triton with:
    python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/home/triton_model_repo
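
For reference, the trtllm-build step from the guide looks roughly like the sketch below. The directories are placeholders (the output is pointed at the engine directory that gpt_model_path references further down), the flags can differ between TensorRT-LLM versions, and the size limits are only inferred from the values the backend reports later (max batch size 8, max attention window 3100):

    trtllm-build --checkpoint_dir /home/gemma_7b_tllm_checkpoint \
        --output_dir /home/triton_model_repo/gemma_7b_trt/1 \
        --gemm_plugin bfloat16 \
        --max_batch_size 8 \
        --max_input_len 3000 \
        --max_output_len 100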

Expected behavior

I expect tritonserver to successfully spin up the gemma 7b model for inference.

Actual behavior

I0514 20:17:41.942323 259 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0514 20:17:41.942336 259 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0514 20:17:42.093866 259 model_lifecycle.cc:469] loading: gemma_7b_trt:1
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 3100
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 16295 MiB
[TensorRT-LLM][INFO] Allocated 7031.25 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 16292 (MiB)
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 25
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 117760. Allocating 54022635520 bytes.
[TensorRT-LLM][WARNING] exclude_input_in_output is not specified, will be set to false
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024050700 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 3100
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 16295 MiB
[TensorRT-LLM][ERROR] 1: [defaultAllocator.cpp::allocate::19] Error Code 1: Cuda Runtime (out of memory)
[TensorRT-LLM][WARNING] Requested amount of GPU memory (17083400192 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[TensorRT-LLM][ERROR] 2: [safeDeserialize.cpp::load::269] Error Code 2: OutOfMemory (no further information)
[keith-a100-dev4:259  :0:263] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid:    263) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000241e94e tensorrt_llm::runtime::TllmRuntime::TllmRuntime()  ???:0
 2 0x000000000262f389 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching()  ???:0
 3 0x00000000025efe90 tensorrt_llm::batch_manager::TrtGptModelFactory::create()  ???:0
 4 0x000000000265615b tensorrt_llm::executor::Executor::Impl::createModel()  ???:0
 5 0x0000000002656d61 tensorrt_llm::executor::Executor::Impl::Impl()  ???:0
 6 0x000000000264d572 tensorrt_llm::executor::Executor::Executor()  ???:0
 7 0x000000000001dee7 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState()  ???:0
 8 0x000000000001e322 triton::backend::inflight_batcher_llm::ModelInstanceState::Create()  ???:0
 9 0x0000000000002a55 TRITONBACKEND_ModelInstanceInitialize()  ???:0
10 0x00000000001af096 triton::core::TritonModelInstance::ConstructAndInitializeInstance()  :0
11 0x00000000001b02d6 triton::core::TritonModelInstance::CreateInstance()  :0
12 0x00000000001928e5 triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*)::{lambda()#1}::operator()()  backend_model.cc:0
13 0x0000000000192f26 std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<triton::core::Status>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*)::{lambda()#1}> >, triton::core::Status> >::_M_invoke()  backend_model.cc:0
14 0x000000000019f81d std::__future_base::_State_baseV2::_M_do_set()  :0
15 0x0000000000099ee8 pthread_mutexattr_setkind_np()  ???:0
16 0x000000000018965b std::__future_base::_Deferred_state<std::thread::_Invoker<std::tuple<triton::core::TritonModel::PrepareInstances(inference::ModelConfig const&, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*, std::vector<std::shared_ptr<triton::core::TritonModelInstance>, std::allocator<std::shared_ptr<triton::core::TritonModelInstance> > >*)::{lambda()#1}> >, triton::core::Status>::_M_complete_async()  backend_model.cc:0
17 0x000000000019a505 triton::core::TritonModel::PrepareInstances()  :0
18 0x000000000019ec3e triton::core::TritonModel::Create()  :0
19 0x0000000000293328 triton::core::ModelLifeCycle::CreateModel()  :0
20 0x0000000000296c0c std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(triton::core::ModelIdentifier const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&)::{lambda()#2}>::_M_invoke()  model_lifecycle.cc:0
21 0x00000000003f29d2 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()  thread_pool.cc:0
22 0x00000000000dc253 std::error_code::default_error_condition()  ???:0
23 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
24 0x0000000000125a04 clone()  ???:0
=================================
[keith-a100-dev4:00259] *** Process received signal ***
[keith-a100-dev4:00259] Signal: Segmentation fault (11)
[keith-a100-dev4:00259] Signal code:  (-6)
[keith-a100-dev4:00259] Failing at address: 0x103
[keith-a100-dev4:00259] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7bc8a4f1e520]
[keith-a100-dev4:00259] [ 1] /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime11TllmRuntimeC2EPKvmRN8nvinfer17ILoggerE+0x1ee)[0x7bab2c4e194e]
[keith-a100-dev4:00259] [ 2] /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatchingC1EiSt10shared_ptrIN8nvinfer17ILoggerEERKNS_7runtime11ModelConfigERKNS6_11WorldConfigERKSt6vectorIhSaIhEEbNS0_15batch_scheduler15SchedulerPolicyERKNS0_25TrtGptModelOptionalParamsE+0x519)[0x7bab2c6f2389]
[keith-a100-dev4:00259] [ 3] /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager18TrtGptModelFactory6createERKSt6vectorIhSaIhEERKNS_7runtime13GptJsonConfigERKNS7_11WorldConfigENS0_15TrtGptModelTypeEiNS0_15batch_scheduler15SchedulerPolicyERKNS0_25TrtGptModelOptionalParamsE+0x3e0)[0x7bab2c6b2e90]
[keith-a100-dev4:00259] [ 4] /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl11createModelERKSt6vectorIhSaIhEERKNS_7runtime13GptJsonConfigERKNS8_11WorldConfigERKNS0_14ExecutorConfigE+0x2eb)[0x7bab2c71915b]
[keith-a100-dev4:00259] [ 5] /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4ImplC2ERKNSt10filesystem4pathENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x811)[0x7bab2c719d61]
[keith-a100-dev4:00259] [ 6] /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8ExecutorC2ERKNSt10filesystem4pathENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x32)[0x7bab2c710572]
[keith-a100-dev4:00259] [ 7] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so(_ZN6triton7backend20inflight_batcher_llm18ModelInstanceStateC2EPNS1_10ModelStateEP27TRITONBACKEND_ModelInstance+0x4b7)[0x7bc4809f7ee7]
[keith-a100-dev4:00259] [ 8] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm_common.so(_ZN6triton7backend20inflight_batcher_llm18ModelInstanceState6CreateEPNS1_10ModelStateEP27TRITONBACKEND_ModelInstancePPS2_+0x42)[0x7bc4809f8322]
[keith-a100-dev4:00259] [ 9] /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(TRITONBACKEND_ModelInstanceInitialize+0x65)[0x7bc8a748ba55]
[keith-a100-dev4:00259] [10] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af096)[0x7bc8a5924096]
[keith-a100-dev4:00259] [11] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02d6)[0x7bc8a59252d6]
[keith-a100-dev4:00259] [12] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928e5)[0x7bc8a59078e5]
[keith-a100-dev4:00259] [13] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f26)[0x7bc8a5907f26]
[keith-a100-dev4:00259] [14] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f81d)[0x7bc8a591481d]
[keith-a100-dev4:00259] [15] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8)[0x7bc8a4f75ee8]
[keith-a100-dev4:00259] [16] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18965b)[0x7bc8a58fe65b]
[keith-a100-dev4:00259] [17] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a505)[0x7bc8a590f505]
[keith-a100-dev4:00259] [18] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec3e)[0x7bc8a5913c3e]
[keith-a100-dev4:00259] [19] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293328)[0x7bc8a5a08328]
[keith-a100-dev4:00259] [20] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296c0c)[0x7bc8a5a0bc0c]
[keith-a100-dev4:00259] [21] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f29d2)[0x7bc8a5b679d2]
[keith-a100-dev4:00259] [22] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7bc8a51e1253]
[keith-a100-dev4:00259] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7bc8a4f70ac3]
[keith-a100-dev4:00259] [24] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7bc8a5001a04]
[keith-a100-dev4:00259] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node keith-a100-dev4 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Additional notes

This is the Gemma 7B config.pbtxt I'm using. I can successfully start the model using KIND_CPU instead of KIND_GPU.

name: "gemma_7b_trt"
backend: "tensorrtllm"
max_batch_size: 10

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/home/triton_model_repo/gemma_7b_trt/1"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}

GPU memory usage is high, but not at the maximum, while tritonserver is running:

| N/A   36C    P0              74W / 300W |  75403MiB / 81920MiB |      0%      Default |
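
Tallying the allocations reported in the log roughly accounts for that number (this is just my back-of-the-envelope reading, not anything the server prints as a summary):

    engine load:          16295 MiB
    execution context:     7031 MiB
    paged KV cache:       51520 MiB  (54022635520 bytes)
    total:               ~74846 MiB  of 81920 MiB

The second engine load then requests another 17083400192 bytes (~16292 MiB), which no longer fits; that is exactly where the out-of-memory error and the segfault appear.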
workuser12345 commented 4 months ago

Seems to be resolved by lowering the value for max_tokens_in_paged_kv_cache in the config.pbtxt.
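
For anyone who runs into the same thing, the change is a parameters block in the model's config.pbtxt along the lines of the sketch below. The value is only an example (roughly max_batch_size times the 3100-token attention window from the log); size it for your own limits. Lowering kv_cache_free_gpu_mem_fraction, which the log shows defaulting to 0.9, should have a similar effect.

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    # example value only; tune so the KV cache fits alongside the ~16 GiB engine
    string_value: "25600"
  }
}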