kelkarn opened this issue 2 months ago
I even tried quantizing the model weights to int4, but I still get this error:
python3 ../llama/convert_checkpoint.py --model_dir ./Mixtral-8x7B-v0.1 \
--output_dir ./mixtral-ckpt-1 \
--dtype float16 \
--pp_size 2 \
--use_weight_only \
--weight_only_precision=int4 \
--workers 2 \
--int8_kv_cache
trtllm-build \
--checkpoint_dir ./mixtral-ckpt-1 \
--output_dir ./mixtral-engine-1 \
--gemm_plugin float16 \
--gpt_attention_plugin float16 \
--remove_input_padding enable \
--max_input_len 32768 \
--max_output_len 1024 \
--workers 2 \
--max_batch_size 1
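As a sanity check on these build settings: with int4 weight-only quantization and pp_size=2, the per-rank weight footprint should land near the engine sizes reported in the log below. A rough back-of-envelope sketch in Python (the ~46.7B parameter count is the commonly cited total for Mixtral-8x7B, an assumption rather than anything from this log):

# Rough per-rank engine size for Mixtral-8x7B with int4 weight-only
# quantization split across 2 pipeline-parallel ranks. The 46.7B total
# parameter count is the published figure for the model (all 8 experts),
# not a value measured here.
total_params = 46.7e9
bytes_per_param = 0.5          # int4 weight-only => 4 bits per weight
pp_size = 2                    # matches --pp_size 2 above

per_rank_gib = total_params * bytes_per_param / pp_size / 2**30
print(f"~{per_rank_gib:.1f} GiB of weights per rank")
# Prints ~10.9 GiB, consistent with the 11334 MiB engines in the log.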
And the error I see is:
root@fea09f8d121f:/tensorrtllm_backend# python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/models --tensorrt_llm_model_name=mixtral
root@fea09f8d121f:/tensorrtllm_backend# I0430 06:54:55.802643 163 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7fc39a000000' with size 268435456
I0430 06:54:55.808485 163 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0430 06:54:55.808494 163 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0430 06:54:55.808623 164 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f8796000000' with size 268435456
I0430 06:54:55.818606 164 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0430 06:54:55.818614 164 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0430 06:54:56.046786 163 model_lifecycle.cc:469] loading: mixtral:1
I0430 06:54:56.046890 164 model_lifecycle.cc:469] loading: mixtral:1
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 33792
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 11334 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 33792
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 11334 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 11391, GPU 12328 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 11392, GPU 12338 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11391, GPU 12328 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 11393, GPU 12338 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +11332, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +11332, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11570, GPU 20578 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 11570, GPU 20586 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11580, GPU 20598 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +12, now: CPU 11581, GPU 20610 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11570, GPU 20578 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 11570, GPU 20586 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 11581, GPU 20598 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +12, now: CPU 11581, GPU 20610 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 11332 (MiB)
[TensorRT-LLM][INFO] Allocate 56358862848 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 1719936 total tokens in paged KV cache, and 264 blocks per sequence
[TensorRT-LLM][INFO] Allocate 56358862848 bytes for k/v cache.
[TensorRT-LLM][INFO] Using 1719936 total tokens in paged KV cache, and 264 blocks per sequence
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 1
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 33792
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 11334 MiB
[TensorRT-LLM][ERROR] 1: [defaultAllocator.cpp::allocate::20] Error Code 1: Cuda Runtime (out of memory)
[TensorRT-LLM][WARNING] Requested amount of GPU memory (11883102464 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[TensorRT-LLM][ERROR] 2: [safeDeserialize.cpp::load::269] Error Code 2: OutOfMemory (no further information)
E0430 06:55:10.730665 163 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1 0x7fc2f82614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fc2f82850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7fc2f82850a0]
3 0x7fc2fa14f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4 0x7fc2fa125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5 0x7fc2fa11ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6 0x7fc3dc1cab62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fc3dc1cab62]
7 0x7fc3dc1cb3f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fc3dc1cb3f2]
8 0x7fc3dc1bdfd5 TRITONBACKEND_ModelInstanceInitialize + 101
9 0x7fc3eff32296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fc3eff32296]
10 0x7fc3eff334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fc3eff334d6]
11 0x7fc3eff16045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fc3eff16045]
12 0x7fc3eff16686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fc3eff16686]
13 0x7fc3eff22efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fc3eff22efd]
14 0x7fc3ef586ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fc3ef586ee8]
15 0x7fc3eff0cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fc3eff0cf0b]
16 0x7fc3eff1dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fc3eff1dc65]
17 0x7fc3eff2231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fc3eff2231e]
18 0x7fc3f00140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fc3f00140c8]
19 0x7fc3f00179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fc3f00179ac]
20 0x7fc3f016b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fc3f016b6c2]
21 0x7fc3ef7f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fc3ef7f2253]
22 0x7fc3ef581ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc3ef581ac3]
23 0x7fc3ef612a04 clone + 68
Is Triton somehow not able to use the full GPU? Since my machine has 2 GPUs with 80 GB of memory each, I would assume that the quantized model (details below) is small enough to fit across the two GPUs:
root@fea09f8d121f:/tensorrtllm_backend# ls -al models/mixtral/1
total 23213444
drwx------ 2 1001 users 4096 Apr 30 06:41 .
drwx------ 3 1001 users 4096 Apr 30 02:44 ..
-rw------- 1 1001 users 3817 Apr 30 06:41 config.json
-rw------- 1 1001 users 11885240924 Apr 30 06:41 rank0.engine
-rw------- 1 1001 users 11885297932 Apr 30 06:42 rank1.engine
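A rough GPU-0 memory budget built from the numbers in the log makes the failure mode clearer (a sketch, not a precise accounting: the 80 GiB total is the A100's per-GPU capacity, and all byte counts are copied verbatim from the log above):

# GPU-0 memory budget assembled from the log above.
GiB = 2**30
total         = 80 * GiB
engine        = 11334 * 2**20   # "Loaded engine size: 11334 MiB"
kv_cache      = 56358862848     # "Allocate 56358862848 bytes for k/v cache."
second_engine = 11883102464     # "Requested amount of GPU memory (11883102464 bytes)"

headroom = total - engine - kv_cache
print(f"headroom after engine + KV cache: {headroom / GiB:.1f} GiB")       # ~16.4 GiB
print(f"second engine load requests:      {second_engine / GiB:.1f} GiB")  # ~11.1 GiB
# The kv_cache_free_gpu_mem_fraction default of 0.9 hands nearly all the
# remaining memory to the paged KV cache, so when the log shows the engine
# being deserialized a second time, CUDA contexts, cuBLAS/cuDNN workspaces
# and activation buffers leave too little memory and the allocation fails.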
And here is the config.json:
root@fea09f8d121f:/tensorrtllm_backend# cat models/mixtral/1/config.json
{
    "version": "0.8.0",
    "pretrained_config": {
        "architecture": "MixtralForCausalLM",
        "dtype": "float16",
        "logits_dtype": "float32",
        "vocab_size": 32000,
        "max_position_embeddings": 32768,
        "hidden_size": 4096,
        "num_hidden_layers": 32,
        "num_attention_heads": 32,
        "num_key_value_heads": 8,
        "head_size": 128,
        "hidden_act": "swiglu",
        "intermediate_size": 14336,
        "norm_epsilon": 1e-05,
        "position_embedding_type": "rope_gpt_neox",
        "use_prompt_tuning": false,
        "use_parallel_embedding": false,
        "embedding_sharding_dim": 0,
        "share_embedding_table": false,
        "mapping": {
            "world_size": 2,
            "tp_size": 1,
            "pp_size": 2
        },
        "kv_dtype": "int8",
        "max_lora_rank": 64,
        "rotary_base": 1000000.0,
        "rotary_scaling": null,
        "moe_num_experts": 8,
        "moe_top_k": 2,
        "moe_tp_mode": 2,
        "moe_normalization_mode": 1,
        "enable_pos_shift": false,
        "dense_context_fmha": false,
        "lora_target_modules": null,
        "hf_modules_to_trtllm_modules": {
            "q_proj": "attn_q",
            "k_proj": "attn_k",
            "v_proj": "attn_v",
            "o_proj": "attn_dense",
            "gate_proj": "mlp_h_to_4h",
            "down_proj": "mlp_4h_to_h",
            "up_proj": "mlp_gate"
        },
        "trtllm_modules_to_hf_modules": {
            "attn_q": "q_proj",
            "attn_k": "k_proj",
            "attn_v": "v_proj",
            "attn_dense": "o_proj",
            "mlp_h_to_4h": "gate_proj",
            "mlp_4h_to_h": "down_proj",
            "mlp_gate": "up_proj"
        },
        "disable_weight_only_quant_plugin": false,
        "mlp_bias": false,
        "attn_bias": false,
        "quantization": {
            "quant_algo": "W4A16",
            "kv_cache_quant_algo": "INT8",
            "group_size": 128,
            "has_zero_point": false,
            "pre_quant_scale": false,
            "exclude_modules": [
                "lm_head",
                "router"
            ],
            "sq_use_plugin": false
        }
    },
    "build_config": {
        "max_input_len": 32768,
        "max_output_len": 1024,
        "max_batch_size": 1,
        "max_beam_width": 1,
        "max_num_tokens": 32768,
        "max_prompt_embedding_table_size": 0,
        "gather_context_logits": false,
        "gather_generation_logits": false,
        "strongly_typed": false,
        "builder_opt": null,
        "profiling_verbosity": "layer_names_only",
        "enable_debug_output": false,
        "plugin_config": {
            "bert_attention_plugin": "float16",
            "gpt_attention_plugin": "float16",
            "gemm_plugin": "float16",
            "smooth_quant_gemm_plugin": null,
            "identity_plugin": null,
            "layernorm_quantization_plugin": null,
            "rmsnorm_quantization_plugin": null,
            "nccl_plugin": "float16",
            "lookup_plugin": null,
            "lora_plugin": null,
            "weight_only_groupwise_quant_matmul_plugin": null,
            "weight_only_quant_matmul_plugin": "float16",
            "quantize_per_token_plugin": false,
            "quantize_tensor_plugin": false,
            "context_fmha": true,
            "context_fmha_fp32_acc": false,
            "paged_kv_cache": true,
            "remove_input_padding": true,
            "use_custom_all_reduce": true,
            "multi_block_mode": false,
            "enable_xqa": true,
            "attention_qk_half_accumulation": false,
            "tokens_per_block": 128,
            "use_paged_context_fmha": false,
            "use_context_fmha_for_generation": false
        }
    }
}
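The 56,358,862,848-byte KV cache allocation in the log can be reconstructed exactly from this config, which shows where most of each GPU's memory goes (a quick check using only values from config.json and the log; each rank holds half the layers because pp_size is 2):

# Reconstruct the logged KV cache allocation from config.json values.
layers_per_rank = 32 // 2   # num_hidden_layers / pp_size
kv_heads        = 8         # num_key_value_heads
head_size       = 128
kv_bytes        = 1         # kv_dtype is int8
tokens          = 1719936   # "Using 1719936 total tokens in paged KV cache"

bytes_per_token = 2 * layers_per_rank * kv_heads * head_size * kv_bytes  # K and V
print(bytes_per_token * tokens)  # 56358862848 -- matches the log exactly (~52.5 GiB)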
I cannot reproduce the issue on the latest main branch. Could you try the latest main branch? I used the same 160 GB memory environment (with 2 GPUs), and the reproduction steps are:
export HF_LLAMA_MODEL=Mixtral-8x7B-v0.1/
export UNIFIED_CKPT_PATH=/tmp/tllm_checkpoint_mixtral_2gpu
export ENGINE_PATH=/tmp/mixtral-engine-1
python3 ./examples/llama/convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
--output_dir ${UNIFIED_CKPT_PATH} \
--dtype float16 \
--tp_size 2
python3 -m tensorrt_llm.commands.build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
--output_dir ${ENGINE_PATH} \
--gemm_plugin float16
cp all_models/inflight_batcher_llm/ mixtral_ifb -r
python3 tools/fill_template.py -i mixtral_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i mixtral_ifb/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=mixtral_ifb/ --log
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
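For reference, the same generate request can also be sent from Python instead of curl (a minimal sketch, assuming the requests package is available and Triton is listening on localhost:8000):

# Python equivalent of the curl request above.
import requests

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": "",
    "pad_id": 2,
    "end_id": 2,
}
resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
resp.raise_for_status()
print(resp.json()["text_output"])  # "text_output" is the ensemble's output field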
System Info
Environment
CPU architecture: x86_64
CPU/Host memory size: 440 GiB
GPU properties
GPU name: A100
GPU memory size: 160 GB
I am using the Azure offering of this GPU: Standard NC48ads A100 v4 (48 vCPUs, 440 GiB memory)
Libraries
TensorRT-LLM branch or tag: v0.8.0
Container used: 24.02-trtllm-python-py3 (following the support matrix)
NVIDIA driver version: 535.161.07
OS: Ubuntu 22.04 (Jammy)
Who can help?
@byshiue @schetlur-nv
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Run the following from the /tensorrtllm_backend volume-mounted folder:
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/models/mixtral56b --tensorrt_llm_model_name=mixtral56b --log
Expected behavior
I expect the Triton server to start successfully, show the Mixtral model in the READY state, and listen on ports 8000 and 8001 for HTTP and gRPC requests respectively.
Actual behavior
I get a CUDA out-of-memory error on the command line, as shown in the log above.
Additional notes
I followed the process documented here (using v0.8.0 of TRT-LLM) for the --tp_size=2 case: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.8.0/examples/mixtral/README.md