ogvalt opened this issue 8 months ago
Can you share the corresponding triton server log?
For reference, I was able to do the following locally:
mkdir repro; cd repro
git clone https://github.com/triton-inference-server/server
docker run -it --rm \
--name triton \
--gpus all --network host \
--shm-size=1g --ulimit memlock=-1 \
-v /tmp:/tmp \
-v ${PWD}:/workspace \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-v ${PWD}/models:/root/models \
-w /workspace \
nvcr.io/nvidia/tritonserver:24.01-py3
tritonserver --model-control-mode=explicit --load-model simple --model-repository=server/docs/examples/model_repository --log-verbose=6 --log-error=1
Expected Output:
<SNIP>
I0403 05:01:26.112318 155 server.cc:676]
+--------+---------+--------+
| Model | Version | Status |
+--------+---------+--------+
| simple | 1 | READY |
+--------+---------+--------+
<SNIP>
and then from a separate shell:
curl --request POST http://localhost:8000/v2/repository/index
Expected Output:
[{"name":"densenet_onnx"},{"name":"inception_graphdef"},{"name":"simple","version":"1","state":"READY"},{"name":"simple_dyna_sequence"},{"name":"simple_identity"},{"name":"simple_int8"},{"name":"simple_sequence"},{"name":"simple_string"}]
@nnshah1 Sorry, I was a little in a hurry and missed some key details.
docker run -it --rm \
--name triton \
--gpus all --network host \
--shm-size=1g --ulimit memlock=-1 \
nvcr.io/nvidia/tritonserver:24.01-py3
tritonserver --model-control-mode=explicit --model-repository=/home --log-verbose=6 --log-error=1
Then I'm loading the simple model using the tritonclient python SDK and the functionality found in its tritonclient.http.InferenceServerClient class. I'm referring to the load_model method for loading the simple model and the corresponding get_model_repository_index method for querying the index. The idea is that I'm launching tritonserver without any model at all and then loading and unloading models as I please.
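In short, what I do looks roughly like this (a minimal sketch, not my exact code):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000", verbose=True)

client.load_model("simple")                  # load the model on demand
print(client.get_model_repository_index())   # then query the repository index
client.unload_model("simple")                # and unload it again when done

# (in my real setup the model's config and files are also passed with the
# load request, since the server's model repository starts out empty)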
Can you provide the server logs?
I ran the server without loading the model (but still pointing to the example artifacts):
tritonserver --model-control-mode=explicit --model-repository=server/docs/examples/model_repository --log-verbose=6 --log-error=1
And loaded the example model directly:
import tritonclient
import sys

import tritonclient.http as httpclient

if __name__ == "__main__":
    model_name = "simple"

    try:
        triton_client = httpclient.InferenceServerClient(
            url="localhost:8000", verbose=True
        )
    except Exception as e:
        print("context creation failed: " + str(e))
        sys.exit(1)

    triton_client.load_model("simple")

    triton_client.get_model_repository_index()
And everything worked as expected. Can you check that as a sanity test?
My guess is that there is an error either in the pbtxt-to-JSON conversion or in the way the model bytes are loaded.
If you can share the pbtxt-to-JSON conversion code you are using, we could also see if the exact steps reproduce on our end.
@nnshah1 You are pointing your server to a folder with models already in it. As far as I understand the documentation, index will return the list of all models, loaded or not.
But I expect that if I upload a model to the server via the API, it should show up when I query the index, independently of whether it exists in the folder that --model-repository points to.
Please correct me if my expectation is wrong.
To reproduce my case - you need to point to an empty model repository like I suggested:
tritonserver --model-control-mode=explicit --model-repository=/home --log-verbose=6 --log-error=1
Since I'm running tritonserver in docker, the /home folder is empty in the container.
My use case: I'm starting the triton container on some server with an empty model repository and then gradually uploading or unloading models as my needs change.
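At the HTTP level, my understanding (based on the model repository extension docs) is that such an upload looks roughly like the sketch below; the file names and config content are placeholders, and the exact payload shape should be checked against the spec:

import base64
import pathlib

import requests  # plain HTTP client, just to illustrate the wire format

config_json = pathlib.Path("simple_config.json").read_text()  # model config as a JSON string
model_b64 = base64.b64encode(pathlib.Path("model.graphdef").read_bytes()).decode()

payload = {
    "parameters": {
        "config": config_json,               # JSON model configuration
        "file:1/model.graphdef": model_b64,  # base64-encoded model file for version 1
    }
}
resp = requests.post("http://localhost:8000/v2/repository/models/simple/load", json=payload)
print(resp.status_code)

# Afterwards I query the index and expect "simple" to be listed:
print(requests.post("http://localhost:8000/v2/repository/index").json())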
My code for the pbtxt-to-JSON conversion:
import pathlib

import google.protobuf.message
import google.protobuf.text_format
import google.protobuf.json_format

import tritonclient.grpc as tritongrpcclient


def pbtxt_to_json(filepath: pathlib.Path) -> str:
    with open(filepath, "r") as f:
        json_obj = google.protobuf.json_format.MessageToJson(
            google.protobuf.text_format.Parse(
                f.read(),
                tritongrpcclient.model_config_pb2.ModelConfig()
            )
        )
    return json_obj
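And this is roughly how the converted config gets used on my side (a simplified sketch that reuses the helper above; the paths and model name are placeholders, and whether files= expects raw bytes or base64-encoded content should be double-checked against the tritonclient docs):

import pathlib

import tritonclient.http as httpclient

model_dir = pathlib.Path("simple")  # local directory holding config.pbtxt and 1/model.graphdef

# Convert the pbtxt config to a JSON string and pass the model file inline,
# so nothing has to exist under the server's --model-repository.
config_json = pbtxt_to_json(model_dir / "config.pbtxt")
model_bytes = (model_dir / "1" / "model.graphdef").read_bytes()

client = httpclient.InferenceServerClient(url="localhost:8000")
client.load_model(
    "simple",
    config=config_json,
    files={"file:1/model.graphdef": model_bytes},
)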
@ogvalt I understand your use case. Are there any errors in the server side log when loading the model? Can you confirm that loading the example as above (explicitly from a directory via the client) works as well? I'd like to see at which point things diverge between loading the example model directly from disk and loading it by passing the bits in manually.
@nnshah1 Understood, I'm working on launching your code. Meanwhile, here is the server log you asked for:
I0409 14:33:20.431590 1 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0409 14:33:20.569581 1 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x71d448000000' with size 268435456
I0409 14:33:20.569750 1 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0409 14:33:20.570429 1 server.cc:606]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0409 14:33:20.570442 1 server.cc:633]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+
I0409 14:33:20.570444 1 model_lifecycle.cc:265] ModelStates()
I0409 14:33:20.570451 1 server.cc:676]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+
I0409 14:33:20.601711 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 2070 with Max-Q Design
I0409 14:33:20.603244 1 metrics.cc:770] Collecting CPU metrics
I0409 14:33:20.603362 1 tritonserver.cc:2498]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.42.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /home |
| model_control_mode | MODE_EXPLICIT |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0409 14:33:20.603750 1 grpc_server.cc:2426]
+----------------------------------------------+---------+
| GRPC KeepAlive Option | Value |
+----------------------------------------------+---------+
| keepalive_time_ms | 7200000 |
| keepalive_timeout_ms | 20000 |
| keepalive_permit_without_calls | 0 |
| http2_max_pings_without_data | 2 |
| http2_min_recv_ping_interval_without_data_ms | 300000 |
| http2_max_ping_strikes | 2 |
+----------------------------------------------+---------+
I0409 14:33:20.604148 1 grpc_server.cc:102] Ready for RPC 'Check', 0
I0409 14:33:20.604164 1 grpc_server.cc:102] Ready for RPC 'ServerLive', 0
I0409 14:33:20.604168 1 grpc_server.cc:102] Ready for RPC 'ServerReady', 0
I0409 14:33:20.604172 1 grpc_server.cc:102] Ready for RPC 'ModelReady', 0
I0409 14:33:20.604176 1 grpc_server.cc:102] Ready for RPC 'ServerMetadata', 0
I0409 14:33:20.604180 1 grpc_server.cc:102] Ready for RPC 'ModelMetadata', 0
I0409 14:33:20.604184 1 grpc_server.cc:102] Ready for RPC 'ModelConfig', 0
I0409 14:33:20.604190 1 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryStatus', 0
I0409 14:33:20.604194 1 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryRegister', 0
I0409 14:33:20.604198 1 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryUnregister', 0
I0409 14:33:20.604203 1 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryStatus', 0
I0409 14:33:20.604206 1 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryRegister', 0
I0409 14:33:20.604210 1 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryUnregister', 0
I0409 14:33:20.604215 1 grpc_server.cc:102] Ready for RPC 'RepositoryIndex', 0
I0409 14:33:20.604222 1 grpc_server.cc:102] Ready for RPC 'RepositoryModelLoad', 0
I0409 14:33:20.604225 1 grpc_server.cc:102] Ready for RPC 'RepositoryModelUnload', 0
I0409 14:33:20.604231 1 grpc_server.cc:102] Ready for RPC 'ModelStatistics', 0
I0409 14:33:20.604236 1 grpc_server.cc:102] Ready for RPC 'Trace', 0
I0409 14:33:20.604244 1 grpc_server.cc:102] Ready for RPC 'Logging', 0
I0409 14:33:20.604256 1 grpc_server.cc:359] Thread started for CommonHandler
I0409 14:33:20.604386 1 infer_handler.h:1185] StateNew, 0 Step START
I0409 14:33:20.604400 1 infer_handler.cc:674] New request handler for ModelInferHandler, 0
I0409 14:33:20.604410 1 infer_handler.h:1309] Thread started for ModelInferHandler
I0409 14:33:20.604522 1 infer_handler.h:1185] StateNew, 0 Step START
I0409 14:33:20.604533 1 infer_handler.cc:674] New request handler for ModelInferHandler, 0
I0409 14:33:20.604542 1 infer_handler.h:1309] Thread started for ModelInferHandler
I0409 14:33:20.604606 1 infer_handler.h:1185] StateNew, 0 Step START
I0409 14:33:20.604615 1 stream_infer_handler.cc:128] New request handler for ModelStreamInferHandler, 0
I0409 14:33:20.604624 1 infer_handler.h:1309] Thread started for ModelStreamInferHandler
I0409 14:33:20.604631 1 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8001
I0409 14:33:20.604824 1 http_server.cc:4623] Started HTTPService at 0.0.0.0:8000
I0409 14:33:20.645724 1 http_server.cc:315] Started Metrics Service at 0.0.0.0:8002
I0409 14:33:21.357261 1 http_server.cc:4509] HTTP request: 0 /v2/health/ready
I0409 14:33:21.357323 1 model_lifecycle.cc:265] ModelStates()
I0409 14:33:21.373833 1 http_server.cc:4509] HTTP request: 2 /v2/repository/models/simple/load
I0409 14:33:21.378079 1 model_config_utils.cc:680] Server side auto-completed config: name: "simple"
platform: "tensorflow_graphdef"
max_batch_size: 8
input {
name: "INPUT0"
data_type: TYPE_INT32
dims: 16
}
input {
name: "INPUT1"
data_type: TYPE_INT32
dims: 16
}
output {
name: "OUTPUT0"
data_type: TYPE_INT32
dims: 16
}
output {
name: "OUTPUT1"
data_type: TYPE_INT32
dims: 16
}
default_model_filename: "model.graphdef"
backend: "tensorflow"
I0409 14:33:21.378206 1 model_lifecycle.cc:430] AsyncLoad() 'simple'
I0409 14:33:21.378312 1 model_lifecycle.cc:461] loading: simple:1
I0409 14:33:21.378438 1 model_lifecycle.cc:539] CreateModel() 'simple' version 1
I0409 14:33:21.378647 1 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0409 14:33:21.378692 1 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/tensorflow/libtriton_tensorflow.so
W0409 14:33:21.604963 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
2024-04-09 14:33:21.658999: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9360] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-09 14:33:21.659028: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-09 14:33:21.659052: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1537] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
I0409 14:33:21.666817 1 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0409 14:33:21.666835 1 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.17
I0409 14:33:21.666838 1 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.17
I0409 14:33:21.666841 1 tensorflow.cc:2617] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0409 14:33:21.667066 1 tensorflow.cc:2683] TRITONBACKEND_ModelInitialize: simple (version 1)
I0409 14:33:21.667443 1 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0409 14:33:21.667451 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_priority_level
I0409 14:33:21.667453 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0409 14:33:21.667455 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0409 14:33:21.667457 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_levels
I0409 14:33:21.667459 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::key
I0409 14:33:21.667461 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0409 14:33:21.667463 1 model_config_utils.cc:1904] ModelConfig::ensemble_scheduling::step::model_version
I0409 14:33:21.667465 1 model_config_utils.cc:1904] ModelConfig::input::dims
I0409 14:33:21.667467 1 model_config_utils.cc:1904] ModelConfig::input::reshape::shape
I0409 14:33:21.667469 1 model_config_utils.cc:1904] ModelConfig::instance_group::secondary_devices::device_id
I0409 14:33:21.667471 1 model_config_utils.cc:1904] ModelConfig::model_warmup::inputs::value::dims
I0409 14:33:21.667473 1 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0409 14:33:21.667474 1 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0409 14:33:21.667476 1 model_config_utils.cc:1904] ModelConfig::output::dims
I0409 14:33:21.667478 1 model_config_utils.cc:1904] ModelConfig::output::reshape::shape
I0409 14:33:21.667480 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0409 14:33:21.667482 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0409 14:33:21.667484 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0409 14:33:21.667486 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::dims
I0409 14:33:21.667488 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::initial_state::dims
I0409 14:33:21.667491 1 model_config_utils.cc:1904] ModelConfig::version_policy::specific::versions
I0409 14:33:21.667579 1 tensorflow.cc:1833] model configuration:
{
"name": "simple",
"platform": "tensorflow_graphdef",
"backend": "tensorflow",
"runtime": "",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 8,
"input": [
{
"name": "INPUT0",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
16
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "INPUT1",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
16
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_INT32",
"dims": [
16
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "OUTPUT1",
"data_type": "TYPE_INT32",
"dims": [
16
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "simple",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "model.graphdef",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": []
}
I0409 14:33:21.670116 1 tensorflow.cc:2732] TRITONBACKEND_ModelInstanceInitialize: simple_0 (GPU device 0)
I0409 14:33:21.670231 1 backend_model_instance.cc:106] Creating instance simple_0 on GPU 0 (7.5) using artifact 'model.graphdef'
2024-04-09 14:33:21.674731: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-09 14:33:21.675352: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.704935: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705094: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705346: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705476: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705599: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1883] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5854 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5
2024-04-09 14:33:21.721025: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
I0409 14:33:21.721266 1 backend_model_instance.cc:772] Starting backend thread for simple_0 at nice 0 on device 0...
I0409 14:33:21.721356 1 backend_model.cc:674] Created model instance named 'simple_0' with device id '0'
I0409 14:33:21.721379 1 model_lifecycle.cc:684] OnLoadComplete() 'simple' version 1
I0409 14:33:21.721384 1 model_lifecycle.cc:722] OnLoadFinal() 'simple' for all version(s)
I0409 14:33:21.721387 1 model_lifecycle.cc:827] successfully loaded 'simple'
I0409 14:33:21.721404 1 model_lifecycle.cc:286] VersionStates() 'simple'
I0409 14:33:21.721433 1 model_lifecycle.cc:286] VersionStates() 'simple'
I0409 14:33:21.721844 1 http_server.cc:4509] HTTP request: 2 /v2/models/simple/versions/1/infer
I0409 14:33:21.721859 1 model_lifecycle.cc:328] GetModel() 'simple' version 1
I0409 14:33:21.721865 1 model_lifecycle.cc:328] GetModel() 'simple' version 1
I0409 14:33:21.721919 1 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I0409 14:33:21.721928 1 infer_request.cc:893] [request id: <id_unknown>] prepared: [0x0x71d4100100b0] request id: , model: simple, requested version: 1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 0
original inputs:
[0x0x71d410043b38] input: INPUT1, type: INT32, original shape: [8,16], batch + shape: [8,16], shape: [16]
[0x0x71d4100036a8] input: INPUT0, type: INT32, original shape: [8,16], batch + shape: [8,16], shape: [16]
override inputs:
inputs:
[0x0x71d4100036a8] input: INPUT0, type: INT32, original shape: [8,16], batch + shape: [8,16], shape: [16]
[0x0x71d410043b38] input: INPUT1, type: INT32, original shape: [8,16], batch + shape: [8,16], shape: [16]
original requested outputs:
OUTPUT0
OUTPUT1
requested outputs:
OUTPUT0
OUTPUT1
I0409 14:33:21.721940 1 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I0409 14:33:21.721958 1 infer_request.cc:131] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I0409 14:33:21.721980 1 tensorflow.cc:2803] model simple, instance simple_0, executing 1 requests
I0409 14:33:21.721986 1 tensorflow.cc:1971] TRITONBACKEND_ModelExecute: Running simple_0 with 1 requests
I0409 14:33:21.722021 1 tensorflow.cc:2223] TRITONBACKEND_ModelExecute: input 'INPUT0' is GPU tensor: false
I0409 14:33:21.722029 1 tensorflow.cc:2223] TRITONBACKEND_ModelExecute: input 'INPUT1' is GPU tensor: false
I0409 14:33:21.731327 1 infer_response.cc:167] add response output: output: OUTPUT0, type: INT32, shape: [8,16]
I0409 14:33:21.731352 1 http_server.cc:1232] HTTP using buffer for: 'OUTPUT0', size: 512, addr: 0x71d2c4053230
I0409 14:33:21.731361 1 tensorflow.cc:2497] TRITONBACKEND_ModelExecute: output 'OUTPUT0' is GPU tensor: false
I0409 14:33:21.731366 1 infer_response.cc:167] add response output: output: OUTPUT1, type: INT32, shape: [8,16]
I0409 14:33:21.731372 1 http_server.cc:1232] HTTP using buffer for: 'OUTPUT1', size: 512, addr: 0x71d2c4028e90
I0409 14:33:21.731377 1 tensorflow.cc:2497] TRITONBACKEND_ModelExecute: output 'OUTPUT1' is GPU tensor: false
I0409 14:33:21.731413 1 http_server.cc:1306] HTTP release: size 512, addr 0x71d2c4053230
I0409 14:33:21.731419 1 http_server.cc:1306] HTTP release: size 512, addr 0x71d2c4028e90
I0409 14:33:21.731430 1 infer_request.cc:131] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I0409 14:33:21.731444 1 tensorflow.cc:2555] TRITONBACKEND_ModelExecute: model simple_0 released 1 requests
I0409 14:33:21.731924 1 http_server.cc:4509] HTTP request: 0 /v2/models/simple/versions/1/ready
I0409 14:33:21.731942 1 model_lifecycle.cc:328] GetModel() 'simple' version 1
I0409 14:33:21.732167 1 http_server.cc:4509] HTTP request: 2 /v2/repository/index
I0409 14:33:21.732217 1 model_lifecycle.cc:265] ModelStates()
I0409 14:33:21.776010 1 http_server.cc:4509] HTTP request: 0 /v2/health/ready
I0409 14:33:21.776034 1 model_lifecycle.cc:265] ModelStates()
Quick update - I believe I'm able to reproduce what you are describing - will investigate.
@nnshah1 The logs above were obtained by launching everything my way, with an empty repository.
@nnshah1 FYI: I've run your code and got:
POST /v2/repository/models/simple/load, headers {}
{}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 'simple'
POST /v2/repository/index, headers {}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '238'}>
bytearray(b'[{"name":"densenet_onnx"},{"name":"inception_graphdef"},{"name":"simple","version":"1","state":"READY"},{"name":"simple_dyna_sequence"},{"name":"simple_identity"},{"name":"simple_int8"},{"name":"simple_sequence"},{"name":"simple_string"}]')
Sanity test - checked
Thanks! Appreciate it. I suspect that since the models get loaded into a temp directory and not /home, there is a difference in how they are listed in the index. Need to investigate whether that is by design or a bug...
Looking forward to an answer too. In any case, it would be great to be able to list any model served under triton.
@ogvalt I've filed an internal ticket to track this - let us know if there is a timeline / priority for this on your end.
It's not urgent, but I hope it won't take months to see a release with this fix.
@ogvalt - we're discussing internally and will get back on ETA.
@ogvalt For a temporary workaround, you can find it here: https://github.com/triton-inference-server/core/pull/340
We still need to finalize the change in behavior, but it's there in case you'd like to see it sooner rather than later.
@nnshah1 thanks for the update.
I was wondering what kind of side effects to expect after a dynamically loaded model is unloaded.
Will some amount of RAM or disk space be left occupied, or will it be completely freed?
It will generally depend on the backend and how it handles things. For the python backend, model instances are in separate processes, so memory would be reclaimed. For in-process backends like tensorflow and pytorch, mileage can vary on how quickly and whether all memory is reclaimed. For tensorflow specifically we have seen memory being held.
Just checking in - how are things going?
@nnshah1 hey, any updates?
Description
I've loaded a model via the v2/repository/models/simple/load endpoint. But when querying the v2/repository/index endpoint I get [] as a response.
Triton Information
What version of Triton are you using? 2.42.0
Are you using the Triton container or did you build it yourself? Triton container, version nvcr.io/nvidia/tritonserver:24.01-py3
To Reproduce
Loaded it with a python script using tritonclient.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well). Model mentioned above.
Expected behavior
I expect that this code will return a response according to this specification:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_model_repository.html