ogvalt opened this issue 8 months ago
Can you share the corresponding triton server log?
For reference, I was able to do the following locally:
mkdir repro; cd repro
git clone https://github.com/triton-inference-server/server
docker run -it --rm \
--name triton \
--gpus all --network host \
--shm-size=1g --ulimit memlock=-1 \
-v /tmp:/tmp \
-v ${PWD}:/workspace \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-v ${PWD}/models:/root/models \
-w /workspace \
nvcr.io/nvidia/tritonserver:24.01-py3
tritonserver --model-control-mode=explicit --load-model simple --model-repository=server/docs/examples/model_repository --log-verbose=6 --log-error=1
Expected Output:
<SNIP>
I0403 05:01:26.112318 155 server.cc:676]
+--------+---------+--------+
| Model | Version | Status |
+--------+---------+--------+
| simple | 1 | READY |
+--------+---------+--------+
<SNIP>
and then from a separate shell:
curl --request POST http://localhost:8000/v2/repository/index
Expected Output:
[{"name":"densenet_onnx"},{"name":"inception_graphdef"},{"name":"simple","version":"1","state":"READY"},{"name":"simple_dyna_sequence"},{"name":"simple_identity"},{"name":"simple_int8"},{"name":"simple_sequence"},{"name":"simple_string"}]
@nnshah1 Sorry, I was a little in a hurry and missed some key details.
docker run -it --rm \
--name triton \
--gpus all --network host \
--shm-size=1g --ulimit memlock=-1 \
nvcr.io/nvidia/tritonserver:24.01-py3
tritonserver --model-control-mode=explicit --model-repository=/home --log-verbose=6 --log-error=1
Then I'm loading the simple model using the tritonclient python SDK and the functionality found in its tritonclient.http.InferenceServerClient class. I'm referring to the load_model method for loading the simple model and the corresponding get_model_repository_index method for querying the index. The idea is that I'm launching tritonserver without any model at all and then loading and unloading models as I please.
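In short, what I do looks roughly like this (a minimal sketch, not my exact code):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000", verbose=True)

client.load_model("simple")                  # load the model on demand
print(client.get_model_repository_index())   # then query the repository index
client.unload_model("simple")                # and unload it again when done

# (in my real setup the model's config and files are also passed with the
# load request, since the server's model repository starts out empty)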
Can you provide the server logs?
I ran the server without loading the model (but still pointing to the example artifacts):
tritonserver --model-control-mode=explicit --model-repository=server/docs/examples/model_repository --log-verbose=6 --log-error=1
And loaded the example model directly:
import tritonclient
import sys

import tritonclient.http as httpclient

if __name__ == "__main__":
    model_name = "simple"

    try:
        triton_client = httpclient.InferenceServerClient(
            url="localhost:8000", verbose=True
        )
    except Exception as e:
        print("context creation failed: " + str(e))
        sys.exit(1)

    triton_client.load_model("simple")

    triton_client.get_model_repository_index()
And everything worked as expected. Can you check that as a sanity test?
My guess is that there is an error either in the pbtxt-to-JSON conversion or in the way the model bytes are loaded.
If you can share the pbtxt-to-JSON conversion code you are using, we could also see if the exact steps reproduce on our end.
@nnshah1 You are pointing your server to a folder with models already in it. As far as I understand the documentation, index will return the list of all models, loaded or not.
But I expect that if I upload a model to the server via the API, it should show up when I query the index, independently of whether it exists in the folder that --model-repository points to.
Please correct me if my expectation is wrong.
To reproduce my case - you need to point to an empty model repository like I suggested:
tritonserver --model-control-mode=explicit --model-repository=/home --log-verbose=6 --log-error=1
Since I'm running tritonserver in docker, the /home folder is empty in the container.
My use case: I'm starting the triton container on some server with an empty model repository and then gradually uploading or unloading models as my needs change.
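At the HTTP level, my understanding (based on the model repository extension docs) is that such an upload looks roughly like the sketch below; the file names and config content are placeholders, and the exact payload shape should be checked against the spec:

import base64
import pathlib

import requests  # plain HTTP client, just to illustrate the wire format

config_json = pathlib.Path("simple_config.json").read_text()  # model config as a JSON string
model_b64 = base64.b64encode(pathlib.Path("model.graphdef").read_bytes()).decode()

payload = {
    "parameters": {
        "config": config_json,               # JSON model configuration
        "file:1/model.graphdef": model_b64,  # base64-encoded model file for version 1
    }
}
resp = requests.post("http://localhost:8000/v2/repository/models/simple/load", json=payload)
print(resp.status_code)

# Afterwards I query the index and expect "simple" to be listed:
print(requests.post("http://localhost:8000/v2/repository/index").json())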
My code for the pbtxt-to-JSON conversion:
import pathlib

import google.protobuf.message
import google.protobuf.text_format
import google.protobuf.json_format

import tritonclient.grpc as tritongrpcclient


def pbtxt_to_json(filepath: pathlib.Path) -> str:
    with open(filepath, "r") as f:
        json_obj = google.protobuf.json_format.MessageToJson(
            google.protobuf.text_format.Parse(
                f.read(),
                tritongrpcclient.model_config_pb2.ModelConfig()
            )
        )
    return json_obj
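And this is roughly how the converted config gets used on my side (a simplified sketch that reuses the helper above; the paths and model name are placeholders, and whether files= expects raw bytes or base64-encoded content should be double-checked against the tritonclient docs):

import pathlib

import tritonclient.http as httpclient

model_dir = pathlib.Path("simple")  # local directory holding config.pbtxt and 1/model.graphdef

# Convert the pbtxt config to a JSON string and pass the model file inline,
# so nothing has to exist under the server's --model-repository.
config_json = pbtxt_to_json(model_dir / "config.pbtxt")
model_bytes = (model_dir / "1" / "model.graphdef").read_bytes()

client = httpclient.InferenceServerClient(url="localhost:8000")
client.load_model(
    "simple",
    config=config_json,
    files={"file:1/model.graphdef": model_bytes},
)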
@ogvalt I understand your use case. Are there any errors in the server side log when loading the model? Can you confirm that loading the example as above (explicitly from a directory via the client) works as well? I'd like to see at which point things diverge between loading the example model directly from disk and loading it by passing the bits in manually.
@nnshah1 Understood, I'm working on launching your code. Meanwhile, here is the server log you asked for:
I0409 14:33:20.431590 1 cache_manager.cc:480] Create CacheManager with cache_dir: '/opt/tritonserver/caches'
I0409 14:33:20.569581 1 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x71d448000000' with size 268435456
I0409 14:33:20.569750 1 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0409 14:33:20.570429 1 server.cc:606]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0409 14:33:20.570442 1 server.cc:633]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+
I0409 14:33:20.570444 1 model_lifecycle.cc:265] ModelStates()
I0409 14:33:20.570451 1 server.cc:676]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+
I0409 14:33:20.601711 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 2070 with Max-Q Design
I0409 14:33:20.603244 1 metrics.cc:770] Collecting CPU metrics
I0409 14:33:20.603362 1 tritonserver.cc:2498]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.42.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /home |
| model_control_mode | MODE_EXPLICIT |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0409 14:33:20.603750 1 grpc_server.cc:2426]
+----------------------------------------------+---------+
| GRPC KeepAlive Option | Value |
+----------------------------------------------+---------+
| keepalive_time_ms | 7200000 |
| keepalive_timeout_ms | 20000 |
| keepalive_permit_without_calls | 0 |
| http2_max_pings_without_data | 2 |
| http2_min_recv_ping_interval_without_data_ms | 300000 |
| http2_max_ping_strikes | 2 |
+----------------------------------------------+---------+
I0409 14:33:20.604148 1 grpc_server.cc:102] Ready for RPC 'Check', 0
I0409 14:33:20.604164 1 grpc_server.cc:102] Ready for RPC 'ServerLive', 0
I0409 14:33:20.604168 1 grpc_server.cc:102] Ready for RPC 'ServerReady', 0
I0409 14:33:20.604172 1 grpc_server.cc:102] Ready for RPC 'ModelReady', 0
I0409 14:33:20.604176 1 grpc_server.cc:102] Ready for RPC 'ServerMetadata', 0
I0409 14:33:20.604180 1 grpc_server.cc:102] Ready for RPC 'ModelMetadata', 0
I0409 14:33:20.604184 1 grpc_server.cc:102] Ready for RPC 'ModelConfig', 0
I0409 14:33:20.604190 1 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryStatus', 0
I0409 14:33:20.604194 1 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryRegister', 0
I0409 14:33:20.604198 1 grpc_server.cc:102] Ready for RPC 'SystemSharedMemoryUnregister', 0
I0409 14:33:20.604203 1 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryStatus', 0
I0409 14:33:20.604206 1 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryRegister', 0
I0409 14:33:20.604210 1 grpc_server.cc:102] Ready for RPC 'CudaSharedMemoryUnregister', 0
I0409 14:33:20.604215 1 grpc_server.cc:102] Ready for RPC 'RepositoryIndex', 0
I0409 14:33:20.604222 1 grpc_server.cc:102] Ready for RPC 'RepositoryModelLoad', 0
I0409 14:33:20.604225 1 grpc_server.cc:102] Ready for RPC 'RepositoryModelUnload', 0
I0409 14:33:20.604231 1 grpc_server.cc:102] Ready for RPC 'ModelStatistics', 0
I0409 14:33:20.604236 1 grpc_server.cc:102] Ready for RPC 'Trace', 0
I0409 14:33:20.604244 1 grpc_server.cc:102] Ready for RPC 'Logging', 0
I0409 14:33:20.604256 1 grpc_server.cc:359] Thread started for CommonHandler
I0409 14:33:20.604386 1 infer_handler.h:1185] StateNew, 0 Step START
I0409 14:33:20.604400 1 infer_handler.cc:674] New request handler for ModelInferHandler, 0
I0409 14:33:20.604410 1 infer_handler.h:1309] Thread started for ModelInferHandler
I0409 14:33:20.604522 1 infer_handler.h:1185] StateNew, 0 Step START
I0409 14:33:20.604533 1 infer_handler.cc:674] New request handler for ModelInferHandler, 0
I0409 14:33:20.604542 1 infer_handler.h:1309] Thread started for ModelInferHandler
I0409 14:33:20.604606 1 infer_handler.h:1185] StateNew, 0 Step START
I0409 14:33:20.604615 1 stream_infer_handler.cc:128] New request handler for ModelStreamInferHandler, 0
I0409 14:33:20.604624 1 infer_handler.h:1309] Thread started for ModelStreamInferHandler
I0409 14:33:20.604631 1 grpc_server.cc:2519] Started GRPCInferenceService at 0.0.0.0:8001
I0409 14:33:20.604824 1 http_server.cc:4623] Started HTTPService at 0.0.0.0:8000
I0409 14:33:20.645724 1 http_server.cc:315] Started Metrics Service at 0.0.0.0:8002
I0409 14:33:21.357261 1 http_server.cc:4509] HTTP request: 0 /v2/health/ready
I0409 14:33:21.357323 1 model_lifecycle.cc:265] ModelStates()
I0409 14:33:21.373833 1 http_server.cc:4509] HTTP request: 2 /v2/repository/models/simple/load
I0409 14:33:21.378079 1 model_config_utils.cc:680] Server side auto-completed config: name: "simple"
platform: "tensorflow_graphdef"
max_batch_size: 8
input {
name: "INPUT0"
data_type: TYPE_INT32
dims: 16
}
input {
name: "INPUT1"
data_type: TYPE_INT32
dims: 16
}
output {
name: "OUTPUT0"
data_type: TYPE_INT32
dims: 16
}
output {
name: "OUTPUT1"
data_type: TYPE_INT32
dims: 16
}
default_model_filename: "model.graphdef"
backend: "tensorflow"
I0409 14:33:21.378206 1 model_lifecycle.cc:430] AsyncLoad() 'simple'
I0409 14:33:21.378312 1 model_lifecycle.cc:461] loading: simple:1
I0409 14:33:21.378438 1 model_lifecycle.cc:539] CreateModel() 'simple' version 1
I0409 14:33:21.378647 1 backend_model.cc:502] Adding default backend config setting: default-max-batch-size,4
I0409 14:33:21.378692 1 shared_library.cc:112] OpenLibraryHandle: /opt/tritonserver/backends/tensorflow/libtriton_tensorflow.so
W0409 14:33:21.604963 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
2024-04-09 14:33:21.658999: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9360] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-09 14:33:21.659028: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-09 14:33:21.659052: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1537] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
I0409 14:33:21.666817 1 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0409 14:33:21.666835 1 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.17
I0409 14:33:21.666838 1 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.17
I0409 14:33:21.666841 1 tensorflow.cc:2617] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0409 14:33:21.667066 1 tensorflow.cc:2683] TRITONBACKEND_ModelInitialize: simple (version 1)
I0409 14:33:21.667443 1 model_config_utils.cc:1902] ModelConfig 64-bit fields:
I0409 14:33:21.667451 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_priority_level
I0409 14:33:21.667453 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0409 14:33:21.667455 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0409 14:33:21.667457 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_levels
I0409 14:33:21.667459 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::key
I0409 14:33:21.667461 1 model_config_utils.cc:1904] ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0409 14:33:21.667463 1 model_config_utils.cc:1904] ModelConfig::ensemble_scheduling::step::model_version
I0409 14:33:21.667465 1 model_config_utils.cc:1904] ModelConfig::input::dims
I0409 14:33:21.667467 1 model_config_utils.cc:1904] ModelConfig::input::reshape::shape
I0409 14:33:21.667469 1 model_config_utils.cc:1904] ModelConfig::instance_group::secondary_devices::device_id
I0409 14:33:21.667471 1 model_config_utils.cc:1904] ModelConfig::model_warmup::inputs::value::dims
I0409 14:33:21.667473 1 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0409 14:33:21.667474 1 model_config_utils.cc:1904] ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0409 14:33:21.667476 1 model_config_utils.cc:1904] ModelConfig::output::dims
I0409 14:33:21.667478 1 model_config_utils.cc:1904] ModelConfig::output::reshape::shape
I0409 14:33:21.667480 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0409 14:33:21.667482 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0409 14:33:21.667484 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0409 14:33:21.667486 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::dims
I0409 14:33:21.667488 1 model_config_utils.cc:1904] ModelConfig::sequence_batching::state::initial_state::dims
I0409 14:33:21.667491 1 model_config_utils.cc:1904] ModelConfig::version_policy::specific::versions
I0409 14:33:21.667579 1 tensorflow.cc:1833] model configuration:
{
"name": "simple",
"platform": "tensorflow_graphdef",
"backend": "tensorflow",
"runtime": "",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 8,
"input": [
{
"name": "INPUT0",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
16
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
},
{
"name": "INPUT1",
"data_type": "TYPE_INT32",
"format": "FORMAT_NONE",
"dims": [
16
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "OUTPUT0",
"data_type": "TYPE_INT32",
"dims": [
16
],
"label_filename": "",
"is_shape_tensor": false
},
{
"name": "OUTPUT1",
"data_type": "TYPE_INT32",
"dims": [
16
],
"label_filename": "",
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "simple",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "model.graphdef",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": []
}
I0409 14:33:21.670116 1 tensorflow.cc:2732] TRITONBACKEND_ModelInstanceInitialize: simple_0 (GPU device 0)
I0409 14:33:21.670231 1 backend_model_instance.cc:106] Creating instance simple_0 on GPU 0 (7.5) using artifact 'model.graphdef'
2024-04-09 14:33:21.674731: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-09 14:33:21.675352: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.704935: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705094: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705346: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705476: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705599: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-09 14:33:21.705701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1883] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5854 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2070 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 7.5
2024-04-09 14:33:21.721025: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
I0409 14:33:21.721266 1 backend_model_instance.cc:772] Starting backend thread for simple_0 at nice 0 on device 0...
I0409 14:33:21.721356 1 backend_model.cc:674] Created model instance named 'simple_0' with device id '0'
I0409 14:33:21.721379 1 model_lifecycle.cc:684] OnLoadComplete() 'simple' version 1
I0409 14:33:21.721384 1 model_lifecycle.cc:722] OnLoadFinal() 'simple' for all version(s)
I0409 14:33:21.721387 1 model_lifecycle.cc:827] successfully loaded 'simple'
I0409 14:33:21.721404 1 model_lifecycle.cc:286] VersionStates() 'simple'
I0409 14:33:21.721433 1 model_lifecycle.cc:286] VersionStates() 'simple'
I0409 14:33:21.721844 1 http_server.cc:4509] HTTP request: 2 /v2/models/simple/versions/1/infer
I0409 14:33:21.721859 1 model_lifecycle.cc:328] GetModel() 'simple' version 1
I0409 14:33:21.721865 1 model_lifecycle.cc:328] GetModel() 'simple' version 1
I0409 14:33:21.721919 1 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I0409 14:33:21.721928 1 infer_request.cc:893] [request id: <id_unknown>] prepared: [0x0x71d4100100b0] request id: , model: simple, requested version: 1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 8, priority: 0, timeout (us): 0
original inputs:
[0x0x71d410043b38] input: INPUT1, type: INT32, original shape: [8,16], batch + shape: [8,16], shape: [16]
[0x0x71d4100036a8] input: INPUT0, type: INT32, original shape: [8,16], batch + shape: [8,16], shape: [16]
override inputs:
inputs:
[0x0x71d4100036a8] input: INPUT0, type: INT32, original shape: [8,16], batch + shape: [8,16], shape: [16]
[0x0x71d410043b38] input: INPUT1, type: INT32, original shape: [8,16], batch + shape: [8,16], shape: [16]
original requested outputs:
OUTPUT0
OUTPUT1
requested outputs:
OUTPUT0
OUTPUT1
I0409 14:33:21.721940 1 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I0409 14:33:21.721958 1 infer_request.cc:131] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I0409 14:33:21.721980 1 tensorflow.cc:2803] model simple, instance simple_0, executing 1 requests
I0409 14:33:21.721986 1 tensorflow.cc:1971] TRITONBACKEND_ModelExecute: Running simple_0 with 1 requests
I0409 14:33:21.722021 1 tensorflow.cc:2223] TRITONBACKEND_ModelExecute: input 'INPUT0' is GPU tensor: false
I0409 14:33:21.722029 1 tensorflow.cc:2223] TRITONBACKEND_ModelExecute: input 'INPUT1' is GPU tensor: false
I0409 14:33:21.731327 1 infer_response.cc:167] add response output: output: OUTPUT0, type: INT32, shape: [8,16]
I0409 14:33:21.731352 1 http_server.cc:1232] HTTP using buffer for: 'OUTPUT0', size: 512, addr: 0x71d2c4053230
I0409 14:33:21.731361 1 tensorflow.cc:2497] TRITONBACKEND_ModelExecute: output 'OUTPUT0' is GPU tensor: false
I0409 14:33:21.731366 1 infer_response.cc:167] add response output: output: OUTPUT1, type: INT32, shape: [8,16]
I0409 14:33:21.731372 1 http_server.cc:1232] HTTP using buffer for: 'OUTPUT1', size: 512, addr: 0x71d2c4028e90
I0409 14:33:21.731377 1 tensorflow.cc:2497] TRITONBACKEND_ModelExecute: output 'OUTPUT1' is GPU tensor: false
I0409 14:33:21.731413 1 http_server.cc:1306] HTTP release: size 512, addr 0x71d2c4053230
I0409 14:33:21.731419 1 http_server.cc:1306] HTTP release: size 512, addr 0x71d2c4028e90
I0409 14:33:21.731430 1 infer_request.cc:131] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I0409 14:33:21.731444 1 tensorflow.cc:2555] TRITONBACKEND_ModelExecute: model simple_0 released 1 requests
I0409 14:33:21.731924 1 http_server.cc:4509] HTTP request: 0 /v2/models/simple/versions/1/ready
I0409 14:33:21.731942 1 model_lifecycle.cc:328] GetModel() 'simple' version 1
I0409 14:33:21.732167 1 http_server.cc:4509] HTTP request: 2 /v2/repository/index
I0409 14:33:21.732217 1 model_lifecycle.cc:265] ModelStates()
I0409 14:33:21.776010 1 http_server.cc:4509] HTTP request: 0 /v2/health/ready
I0409 14:33:21.776034 1 model_lifecycle.cc:265] ModelStates()
Quick update - I believe I'm able to reproduce what you are describing - will investigate.
@nnshah1 The logs above were obtained by launching everything my way, with an empty repository.
@nnshah1 FYI: I've run your code and got:
POST /v2/repository/models/simple/load, headers {}
{}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '0'}>
Loaded model 'simple'
POST /v2/repository/index, headers {}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/json', 'content-length': '238'}>
bytearray(b'[{"name":"densenet_onnx"},{"name":"inception_graphdef"},{"name":"simple","version":"1","state":"READY"},{"name":"simple_dyna_sequence"},{"name":"simple_identity"},{"name":"simple_int8"},{"name":"simple_sequence"},{"name":"simple_string"}]')
Sanity test - checked
Thanks! Appreciate it. I suspect that since the models get loaded into a temp directory and not /home, there is a difference in how they are listed in the index. Need to investigate whether that is by design or a bug...
Looking forward to an answer too. In any case, it would be great to be able to list any model served under triton.
@ogvalt I've filed an internal ticket to track this - let us know if there is a timeline / priority for this on your end.
It's not urgent, but I hope it won't take months to see a release with this fix.
@ogvalt - we're discussing internally and will get back on ETA.
@ogvalt For a temporary workaround, you can find it here: https://github.com/triton-inference-server/core/pull/340
We still need to finalize the change in behavior, but it's there in case you'd like to see it sooner rather than later.
@nnshah1 thanks for the update.
I was wondering what kind of side effects to expect after a dynamically loaded model is unloaded.
Will some amount of RAM or disk space be left occupied, or will it be completely freed?
It will generally depend on the backend and how it handles things. For the python backend, model instances are in separate processes, so memory would be reclaimed. For in-process backends like tensorflow and pytorch, mileage can vary on how quickly and whether all memory is reclaimed. For tensorflow specifically we have seen memory being held.
Just checking in - how are things going?
@nnshah1 hey, any updates?
Description
I've loaded a model via the v2/repository/models/simple/load endpoint. But when querying the v2/repository/index endpoint I get [] as a response.
Triton Information
What version of Triton are you using? 2.42.0
Are you using the Triton container or did you build it yourself? Triton container, version nvcr.io/nvidia/tritonserver:24.01-py3
To Reproduce
Loaded it with a python script using tritonclient.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well). Model mentioned above.
Expected behavior
I expect that this code will return a response according to this specification:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_model_repository.html