triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

"POST v2/repository/models/${MODEL_NAME}/load" failed on 23.10 #174

Open zengqingfu1442 opened 10 months ago

zengqingfu1442 commented 10 months ago

Description
(screenshot of the error attached as an image in the original issue)

Triton Information
What version of Triton are you using?
Triton image: nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
tritonserver version: 2.39.0

Are you using the Triton container or did you build it yourself?
I'm using the nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 docker image.

To Reproduce

  1. Start the tritonserver container. /triton_model_repo contains only a mymodel folder, and there are 1/model.py and config.pbtxt under /triton_model_repo/mymodel:
    tritonserver --model-repository=/triton_model_repo --model-control-mode=explicit --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix{}_
  2. Use Postman to call the REST API to load mymodel.

mymodel uses the Python backend and is not an ensemble model. The following is the content of /triton_model_repo/mymodel/config.pbtxt:

name: "mymodel"
backend: "python"
max_batch_size: 0

model_transaction_policy {
  decoupled: False
}

input [
  {
    name: "prompt"
    data_type: TYPE_STRING
    dims: [1]
  }
]
output [
  {
    name: "generated_text"
    data_type: TYPE_STRING
    dims: [1]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]

Expected behavior
mymodel can be successfully loaded.

zengqingfu1442 commented 10 months ago

I used curl but got the same error:

curl -X POST http://172.16.11.33:8000/v2/repository/models/mymodel/load -d '{"parameters": {"config": { "name": "mymodel", "backend": "python", "inputs": [{"name": "prompt", "datatype": "TYPE_STRING", "dims": [ 1 ]}], "outputs": [{"name": "generated_text", "datatype": "TYPE_STRING", "dims": [ 1 ]}], "instance_group": [{"count": 1, "kind": "KIND_GPU", "gpus": [ 1 ]}] }}}'

{"error":"attempt to access JSON non-string as string"}
zengqingfu1442 commented 10 months ago

I used the following JSON and successfully loaded the model, but I found that the model was not loaded as the JSON specified: its instance_group.passive is still false, not the same as what I gave in the JSON below.

{
    "name": "mymodel",
    "platform": "",
    "backend": "python",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "prompt",
            "data_type": "TYPE_STRING",
            "format": "FORMAT_NONE",
            "dims": [
                1
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "generated_text",
            "data_type": "TYPE_STRING",
            "dims": [
                1
            ],
            "label_filename": "",
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "mymodel_0",
            "kind": "KIND_GPU",
            "count": 1,
            "gpus": [
                1
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": true,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.py",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {},
    "model_warmup": [],
    "model_transaction_policy": {
        "decoupled": false
    }
}
kthui commented 10 months ago

Hi @zengqingfu1442, I was able to pass model config as JSON via the HTTP load API with "passive": true. I think it could be a format issue on the HTTP payload. Would you be able to use the HTTP client? https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/http/_client.py#L614
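For example, a minimal sketch using the Python HTTP client (the URL, model name, and config values below are just illustrative, so adjust them to your setup):

import json

import tritonclient.http as httpclient

# Model config as a Python dict; the fields mirror the config.pbtxt above.
config = {
    "name": "mymodel",
    "backend": "python",
    "max_batch_size": 0,
    "input": [{"name": "prompt", "data_type": "TYPE_STRING", "dims": [1]}],
    "output": [{"name": "generated_text", "data_type": "TYPE_STRING", "dims": [1]}],
    "instance_group": [{"count": 1, "kind": "KIND_GPU", "gpus": [1], "passive": True}],
}

client = httpclient.InferenceServerClient(url="localhost:8000")
# load_model takes the config as a JSON string, so serialize the dict first.
client.load_model("mymodel", config=json.dumps(config))
print(client.is_model_ready("mymodel"))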

zengqingfu1442 commented 10 months ago

but i found that the model was not loaded

I used curl to call the API. OK, I will try the Triton Python client you provided in this link.

zengqingfu1442 commented 10 months ago

Hi @zengqingfu1442, I was able to pass model config as JSON via the HTTP load API with "passive": true. I think it could be a format issue on the HTTP payload. Would you be able to use the HTTP client? https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/http/_client.py#L614

I tried this way and it works for me! Thanks. It seems that calling the API with the curl command is different from using the Triton client.

zengqingfu1442 commented 10 months ago

@kthui I can use the Triton Python client to successfully load the model, but when I then use curl to send an inference request, the tritonserver process crashes immediately.

I1128 11:15:01.563631 2651 model_lifecycle.cc:818] successfully loaded 'mymodel'
Signal (11) received.
 0# 0x000055B9F1E5F13D in /opt/tritonserver/bin/tritonserver
 1# 0x0000152A331A2520 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# TRITONSERVER_ServerInferAsync in /opt/tritonserver/bin/../lib/libtritonserver.so
 3# 0x000055B9F1FBCFDA in /opt/tritonserver/bin/tritonserver
 4# 0x000055B9F1FBFEAB in /opt/tritonserver/bin/tritonserver
 5# 0x000055B9F2587175 in /opt/tritonserver/bin/tritonserver
 6# 0x000055B9F258B9D5 in /opt/tritonserver/bin/tritonserver
 7# 0x000055B9F2589D8E in /opt/tritonserver/bin/tritonserver
 8# 0x000055B9F2598DF0 in /opt/tritonserver/bin/tritonserver
 9# 0x000055B9F25A1720 in /opt/tritonserver/bin/tritonserver
10# 0x000055B9F25A2197 in /opt/tritonserver/bin/tritonserver
11# 0x000055B9F258DD62 in /opt/tritonserver/bin/tritonserver
12# 0x0000152A331F4AC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
I1128 11:15:21.074413 2886 pb_stub.cc:1815]  Non-graceful termination detected.
I1128 11:15:21.356258 2882 pb_stub.cc:1815]  Non-graceful termination detected.
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 275475e266c6 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
kthui commented 10 months ago

@zengqingfu1442, can you share the full curl command which triggered the crash?

zengqingfu1442 commented 10 months ago

@zengqingfu1442, can you share the full curl command which triggered the crash?

curl -X POST localhost:8000/v2/models/mymodel/generate -d '{"prompt": "<system></system><user>140+10*2等于几?乘法和加法的优先级哪一个更高?</user><assistent>"}'
kthui commented 9 months ago

This is strange, I tried the command and was not able to replicate any issue:

$ curl -X POST localhost:8000/v2/models/string/generate -d '{"INPUT0": "<system></system><user>140+10*2等于几?乘法和加法的优先级哪一个更高?</user><assistent>"}'
{"OUTPUT0":"<system></system><user>140+10*2等于几?乘法和加法的优先级哪一个更高?</user><assistent>","model_name":"string","model_version":"1"}
$

The model I used was a Python string identity model, which is why the output is the same as the input.

Would you be able to share the bytes that were sent to the server that triggered the crash?

kthui commented 9 months ago

If you change the model.py to the Python string identity model, can you still replicate the crash?
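For reference, a minimal sketch of what such a string identity model.py could look like (just an illustration; the tensor names INPUT0 and OUTPUT0 would need to match the model's config.pbtxt):

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Echo the input string tensor back unchanged as the output tensor.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out_tensor = pb_utils.Tensor("OUTPUT0", in_tensor.as_numpy())
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses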

zengqingfu1442 commented 9 months ago

If you change the model.py to the Python string identity model, can you still replicate the crash?

I tried this model (it runs on CPU), and tritonserver didn't crash after the same steps I did before.

zengqingfu1442 commented 9 months ago

This is strange, I tried the command and was not able to replicate any issue:

$ curl -X POST localhost:8000/v2/models/string/generate -d '{"INPUT0": "<system></system><user>140+10*2等于几?乘法和加法的优先级哪一个更高?</user><assistent>"}'
{"OUTPUT0":"<system></system><user>140+10*2等于几?乘法和加法的优先级哪一个更高?</user><assistent>","model_name":"string","model_version":"1"}
$

The model I used was a Python string identity model, which is why the output is the same as the input.

Would you be able to share the bytes that were sent to the server that triggered the crash?

Here is my customized model named mymodel using the Python backend: https://gist.github.com/zengqingfu1442/6613d47cc119029b4d954509aa412171

kthui commented 9 months ago

If you change the model.py to the Python string identity model, can you still replicate the crash?

I tried this model (it runs on CPU), and tritonserver didn't crash after the same steps I did before.

I think the crash happened inside the TRT-LLM backend. You were seeing traces back to Triton because the TRT-LLM backend uses Triton internally, and the Triton that crashed was launched by mpirun; see

mpirun noticed that process rank 0 with PID 0 on node 275475e266c6 exited on signal 11 (Segmentation fault).

in your log.

I will transfer your issue to the TRT-LLM team for them to take a look at it.

zengqingfu1442 commented 9 months ago

If you change the model.py to the Python string identity model, can you still replicate the crash?

I tried this model (it runs on CPU), and tritonserver didn't crash after the same steps I did before.

I think the crash happened inside the TRT-LLM backend. You were seeing traces back to Triton because the TRT-LLM backend uses Triton internally, and the Triton that crashed was launched by mpirun; see

mpirun noticed that process rank 0 with PID 0 on node 275475e266c6 exited on signal 11 (Segmentation fault).

in your log.

I will transfer your issue to the TRT-LLM team for them to take a look at it.

But the customized model mymodel uses the Python backend rather than the TRT-LLM backend, so I am a little confused.

zengqingfu1442 commented 9 months ago

Hi @zengqingfu1442, I was able to pass model config as JSON via the HTTP load API with "passive": true. I think it could be a format issue on the HTTP payload. Would you be able to use the HTTP client? https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/http/_client.py#L614

So if I want to use curl to call the load-model API of tritonserver, how should I write the curl command and format the HTTP payload? Thanks.

zengqingfu1442 commented 9 months ago

Hi @zengqingfu1442, I was able to pass model config as JSON via the HTTP load API with "passive": true. I think it could be a format issue on the HTTP payload. Would you be able to use the HTTP client? https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/http/_client.py#L614

I used the following JSON with curl to call the API, but it failed with the error {"error": "failed to parse the request JSON buffer: Invalid escape character in string. at 42"}:

{
  "parameters": {
        "config": "{
            "name": "mymodel",
            "platform": "",
            "backend": "python",
            "version_policy": {
                "latest": {
                    "num_versions": 1
                }
            },
            "max_batch_size": 0,
            "input": [
                {
                    "name": "prompt",
                    "data_type": "TYPE_STRING",
                    "format": "FORMAT_NONE",
                    "dims": [
                        1
                    ],
                    "is_shape_tensor": false,
                    "allow_ragged_batch": false,
                    "optional": false
                }
            ],
            "output": [
                {
                    "name": "generated_text",
                    "data_type": "TYPE_STRING",
                    "dims": [
                        1
                    ],
                    "label_filename": "",
                    "is_shape_tensor": false
                }
            ],
            "batch_input": [],
            "batch_output": [],
            "optimization": {
                "priority": "PRIORITY_DEFAULT",
                "input_pinned_memory": {
                    "enable": true
                },
                "output_pinned_memory": {
                    "enable": true
                },
                "gather_kernel_buffer_threshold": 0,
                "eager_batching": false
            },
            "instance_group": [
                {
                    "name": "mymodel_0",
                    "kind": "KIND_GPU",
                    "count": 1,
                    "gpus": [
                        1
                    ],
                    "secondary_devices": [],
                    "profile": [],
                    "passive": true,
                    "host_policy": ""
                }
            ],
            "default_model_filename": "model.py",
            "cc_model_filenames": {},
            "metric_tags": {},
            "parameters": {},
            "model_warmup": [],
            "model_transaction_policy": {
                "decoupled": false
            }
        }"
    }
}
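Guessing from the error message, the "config" value likely needs to be a single JSON-encoded string (with the inner quotes escaped) rather than a nested JSON object. A minimal sketch of building and sending such a payload from Python, assuming the configuration above is saved locally as config.json:

import json

import requests

# Read the model configuration (the JSON shown above) from a local file.
with open("config.json") as f:
    config = json.load(f)

# The "config" parameter must itself be a string containing JSON, so it is
# serialized twice: once for the config, once for the surrounding payload.
payload = {"parameters": {"config": json.dumps(config)}}

r = requests.post(
    "http://localhost:8000/v2/repository/models/mymodel/load",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(r.status_code, r.text)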