triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Unknown TensorRT-LLM model endpoint when using --model-namespacing=true #7823

Open MatteoPagliani opened 1 day ago

MatteoPagliani commented 1 day ago

Hi,

I am trying to serve two LLMs concurrently with the TensorRT-LLM backend. The folder structure of the two Triton model repositories is as follows:

triton_models/
├── gemma2/
│   ├── preprocessing/
│   ├── postprocessing/
│   ├── tensorrt_llm/
│   └── tensorrt_llm_bls/
└── llama3/
    ├── preprocessing/
    ├── postprocessing/
    ├── tensorrt_llm/
    └── tensorrt_llm_bls/

I am running the following command:

tritonserver --model-repository=path_to_triton_models/gemma2 \
             --model-repository=path_to_triton_models/llama3 \
             --model-namespacing=true

All the models are loaded correctly, as confirmed by the logs.
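
For completeness, the set of loaded models can also be listed through Triton's model repository index extension; a minimal check, using the same default HTTP port as the requests below:

# list every model Triton has discovered across the two repositories
curl -X POST -s localhost:8000/v2/repository/index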

At this point I want to send a query to a model. In a single-model deployment scenario, I would use the following curl command:

curl -X POST \
    -s localhost:8000/v2/models/tensorrt_llm_bls/generate \
    -d '{
        "text_input": "What is machine learning?",
        "max_tokens": 512
    }'
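
As a side note, in this single-model setup I can first check that the model is ready (same default HTTP port; the path below is the standard KServe readiness endpoint):

# returns HTTP 200 once tensorrt_llm_bls is loaded and ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/tensorrt_llm_bls/ready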

However, if I use the same endpoint (localhost:8000/v2/models/tensorrt_llm_bls/generate) in the two-model deployment scenario, I get, as expected, the following error:

{"error":"There are 2 identifiers of model 'tensorrt_llm_bls' in global map, model namespace must be provided to resolve ambiguity."}

The problem is that I don't know how I should change the target endpoint when --model-namespacing is enabled. I have tried many variations, but none of them worked, and there seems to be no documentation about this.

Can you help me out? Thanks in advance. Tagging @rmccorm4 for support.