triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Metrics Port Not Opening with Triton Inference Server's In-Process Python API #7197

Open yucai opened 2 months ago

yucai commented 2 months ago

Description

We are encountering an issue with the Triton Inference Server's in-process Python API where the metrics port (default: 8002) does not open. This results in a 'connection refused' error when attempting to access localhost:8002/metrics. We would appreciate guidance on how to properly enable the metrics port using the in-process Python API.

Triton Version

2.42.0

Steps to reproduce the behavior

  1. Initialize the Triton Inference Server using the in-process Python API with the following code snippet:

     ```python
     import tritonserver

     # Initialize and start the Triton server
     self._triton_server = tritonserver.Server(
         model_repository=model_repository,
         model_control_mode=tritonserver.ModelControlMode.EXPLICIT,
     )
     self._triton_server.start(wait_until_ready=True)
     ```

  2. Attempt to access the metrics endpoint at localhost:8002/metrics (a minimal probe is sketched after this list).
  3. Observe the 'connection refused' error.
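For reference, this is a minimal probe for step 2; it assumes the server from step 1 is running in the same environment and that metrics are expected on the default port 8002:

```python
# Probe the default metrics endpoint (step 2 above).
import urllib.error
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8002/metrics", timeout=5) as resp:
        print(resp.status, resp.read()[:200])
except urllib.error.URLError as exc:
    # With the in-process Python API this currently fails with
    # "Connection refused" (no metrics HTTP endpoint is listening).
    print("metrics endpoint not reachable:", exc.reason)
```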

**Expected behavior**

The metrics port should be accessible and provide metrics data when the Triton Inference Server is started using the in-process Python API.

**Temporary Workaround**

As a temporary solution, we have started an HTTP server manually to serve the metrics endpoint:

```python
import tritonserver
import uvicorn
import threading
from fastapi import FastAPI
from starlette.responses import Response

# Initialize and start the Triton server
self._triton_server = tritonserver.Server(
    model_repository=['/mount/data/models'],
    model_control_mode=tritonserver.ModelControlMode.EXPLICIT,
)
self._triton_server.start(wait_until_ready=True)
self._triton_server.load('clip')
self._model = self._triton_server.model('clip')

# Set up a FastAPI application to serve metrics
self.app = FastAPI()

@self.app.get("/metrics")
def get_metrics():
    output = self._triton_server.metrics()
    return Response(output, media_type="text/plain")

# Run the FastAPI app in a separate thread
def run():
    uvicorn.run(self.app, host="0.0.0.0", port=8002)

self.server = threading.Thread(target=run)
self.server.start()
```



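If it is useful to anyone else, the same workaround can be done without the FastAPI/uvicorn dependency. This is only a rough standard-library sketch of the same idea; like the code above, it assumes `Server.metrics()` returns the Prometheus-formatted metrics text:

```python
# Stdlib-only variant of the workaround above: expose Server.metrics()
# over HTTP from a background thread.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def serve_metrics(triton_server, host="0.0.0.0", port=8002):
    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            # Same call as in the FastAPI workaround; assumed to yield
            # Prometheus text.
            body = triton_server.metrics()
            if not isinstance(body, (bytes, bytearray)):
                body = str(body).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    httpd = HTTPServer((host, port), MetricsHandler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd
```

It would be called once after the server is started, e.g. `serve_metrics(self._triton_server)`.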
We would prefer to use the built-in functionality for serving metrics and avoid maintaining this workaround. Any suggestions or solutions would be greatly appreciated.
yucai commented 2 months ago

@nnshah1 We are using this API in Ray Data, very similar to what you did for Ray Serve in this example: https://github.com/triton-inference-server/tutorials/blob/main/Triton_Inference_Server_Python_API/examples/rayserve/tritonserver_deployment.py. A rough sketch of our usage pattern is below.
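For context, this is roughly the shape of our Ray Data usage; the dataset, model name, repository path, and batch handling are illustrative placeholders rather than our exact code:

```python
# Sketch: wrap the in-process Triton server in a Ray Data callable class,
# so each map_batches actor hosts its own server instance.
import ray
import tritonserver


class TritonPredictor:
    def __init__(self):
        self._triton_server = tritonserver.Server(
            model_repository=["/mount/data/models"],
            model_control_mode=tritonserver.ModelControlMode.EXPLICIT,
        )
        self._triton_server.start(wait_until_ready=True)
        self._triton_server.load("clip")
        self._model = self._triton_server.model("clip")

    def __call__(self, batch):
        # Run inference on `batch` with self._model here and attach the
        # outputs; the exact input/output handling depends on the model.
        return batch


ds = ray.data.from_items([{"text": "a photo of a cat"}])  # placeholder data
ds = ds.map_batches(TritonPredictor, concurrency=1)
print(ds.take(1))
```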