triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

[Question] Best practices to track inputs and predictions? #475

Open FernandoDorado opened 1 month ago

FernandoDorado commented 1 month ago

Hello,

I am seeking advice on the best practices for tracking all inputs and predictions made by a model when using Triton Inference Server. Specifically, I would like to track every interaction the model handles, including input data and the corresponding predictions.

I have reviewed the documentation for Triton Server Trace, but it is unclear whether this feature can track predictions as well. You can find the documentation here: Triton Server Trace Documentation.
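For reference, recent Triton releases expose trace settings through the client API, and the trace level can include a `TENSORS` mode that records input and output tensors in the trace file, which may cover "tracking predictions" depending on the version in use. A rough sketch, assuming `tritonclient`'s HTTP client and the `ensemble` model name from this backend's examples (check `get_trace_settings()` on your own server first):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# Inspect what the server currently traces (sketch; field names follow the
# HTTP trace-protocol extension and may differ across Triton versions).
print(client.get_trace_settings())

# Enable tensor-level tracing for every request (trace_rate=1) on one model.
# "ensemble" is an assumed model name; adjust to your deployment.
client.update_trace_settings(
    model_name="ensemble",
    settings={"trace_level": ["TIMESTAMPS", "TENSORS"], "trace_rate": "1"},
)
```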

Additionally, I am concerned about the impact of tracking on system latency. While I am aware that solutions for traditional ML platforms (such as Seldon Core) often use technologies like Knative and Kafka to store tracking information, it is not clear how these approaches can be integrated with Triton without compromising performance.

I would appreciate recommendations on:

- best practices for capturing every input payload and the corresponding prediction when serving through Triton, and
- how to do so without noticeably increasing inference latency.

Thank you for your assistance.
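One client-side option, shown purely as a sketch, is to wrap the inference call and hand each request/response pair to a background writer so persistence never blocks the request path. The tensor names `text_input`/`text_output`, the `ensemble` model name, and the JSON-lines file sink are assumptions taken from this backend's ensemble example; swap in your own names and store (e.g. a Kafka producer) as needed:

```python
import json
import queue
import threading
import time

import numpy as np
import tritonclient.http as httpclient

# Records are handed to a background thread; a JSON-lines file is just a
# stand-in for whatever audit store you prefer (assumption, not prescribed).
_records = queue.Queue()

def _writer(path="interactions.jsonl"):
    with open(path, "a") as f:
        while True:
            f.write(json.dumps(_records.get()) + "\n")
            f.flush()

threading.Thread(target=_writer, daemon=True).start()

def infer_and_log(client, model_name, prompt):
    # "text_input"/"text_output" follow the ensemble example shipped with
    # tensorrtllm_backend; adjust names, shapes, and dtypes to your config.
    text_in = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text_in.set_data_from_numpy(np.array([[prompt]], dtype=object))

    result = client.infer(model_name, inputs=[text_in],
                          outputs=[httpclient.InferRequestedOutput("text_output")])

    completion = result.as_numpy("text_output").flatten()[0]
    if isinstance(completion, bytes):
        completion = completion.decode()

    # Only a queue.put() happens inline; the file write itself is asynchronous.
    _records.put({"ts": time.time(), "model": model_name,
                  "input": prompt, "output": completion})
    return completion

if __name__ == "__main__":
    client = httpclient.InferenceServerClient("localhost:8000")
    print(infer_and_log(client, "ensemble", "What is Triton Inference Server?"))
```

The obvious limitation of this approach is that every client has to opt in, which is why a server-side hook (tracing, or a wrapper model as sketched further below) is often preferred.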

byshiue commented 3 weeks ago

I don't really understand your question. Could you explain more? For "track all inputs and predictions", do you mean the input tokens/context and the generated tokens? If so, what do you mean by "track"? Currently, we already return the full input tokens and output tokens.

FernandoDorado commented 3 weeks ago

Hello @byshiue

I am asking whether there is a way to store all the input data (for example, the payload sent to the model to generate a prediction) together with the model's response to that payload, in order to analyse the model's behaviour and track all interactions.

This is an example of the requested functionality, but using another tool: https://docs.seldon.io/projects/seldon-core/en/latest/streaming/knative_eventing.html
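A server-side analogue of that Seldon request-logging pattern, sketched under assumptions rather than as an endorsed approach, is a small "wrapper" model in Triton's Python backend that forwards each request to the real model via Business Logic Scripting (BLS) and queues the request/response pair for a background writer. The tensor names `text_input`/`text_output`, the target model name `ensemble`, and the local JSONL sink are placeholders to adapt:

```python
# model_repository/request_logger/1/model.py  (hypothetical wrapper model)
import json
import queue
import threading
import time

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Background writer thread: persisting records off the request path
        # keeps the inline overhead to roughly one queue.put() per inference.
        self._records = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        # Placeholder sink: a local JSON-lines file. A Kafka producer or any
        # other asynchronous store could be swapped in without touching execute().
        with open("/tmp/interactions.jsonl", "a") as f:
            while True:
                f.write(json.dumps(self._records.get()) + "\n")
                f.flush()

    def _to_jsonable(self, tensor):
        # Decode BYTES tensors so the logged record is JSON-serialisable.
        values = tensor.as_numpy().flatten().tolist()
        return [v.decode() if isinstance(v, bytes) else v for v in values]

    def execute(self, requests):
        responses = []
        for request in requests:
            # "text_input"/"text_output" and the target model name "ensemble"
            # are assumptions; match them to your own model configuration.
            text_in = pb_utils.get_input_tensor_by_name(request, "text_input")

            bls_request = pb_utils.InferenceRequest(
                model_name="ensemble",
                requested_output_names=["text_output"],
                inputs=[text_in],
            )
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())

            text_out = pb_utils.get_output_tensor_by_name(bls_response, "text_output")

            # Queue the interaction for the background writer.
            self._records.put({
                "ts": time.time(),
                "input": self._to_jsonable(text_in),
                "output": self._to_jsonable(text_out),
            })

            responses.append(pb_utils.InferenceResponse(output_tensors=[text_out]))
        return responses
```

Clients would then call `request_logger` instead of the underlying model, so every interaction passes through the logging hop regardless of which client sent it; the cost is one extra BLS round trip per request.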