triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Provide native support for server-side tokenization #7473

Open WilliamOnVoyage opened 3 months ago

WilliamOnVoyage commented 3 months ago

Hi team,

Currently, the working pattern for server-side tokenization is for users to write a model.py with the Python backend to perform the tokenization, which is great for flexibility and customization.

That said, given the rise of language models and the popularity of a few common model/tokenizer architectures, I'm wondering whether you plan to provide native tokenizer support, so users can configure a tokenizer just through tokenizer artifacts and config.pbtxt.
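
For context, here is a minimal sketch of what that model.py typically looks like today. The tensor names (`TEXT`, `INPUT_IDS`, `ATTENTION_MASK`), the tokenizer path, and the use of Hugging Face `transformers` are just my assumptions for illustration and would have to match whatever is declared in config.pbtxt:

```python
# model.py -- minimal Python-backend model that only tokenizes incoming text.
# Tensor names and the tokenizer path below are illustrative assumptions.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load tokenizer artifacts shipped alongside the model version
        # (path is an assumption; adjust to your model repository layout).
        self.tokenizer = AutoTokenizer.from_pretrained("/models/tokenizer/1/hf_tokenizer")

    def execute(self, requests):
        responses = []
        for request in requests:
            # BYTES input tensor holding the raw strings to tokenize.
            text_tensor = pb_utils.get_input_tensor_by_name(request, "TEXT")
            texts = [t.decode("utf-8") for t in text_tensor.as_numpy().flatten()]

            encoded = self.tokenizer(texts, padding=True, return_tensors="np")

            # Output tensors must match the dims/dtypes declared in config.pbtxt.
            out_ids = pb_utils.Tensor("INPUT_IDS", encoded["input_ids"].astype(np.int64))
            out_mask = pb_utils.Tensor("ATTENTION_MASK", encoded["attention_mask"].astype(np.int64))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_ids, out_mask]))
        return responses
```

Native support would let users skip writing and maintaining this boilerplate for the common cases.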

rmccorm4 commented 3 months ago

Hi @WilliamOnVoyage, I believe both the vLLM and TensorRT-LLM backends handle tokenization internally without requiring any user code changes, and they are configurable through their respective config files or based on the model being used. Does this satisfy your needs?
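
For example, with the vLLM backend the tokenizer is resolved from the model itself, so the per-model configuration is roughly a model.json of vLLM engine arguments along these lines (the model ID and values here are only an illustrative sketch):

```json
{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.5
}
```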