HQ01 opened this issue 7 months ago
> In our specific use case, pip install at runtime

How about building your own image on top of the `xx.yy-py3` image? That way you will not need to run pip at runtime or rely on conda-pack.
> Given the prevalence of using Triton Server for NLP-related workloads

In our case, we use Triton for computer vision models and would not need `transformers` installed.
FROM nvcr.io/nvidia/tritonserver:XX.YY-py3
RUN pip install transformers --no-cache-dir
This Dockerfile will do what you need and won't require everyone to have transformers installed by default. Maybe this could work?
Unfortunately, we cannot install these libraries, as doing so can increase the container size significantly, and there are many other customers asking for different libraries to be included. If we accommodated all of these requests, the container would be much larger than it already is. Creating conda-pack environments or building custom images are our only recommendations at this point. Let us know if you have any other suggestions that might help with this issue.
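For reference, the conda-pack route mentioned above can be sketched as follows. The environment name, output path, and model layout are assumptions for illustration; conda-pack exposes a small Python API in addition to its CLI.

```python
# Minimal sketch: pack a pre-built conda environment that already has
# transformers installed, so a Python-backend model can use it without
# the library being baked into the Triton image.
# "hf_env" and the output path are hypothetical names for illustration.
import conda_pack

conda_pack.pack(
    name="hf_env",  # conda env created beforehand, e.g. with pip install transformers
    output="model_repository/preprocessing/hf_env.tar.gz",
)
```

The model's `config.pbtxt` then points the Python backend at the tarball via the `EXECUTION_ENV_PATH` parameter (e.g. `$$TRITON_MODEL_DIRECTORY/hf_env.tar.gz`), as described in the python_backend documentation.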
**Is your feature request related to a problem? Please describe.**

I find that the current Docker image `xx.yy-py3` doesn't have commonly used data preprocessing libraries, such as Hugging Face `transformers` for accessing a tokenizer. Missing this single package greatly limits our ability to use triton-inference-server with its ensemble model feature. In our specific use case, pip install at runtime or using `conda-pack` is highly discouraged for various reasons. This is somewhat similar to https://github.com/triton-inference-server/server/issues/6467 and I believe it may be common in many other industrial scenarios too.
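To make the use case concrete, the preprocessing step of such an ensemble is typically a Python-backend model along these lines. This is only a minimal sketch: the tensor names (`TEXT`, `INPUT_IDS`, `ATTENTION_MASK`) and the `bert-base-uncased` checkpoint are illustrative assumptions, not details from this issue. The point is that it cannot run unless `transformers` is importable inside the server container.

```python
# model.py — sketch of an ensemble preprocessing model for Triton's Python backend.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load the tokenizer once when the model instance starts.
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def execute(self, requests):
        responses = []
        for request in requests:
            # TEXT is assumed to be a BYTES tensor of shape [batch, 1].
            text = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            sentences = [row[0].decode("utf-8") for row in text]
            enc = self.tokenizer(
                sentences, padding=True, truncation=True, return_tensors="np"
            )
            out = [
                pb_utils.Tensor("INPUT_IDS", enc["input_ids"].astype(np.int64)),
                pb_utils.Tensor("ATTENTION_MASK", enc["attention_mask"].astype(np.int64)),
            ]
            responses.append(pb_utils.InferenceResponse(output_tensors=out))
        return responses
```

The tokenizer outputs would then be mapped to the downstream model's inputs in the ensemble's `config.pbtxt`.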
**Describe the solution you'd like**

Given the prevalence of using Triton Server for NLP-related workloads, I would suggest including the `transformers` library in the pre-built Docker image if possible.

**Describe alternatives you've considered**

There are other images, like `24.03-trtllm-python-py3`, that do come with `transformers` pre-installed. However, we need to serve BERT-like models, and according to https://github.com/triton-inference-server/tensorrtllm_backend/issues/368 there is no clear timeline for supporting them, so we have to rely on another backend (like ORT) to execute our model.

**Additional context**

Any thoughts / suggestions will be greatly appreciated!