huggingface / blog

Public repo for HF blog posts
https://hf.co/blog

How to package Hugging Face models into NVIDIA Triton Inference Server for deployment #972

Open nickaggarwal opened 1 year ago

nickaggarwal commented 1 year ago

I was recently deploying Hugging Face models on the Triton Inference Server, which helped me increase GPU utilization and serve multiple models from a single GPU.

I was not able to find good resources during the process.

@sayakpaul

sayakpaul commented 1 year ago

@nickaggarwal thank you so much!

In order for us to better understand this, could you provide an outline of the things that you want to cover in the tutorial?

Also cc: @osanseviero @philschmid

nickaggarwal commented 1 year ago

Hi @sayakpaul

The tutorial would cover how to take models from the Hugging Face Hub and package them into NVIDIA Triton, an open-source inference serving software. It would be a detailed four-step tutorial:

  1. Getting Started with Hugging Face
  2. Deploying a Hugging Face model on NVIDIA Triton
  3. Deploying Triton inference containers on Kubernetes
  4. Efficient utilization of GPUs

Within the tutorial I will also cover how to package and push files to the Triton model repository, how to deploy the model using a Hugging Face pipeline with the template method, and more.
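
To give a flavor of step 2: Triton's Python backend expects a model repository with one directory per model, holding a `config.pbtxt` and a numbered version folder containing a `model.py`. Below is a minimal sketch of such a `model.py` wrapping a Hugging Face pipeline; the tensor names ("TEXT", "LABEL") and the checkpoint are placeholders, not the final tutorial code:

```python
# model.py -- minimal sketch of a Triton Python-backend model wrapping a
# Hugging Face pipeline. Tensor names and the checkpoint are placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import pipeline


class TritonPythonModel:
    def initialize(self, args):
        # Runs once when Triton loads the model; keep the pipeline on the instance.
        self.pipe = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    def execute(self, requests):
        # Triton batches requests; return one response per request.
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            texts = [t.decode("utf-8") for t in raw.flatten()]
            preds = self.pipe(texts)
            labels = np.array(
                [p["label"].encode("utf-8") for p in preds], dtype=object
            )
            out = pb_utils.Tensor("LABEL", labels)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```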

sayakpaul commented 1 year ago

Sounds good to me, thanks! I will let @osanseviero and @philschmid chime in as well.

nickaggarwal commented 1 year ago

Thanks, @sayakpaul

@philschmid @osanseviero Do let me know your thoughts

ghost commented 1 year ago

I would be interested too in the tutorial. Nice idea @nickaggarwal

nickaggarwal commented 1 year ago

Glad to hear it, @dverdu-freepik.

Team, should I submit the tutorial blog here?

cc @sayakpaul

sayakpaul commented 1 year ago

@osanseviero a gentle ping.

alexanderfrey commented 1 year ago

That would indeed be awesome! I just stumbled upon your blog post "StackLLaMA: A hands-on guide to train LLaMA with RLHF" and the demo on that page. How did you do the deployment in that particular post? It's incredibly fast. Any information would be very welcome :)

ankit-db commented 1 year ago

+1, the docs for this should ideally be much clearer.

nickaggarwal commented 1 year ago

Thanks, folks! @osanseviero, a gentle reminder! I would love to contribute through this tutorial.

MohamedAliRashad commented 1 year ago

@nickaggarwal Can you also add streaming of the text output to your tutorial?

osanseviero commented 1 year ago

Hi there! Thanks a lot for the proposal!

We discussed this with the team, and we're not sure the blog will be the best place for this. This is more like a production guide for very specific hardware/use cases. We've had some blog posts like this in the past, but we realized that they didn't have good visibility for the amount of effort behind them. There are likely better venues to expose this kind of content to the community and we're always happy to amplify it!

timeleft-- commented 1 year ago

I think this is Triton specific information, and would be covered best by the Triton team. Does this Tutorial https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace cover what you need to know?

alexw994 commented 1 year ago

> I think this is Triton specific information, and would be covered best by the Triton team. Does this Tutorial https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace cover what you need to know?

That's useful.

alexw994 commented 1 year ago

Maybe we should also discuss a simple way to convert a Transformers model to TensorRT.
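
If it helps, one common route is Transformers → ONNX → TensorRT. A rough sketch using Optimum's ONNX exporter and TensorRT's `trtexec`; the model name and shape ranges are just examples:

```bash
# Export the model to ONNX with Optimum (model name is an example).
pip install "optimum[exporters]"
optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english onnx_out/

# Compile the ONNX graph into a TensorRT engine; trtexec ships with TensorRT.
trtexec --onnx=onnx_out/model.onnx \
        --saveEngine=model.plan \
        --minShapes=input_ids:1x1,attention_mask:1x1 \
        --optShapes=input_ids:8x128,attention_mask:8x128 \
        --maxShapes=input_ids:16x256,attention_mask:16x256
```

The resulting `model.plan` can then be dropped into a Triton model directory with `platform: "tensorrt_plan"` in its `config.pbtxt`.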

n-imas commented 1 year ago

Hi @nickaggarwal Do you have any resources on how to package and deploy Hugging Face into Nvidia Triton Inference? Many thanks!

vzip commented 1 year ago

Yes, some ;)


nickaggarwal commented 1 year ago

Thanks, folks, for showing interest in the tutorial. We ended up publishing it on our blog. You can access it here: https://www.inferless.com/learn/nvidia-triton-inference-inferless

satendrakumar commented 1 year ago

I used conda-pack to package the dependencies for the Triton server:

```bash
conda create -k -y -n hf-sentiment python=3.10
conda activate hf-sentiment
pip install numpy conda-pack
pip install torch==1.13.1
pip install transformers==4.21.3
# Optional: only if you hit "nvidia triton version `GLIBCXX_3.4.30' not found"
conda install -c conda-forge gcc=12.1.0
conda pack -o hf-sentiment.tar.gz
```

Here is a complete example running a Hugging Face sentiment model (cardiffnlp/twitter-roberta-base-sentiment-latest):

Code: https://github.com/satendrakumar/huggingface-triton-server
Blog: https://satendrakumar.in/2023/08/07/deploying-hugging-face-model-on-nvidia-triton-inference-server/
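
The packed environment is hooked up through the model's `config.pbtxt` via the Python backend's `EXECUTION_ENV_PATH` parameter. A minimal sketch; the model and tensor names here are illustrative:

```
name: "hf_sentiment"
backend: "python"
input [ { name: "TEXT", data_type: TYPE_STRING, dims: [ 1 ] } ]
output [ { name: "LABEL", data_type: TYPE_STRING, dims: [ 1 ] } ]
instance_group [ { kind: KIND_GPU } ]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: { string_value: "$$TRITON_MODEL_DIRECTORY/hf-sentiment.tar.gz" }
}
```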

Jank14 commented 6 months ago

Hi, I followed the same tutorial to deploy an ASR model with a language-model processor. It is a Telugu model. It runs fine everywhere, and inside the Docker container the processor loads without any error if I open a Python REPL. But when I try launching the Triton server, it gives a Unicode error: `'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)`. I tried setting the encoding to UTF-8 in the Python file as well, but it doesn't work. I followed the python_vit tutorial, where the model and processor are simply used in the Python file without exporting to ONNX. Can you please provide some guidance on what changes I should make?

nickaggarwal commented 6 months ago

Hi @Jank14

Seems like an issue with the input/output. Make sure the text you are sending to the endpoint is base64-encoded. If you can share a sample input, I'm happy to help.
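
For example, something along these lines with the Triton Python client; the model and tensor names here are guesses, adjust them to your `config.pbtxt`:

```python
# Minimal sketch: send base64-encoded audio to a Triton endpoint.
# Requires: pip install tritonclient[http]
import base64

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

with open("sample.wav", "rb") as f:
    payload = base64.b64encode(f.read())

# BYTES tensors are passed as numpy object arrays.
audio = httpclient.InferInput("AUDIO", [1], "BYTES")
audio.set_data_from_numpy(np.array([payload], dtype=object))

result = client.infer(model_name="asr_model", inputs=[audio])
print(result.as_numpy("TRANSCRIPT"))
```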

sayakpaul commented 6 months ago

@nickaggarwal would you be interested in authoring a guest post, like this one: https://huggingface.co/blog/mlabonne/merge-models?

Just checking with @osanseviero -- it should be okay, no?

nickaggarwal commented 6 months ago

> @nickaggarwal would you be interested in authoring a guest post, like this one: https://huggingface.co/blog/mlabonne/merge-models?
>
> Just checking with @osanseviero -- it should be okay, no?

@sayakpaul Sure, we would love to do this for NVIDIA Triton, with ensemble models.

Jank14 commented 6 months ago

> Hi @Jank14
>
> Seems like an issue with the input/output. Make sure the text you are sending to the endpoint is base64-encoded. If you can share a sample input, I'm happy to help.

Hey, it was actually an issue with the Triton server not recognising the Telugu tokens. I resolved it by running `export PYTHONIOENCODING="utf-8"` and `apt-get install locales && locale-gen en_US.UTF-8`. Thanks!
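
For anyone hitting the same error, the fix can be baked into the Triton image. A minimal sketch; the base image tag is just an example:

```dockerfile
# Sketch: bake the UTF-8 locale fix into the Triton image (tag is an example).
FROM nvcr.io/nvidia/tritonserver:23.08-py3

RUN apt-get update && apt-get install -y locales && locale-gen en_US.UTF-8

# Make Python and the shell default to UTF-8 so non-ASCII tokens load.
ENV PYTHONIOENCODING=utf-8 \
    LANG=en_US.UTF-8 \
    LC_ALL=en_US.UTF-8
```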

StephennFernandes commented 5 months ago

@Jank14 Hey, would you mind sharing your ASR Triton inference code?

I am pretty much stuck on understanding how you got the ASR model to work with the config file.