nimishbongale closed this 1 month ago.
The AWS SageMaker instructions are auto-generated by Hugging Face on the model cards. Hugging Face assumes that each repo contains PyTorch models, which is why you are getting a FileNotFoundError with this repo, which contains only ONNX models.
For deploying ONNX models to AWS SageMaker, there are some online guides you can follow. Here is one example that you can follow starting from the "Create an Inference Handler" section, and here is an example with Triton Inference Server.
You can also use Azure ML to deploy ONNX models. Here is a guide you can follow. For a more detailed example, you can look at this guide.
Please note that there are multiple ONNX models uploaded in this repo. You can follow this example to pick one of the ONNX models to load using Hugging Face's Optimum. Then you can use Optimum to manage the generation loop with model.generate(...) in your inference script. Optimum will use ONNX Runtime under the hood and prepare the inputs for you.
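For illustration, here is a minimal sketch of loading one ONNX variant with Optimum and running generation. The repo ID, subfolder name, prompt template, and generation settings below are assumptions to replace with the variant you actually pick:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Assumed repo and subfolder -- point these at the ONNX variant you picked.
model_id = "microsoft/Phi-3-mini-128k-instruct-onnx"
subfolder = "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"

tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder=subfolder)
model = ORTModelForCausalLM.from_pretrained(model_id, subfolder=subfolder)

# Phi-3 chat prompt format (assumed); adjust to match your own prompts.
prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt")

# Optimum runs the ONNX model with ONNX Runtime under the hood.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))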
Thanks for the help @kunal-vaishnavi
I'm unfortunately bound to using AWS SageMaker for my deployments, so I'll have to proceed with a custom inference script using Optimum like you mentioned.
Are there plans to include optimum or onnxruntime-genai support in the next TGI images within SageMaker?
> I'm unfortunately bound to using AWS SageMaker for my deployments, so I'll have to proceed with a custom inference script using Optimum like you mentioned.
If you create your own image on top of existing TGI images, you can install and use ONNX Runtime GenAI directly instead of Optimum for the best performance in your custom inference script. Here is an example inference script that you can modify for AWS SageMaker.
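For context, here is a minimal sketch of the core generation loop such a script typically contains. The model folder path and search options are placeholders, and the exact API calls may differ between ONNX Runtime GenAI versions:

import onnxruntime_genai as og

# Placeholder path -- point this at the ONNX variant copied into the container.
model = og.Model("/opt/ml/model")
tokenizer = og.Tokenizer(model)

prompt = "<|user|>\nWhat is ONNX Runtime GenAI?<|end|>\n<|assistant|>\n"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = input_tokens

# model.generate runs the full generation loop and returns the output sequences.
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))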
> Are there plans to include optimum or onnxruntime-genai support in the next TGI images within SageMaker?
According to this issue, Hugging Face's TGI currently doesn't support ONNX models. We will discuss internally to see if adding ONNX Runtime GenAI is possible.
Thanks once again @kunal-vaishnavi
I've actually used the model-generate.py file for the deployment and that has gone smoothly. However, a couple of observations:
"return_full_text":False
, which stops the model from echoing the user prompt. The onnxruntime-genai library however does not allow for setting this variable later on. Do let me know if you have any pointers in this regards, appreciate the help!
> The phi3 model prior to conversion supports an environment variable "return_full_text":False, which stops the model from echoing the user prompt. The onnxruntime-genai library however does not allow for setting this variable later on.
There isn't an environment variable to set this in ONNX Runtime GenAI. But you can filter out the user prompt with some additional logic in the inference script.
The input tokens are set here. https://github.com/microsoft/onnxruntime-genai/blob/bcf55a6dc563bc8b356128b47504d59a21c5ef2f/examples/python/model-generate.py#L40
After the output tokens have been generated, you can go through them to remove the first $N_b$ tokens per batch entry where $N_b$ is the length of the input tokens at batch entry $b$. https://github.com/microsoft/onnxruntime-genai/blob/bcf55a6dc563bc8b356128b47504d59a21c5ef2f/examples/python/model-generate.py#L45
Here is some pseudocode for a naive implementation.
# Strip the prompt tokens from each generated sequence so only the newly generated tokens remain.
output_tokens_without_user_prompt = []
for b in range(len(output_tokens)):
    N_b = len(input_tokens[b])  # number of prompt tokens for batch entry b
    without_user_prompt = output_tokens[b][N_b:]
    output_tokens_without_user_prompt.append(without_user_prompt)
Then, when you print the generated tokens, the user prompt will not be re-printed.
for i in range(len(prompts)):
    print(f'Prompt #{i}: {prompts[i]}')
    print()
    print(tokenizer.decode(output_tokens_without_user_prompt[i]))
    print()
Please note that this logic may need to be modified to handle padding in the input tokens when calculating $N_b$.
> At times the QA type (using the token streaming) is faster than just the direct generation, but that ideally shouldn't be the case.
We are able to repro this and we will look into it.
Thanks a lot for the detailed response @kunal-vaishnavi! Appreciate it, closing this discussion for now 👍
Describe the bug
The model card here mentions that the phi3-mini-128k-instruct-onnx model is directly deployable to SageMaker runtime using get_huggingface_llm_image_uri("huggingface", version="2.2.0") as the image URI. However, on deploying, SageMaker fails to recognize the ONNX model and attempts to find PyTorch weights, thus failing the build. I'm loading the model into the /opt/ml/model folder using s3_uri, and then setting the same path as HF_MODEL_ID.
To Reproduce
Steps to reproduce the behavior:
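A minimal sketch of the deployment described above, assuming the standard SageMaker Python SDK flow; the S3 path, role, and instance type are placeholders rather than the exact values used:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.2.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    # Placeholder S3 path to the ONNX repo contents packaged as model.tar.gz.
    model_data="s3://my-bucket/phi3-mini-128k-instruct-onnx/model.tar.gz",
    env={"HF_MODEL_ID": "/opt/ml/model"},
    role=role,
)

# Placeholder instance type; the container then fails while looking for PyTorch weights.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")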
Expected behavior
Model should get deployed seamlessly, just like the phi-3-mini-128k-instruct (non-ONNX) model does.