microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

Phi3 mini 128k instruct conversion to onnx yields model not deployable to AWS Sagemaker #915

Closed: nimishbongale closed this issue 1 month ago

nimishbongale commented 1 month ago

Describe the bug

The model card here mentions that the phi3-mini-128k-instruct-onnx model is directly deployable to the SageMaker runtime using get_huggingface_llm_image_uri("huggingface", version="2.2.0") as the image URI. However, on deployment, SageMaker fails to recognize the ONNX model and attempts to find PyTorch weights, so the build fails.

I'm loading the model into the /opt/ml/model folder via an S3 URI, and then setting HF_MODEL_ID to the same path.

[screenshot]

To Reproduce

Steps to reproduce the behavior:

  1. Try deploying the ONNX model to AWS SageMaker using the latest version of the SageMaker SDK (sagemaker==2.32.1); a sketch of the deployment call is below.
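
For context, here is a minimal sketch of the kind of deployment call described above. The bucket path, execution role, and instance type are placeholders, not values from the original report.

# Sketch of the failing deployment path; placeholder values throughout.
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

image_uri = get_huggingface_llm_image_uri("huggingface", version="2.2.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    model_data="s3://<bucket>/phi-3-mini-128k-instruct-onnx.tar.gz",  # packaged ONNX artifacts
    env={"HF_MODEL_ID": "/opt/ml/model"},  # path the artifacts are extracted to in the container
    role="<sagemaker-execution-role>",
)

# Fails at startup: the TGI container expects PyTorch weights and raises FileNotFoundError.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")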

Expected behavior

The model should deploy seamlessly, just like phi-3-mini-128k-instruct (non-ONNX) does.

Screenshots

[screenshot]


nimishbongale commented 1 month ago
[screenshot]
kunal-vaishnavi commented 1 month ago

The AWS SageMaker instructions are auto-generated by Hugging Face on the model cards. Hugging Face assumes that each repo contains PyTorch models, which is why you are getting a FileNotFoundError for this repo, which contains only ONNX models.

For deploying ONNX models to AWS SageMaker, there are some online guides you can follow. Here is an example you can start from, beginning at the "Create an Inference Handler" section. Here is an example with Triton Inference Server.

You can also use Azure ML to deploy ONNX models. Here is a guide you can follow. For a more detailed example, you can look at this guide.

Please note that there are multiple ONNX models uploaded in this repo. You can follow this example to pick one of the ONNX models and load it with Hugging Face's Optimum. Then you can use Optimum to manage the generation loop with model.generate(...) in your inference script, roughly as sketched below; Optimum uses ONNX Runtime under the hood and prepares the model inputs for you.
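
For illustration, here is a minimal sketch of that approach. The repo id is the Phi-3 mini 128k ONNX repo; the subfolder name is an assumed example of one variant, so check it against the actual repo layout (you may also need to pass file_name=... if the .onnx file is not named model.onnx), and verify that Optimum can load the variant you pick.

# Minimal sketch: load one ONNX variant with Optimum and run generation.
# The subfolder below is an assumed example; verify it against the repo contents.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

repo_id = "microsoft/Phi-3-mini-128k-instruct-onnx"
subfolder = "cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"  # assumed variant

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
model = ORTModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder)

# Phi-3 chat template markers; generation runs on ONNX Runtime under the hood.
prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))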

nimishbongale commented 1 month ago

Thanks for the help @kunal-vaishnavi

I'm unfortunately bound to using AWS Sagemaker for my deployments, so I'll have to proceed with a custom inference script using Optimum like you mentioned.

Are there plans to include optimum or onnxruntime-genai support in the next tgi images within sagemaker?

kunal-vaishnavi commented 1 month ago

I'm unfortunately bound to using AWS Sagemaker for my deployments, so I'll have to proceed with a custom inference script using Optimum like you mentioned.

If you create your own image on top of existing TGI images, you can install and use ONNX Runtime GenAI directly instead of Optimum for the best performance in your custom inference script. Here is an example inference script that you can modify for AWS SageMaker.
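
As a rough illustration, a custom inference.py along these lines could wrap ONNX Runtime GenAI behind the SageMaker Hugging Face Inference Toolkit's model_fn/predict_fn hooks. The og.* calls mirror the linked example script (examples/python/model-generate.py at the version discussed in this thread) and may differ in newer releases, so treat this as a sketch rather than a drop-in script.

# Sketch of a custom inference.py for a SageMaker container using ONNX Runtime GenAI.
# The og.* usage follows examples/python/model-generate.py from the version
# referenced in this thread; newer releases may change this API.
import onnxruntime_genai as og

def model_fn(model_dir):
    # model_dir is the directory SageMaker extracts the model artifacts into.
    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    prompt = data.get("inputs", "")

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=2048)
    params.input_ids = tokenizer.encode(prompt)

    output_tokens = model.generate(params)
    return {"generated_text": tokenizer.decode(output_tokens[0])}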

Are there plans to include optimum or onnxruntime-genai support in the next tgi images within sagemaker?

According to this issue, Hugging Face's TGI currently doesn't support ONNX models. We will discuss internally to see if adding ONNX Runtime GenAI is possible.

nimishbongale commented 1 month ago

Thanks once again @kunal-vaishnavi

I've actually used the model-generate.py file for the deployment, and that has gone smoothly. However, a couple of observations:

  1. The Phi-3 model prior to conversion supports an environment variable "return_full_text": False, which stops the model from echoing the user prompt. The onnxruntime-genai library, however, does not allow setting this variable later on.
  2. At times the QA flow (using token streaming) is faster than direct generation, but that ideally shouldn't be the case.

Do let me know if you have any pointers in this regard. Appreciate the help!

kunal-vaishnavi commented 1 month ago

The Phi-3 model prior to conversion supports an environment variable "return_full_text": False, which stops the model from echoing the user prompt. The onnxruntime-genai library, however, does not allow setting this variable later on.

There isn't an environment variable to set this in ONNX Runtime GenAI. But you can filter out the user prompt with some additional logic in the inference script.

The input tokens are set here. https://github.com/microsoft/onnxruntime-genai/blob/bcf55a6dc563bc8b356128b47504d59a21c5ef2f/examples/python/model-generate.py#L40

After the output tokens have been generated, you can go through them to remove the first $N_b$ tokens per batch entry where $N_b$ is the length of the input tokens at batch entry $b$. https://github.com/microsoft/onnxruntime-genai/blob/bcf55a6dc563bc8b356128b47504d59a21c5ef2f/examples/python/model-generate.py#L45

Here is some pseudocode for a naive implementation:

# Strip the echoed user prompt from each batch entry by dropping the first
# N_b tokens, where N_b is the number of input tokens for batch entry b.
output_tokens_without_user_prompt = []
for b in range(len(output_tokens)):
    N_b = len(input_tokens[b])
    without_user_prompt = output_tokens[b][N_b:]
    output_tokens_without_user_prompt.append(without_user_prompt)

Then, when you print the generated tokens, the user prompt will not be re-printed.

for i in range(len(prompts)): 
    print(f'Prompt #{i}: {prompts[i]}') 
    print() 
    print(tokenizer.decode(output_tokens_without_user_prompt[i])) 
    print() 

Please note that this logic may need to be modified to handle padding in the input tokens when calculating $N_b$.
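
If token-level bookkeeping around padding gets awkward, one hedged alternative (not part of the original example) is to strip the prompt at the string level after decoding. This assumes the decoded output begins with the original prompt text, which may not hold exactly if decoding normalizes whitespace or special tokens.

# Hedged alternative: remove the prompt as a string prefix after decoding.
for i in range(len(prompts)):
    full_text = tokenizer.decode(output_tokens[i])
    if full_text.startswith(prompts[i]):
        generated_only = full_text[len(prompts[i]):]
    else:
        generated_only = full_text  # fall back to the full decoded text
    print(f'Prompt #{i}: {prompts[i]}')
    print()
    print(generated_only)
    print()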

At times the QA flow (using token streaming) is faster than direct generation, but that ideally shouldn't be the case.

We are able to repro this and we will look into it.

nimishbongale commented 1 month ago

Thanks a lot for the detailed response @kunal-vaishnavi! Appreciate it, closing this discussion for now 👍