huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Support for Meta LLaMA 3 with ORTModelForCausalLM for Faster Inference #1856

Open saleshwaram opened 5 months ago

saleshwaram commented 5 months ago

Feature request

I would like to request support for using Meta LLaMA 3 with ORTModelForCausalLM for faster inference. This integration would let ONNX Runtime (ORT) optimize and accelerate inference for Meta LLaMA 3 models.
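
For reference, this is the kind of usage I have in mind (a sketch based on how ORTModelForCausalLM is used with other architectures; whether this already works for LLaMA 3 is exactly the question, and the checkpoint name is just an example):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B"

# export=True converts the PyTorch checkpoint to ONNX on the fly
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)

# generation then runs through ONNX Runtime instead of plain PyTorch
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```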

Motivation

Currently, there is no documented support for integrating Meta LLaMA 3 with ORTModelForCausalLM on Hugging Face. Without it, inference runs through plain PyTorch, which is slower and can be a significant bottleneck in applications requiring real-time or near-real-time responses. Supporting this integration would greatly improve the performance and usability of Meta LLaMA 3 models, particularly in production environments where inference speed is critical.

Your contribution

While I may not have the expertise to implement this feature myself, I am willing to assist with testing and providing feedback on the integration process. Additionally, I can help with documentation and usage examples once the feature is implemented.

IlyasMoutawwakil commented 5 months ago

Hi! Are you sure LLaMA 3 doesn't work? It has the same architecture/model_type as LLaMA 2, so it should work out of the box. I'm running a script locally to export it and check; the export is going smoothly with meta-llama/Meta-Llama-3-8B.
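
The export script is essentially the following (a sketch; the output directory name is arbitrary, and the gated checkpoint requires accepting the license on the Hub first):

```python
from optimum.onnxruntime import ORTModelForCausalLM

# export=True runs the ONNX export on the fly; save_pretrained writes the
# result to disk so later runs can skip the (slow) conversion step
model = ORTModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", export=True
)
model.save_pretrained("llama3-8b-onnx")

# subsequent runs can load the already-exported model directly
model = ORTModelForCausalLM.from_pretrained("llama3-8b-onnx")
```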