intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch that can easily obtain a performance boost on Intel platforms
Apache License 2.0

intel-extension-for-pytorch 2.1.0.dev+cpu.llm experiment reproduction #540

Open duyanyao opened 5 months ago

duyanyao commented 5 months ago

Describe the issue

Hello, I have recently been reproducing the LLaMA experiment (https://intel.github.io/intel-extension-for-pytorch/llm/cpu/), hoping to reach with int8 quantization the 35 ms/token mentioned in this article (https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-llama2-ai-hardware-sw-optimizations.html). At present I am running the example on AWS, on an instance of type m7i.metal-24xl, and measuring a latency of 45 ms/token. How can I replicate the 35 ms/token result on AWS?

The environment in the article is: 4th Gen Intel Xeon 8480, 2 sockets, 112 cores, 224 threads.

My experimental environment is: AWS instance (m7i.metal-24xl), 4th Gen Intel Xeon 8488C, 1 socket, 48 cores, 96 threads.
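For reference, per-token latency in these runs can be measured with a short generation loop like the sketch below. This is a minimal illustration, not the script from the example repo: the model name and generation length are placeholders, and bf16 is shown instead of the int8 smooth-quant path for brevity. It assumes `ipex.llm.optimize`, the entry point in IPEX 2.2+ (the 2.1 LLM preview exposed `ipex.optimize_transformers` instead).

```python
# Minimal sketch: measuring per-token decode latency with IPEX on CPU.
# Assumptions (not from this thread): Llama-2-7b as the model, bf16 instead
# of the int8 smooth-quant flow, 128 generated tokens.
import time

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.eval()

# Apply IPEX LLM-specific optimizations (fused kernels, weight prepacking).
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)

inputs = tokenizer("What is AI?", return_tensors="pt")
new_tokens = 128

with torch.inference_mode(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    model.generate(**inputs, max_new_tokens=4)  # warm-up run
    start = time.time()
    model.generate(**inputs, max_new_tokens=new_tokens)
    elapsed = time.time() - start

# Rough average; the example scripts report first-token and next-token
# latency separately, which matters for short generation lengths.
print(f"{elapsed / new_tokens * 1000:.1f} ms/token over {new_tokens} tokens")
```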

kta-intel commented 5 months ago

Hi, I suggest taking a look at https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm for the best-known practices for our LLM optimizations, as well as https://intel.github.io/intel-extension-for-pytorch/cpu/2.2.0+cpu/tutorials/performance_tuning/tuning_guide.html for performance tuning.

In general, performance depends on many factors, and it may be challenging to completely replicate results unless you have an exact, or very similar, setup.
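As a concrete starting point for the tuning guide's advice, the settings below are roughly what it recommends for thread count and affinity. The core count assumes the 48-core m7i.metal-24xl above, and the guide's jemalloc/tcmalloc `LD_PRELOAD` step is omitted; in practice these variables are usually exported in the shell, or handled by the `ipexrun` launcher, before Python starts.

```python
# Minimal sketch of the threading/affinity settings the tuning guide covers.
# Assumption: 48 physical cores (m7i.metal-24xl); adjust to your instance.
import os

os.environ["OMP_NUM_THREADS"] = "48"                         # physical cores only
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # pin one thread per core
os.environ["KMP_BLOCKTIME"] = "1"                            # short OpenMP spin wait

import torch  # imported after the env vars so the OpenMP runtime picks them up

torch.set_num_threads(48)  # keep PyTorch off the SMT sibling threads
```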