Open duyanyao opened 5 months ago
Hi, I suggest taking a look at https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm for the best-known practices for our LLM optimizations, as well as https://intel.github.io/intel-extension-for-pytorch/cpu/2.2.0+cpu/tutorials/performance_tuning/tuning_guide.html for performance tuning.
In general, performance depends on many factors, and it may be challenging to completely replicate results unless you have an exact or very similar setup.
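As a starting point, the tuning guide linked above centers on thread-count and affinity settings plus NUMA binding. A minimal sketch for a single-socket 48-core instance follows; the script name is a placeholder, not the exact example from the repo, and the numactl line is commented out since it requires the numactl package.

```shell
# Hedged sketch of common knobs from the IPEX performance tuning guide,
# assuming a single-socket 48-core machine (e.g. m7i.metal-24xl).
export OMP_NUM_THREADS=48                        # one OpenMP thread per physical core
export KMP_BLOCKTIME=1                           # thread spin time after work completes
export KMP_AFFINITY=granularity=fine,compact,1,0 # pin threads to physical cores

# Bind compute and memory to the local NUMA node (requires numactl);
# run_llama_int8.py is a placeholder for the actual example script:
# numactl -C 0-47 -m 0 python run_llama_int8.py

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Whether these exact values help depends on the workload; the guide also covers memory allocators (tcmalloc/jemalloc), which can matter for LLM inference.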
Describe the issue
Hello, I have recently been running the LLaMA experiment (https://intel.github.io/intel-extension-for-pytorch/llm/cpu/) and hope that with int8 quantization I can reach the 35 ms/token mentioned in this article (https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-llama2-ai-hardware-sw-optimizations.html). At present I am running the example on AWS, on an m7i.metal-24xl instance, and getting a latency of 45 ms/token. How can I replicate the 35 ms/token result on AWS?
The environment in the article: 4th Gen Intel Xeon 8480, 2 sockets, 112 cores, 224 threads.
My experimental environment: AWS m7i.metal-24xl instance, 4th Gen Intel Xeon 8488C, 1 socket, 48 cores, 96 threads.
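One thing worth noting about the hardware gap: next-token generation is typically memory-bandwidth bound, so latency tends to scale with sockets (memory channels) rather than raw core count alone. A rough back-of-envelope sketch, using only the numbers above and assuming pure bandwidth scaling (an assumption, not a measurement):

```python
# Back-of-envelope estimate (assumption: per-token latency for generation is
# bandwidth-bound, so it scales inversely with socket count; real scaling is
# rarely this clean).
article_latency_ms = 35.0   # 2-socket Xeon 8480 result from the article
article_sockets = 2
aws_sockets = 1             # m7i.metal-24xl exposes a single socket

# Naive expectation for a single socket if the article's run used both sockets:
naive_single_socket_ms = article_latency_ms * article_sockets / aws_sockets
print(naive_single_socket_ms)  # 70.0
```

Under that (crude) assumption, the observed 45 ms/token on one socket is already ahead of the naive 70 ms/token extrapolation, which is consistent with the maintainer's point that matching 35 ms/token likely requires a comparable 2-socket setup.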