Open duyanyao opened 5 months ago
Hi, I suggest taking a look at https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm for the best-known practices for our LLM optimizations, as well as https://intel.github.io/intel-extension-for-pytorch/cpu/2.2.0+cpu/tutorials/performance_tuning/tuning_guide.html for performance tuning.
In general, performance depends on many factors, and it may be challenging to completely replicate results unless you have an exact or very similar setup.
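As a starting point, the tuning guide linked above centers on thread-count and affinity settings plus NUMA binding. A minimal sketch for a single-socket 48-core instance follows; the script name is a placeholder, not the exact example from the repo, and the numactl line is commented out since it requires the numactl package.

```shell
# Hedged sketch of common knobs from the IPEX performance tuning guide,
# assuming a single-socket 48-core machine (e.g. m7i.metal-24xl).
export OMP_NUM_THREADS=48                        # one OpenMP thread per physical core
export KMP_BLOCKTIME=1                           # thread spin time after work completes
export KMP_AFFINITY=granularity=fine,compact,1,0 # pin threads to physical cores

# Bind compute and memory to the local NUMA node (requires numactl);
# run_llama_int8.py is a placeholder for the actual example script:
# numactl -C 0-47 -m 0 python run_llama_int8.py

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Whether these exact values help depends on the workload; the guide also covers memory allocators (tcmalloc/jemalloc), which can matter for LLM inference.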
Describe the issue
Hello, I have recently been running the LLaMA experiment (https://intel.github.io/intel-extension-for-pytorch/llm/cpu/) and hope that with int8 quantization I can reach the 35 ms/token mentioned in this article (https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-llama2-ai-hardware-sw-optimizations.html). At present I am running the example on AWS, on an m7i.metal-24xl instance, and getting a latency of 45 ms/token. How can I replicate the 35 ms/token result on AWS?
The environment in the article: 4th Gen Intel Xeon 8480, 2 sockets, 112 cores, 224 threads.
My experimental environment: AWS m7i.metal-24xl instance, 4th Gen Intel Xeon 8488C, 1 socket, 48 cores, 96 threads.
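One thing worth noting about the hardware gap: next-token generation is typically memory-bandwidth bound, so latency tends to scale with sockets (memory channels) rather than raw core count alone. A rough back-of-envelope sketch, using only the numbers above and assuming pure bandwidth scaling (an assumption, not a measurement):

```python
# Back-of-envelope estimate (assumption: per-token latency for generation is
# bandwidth-bound, so it scales inversely with socket count; real scaling is
# rarely this clean).
article_latency_ms = 35.0   # 2-socket Xeon 8480 result from the article
article_sockets = 2
aws_sockets = 1             # m7i.metal-24xl exposes a single socket

# Naive expectation for a single socket if the article's run used both sockets:
naive_single_socket_ms = article_latency_ms * article_sockets / aws_sockets
print(naive_single_socket_ms)  # 70.0
```

Under that (crude) assumption, the observed 45 ms/token on one socket is already ahead of the naive 70 ms/token extrapolation, which is consistent with the maintainer's point that matching 35 ms/token likely requires a comparable 2-socket setup.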