Closed · faaany closed this pull request 1 month ago
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
https://github.com/huggingface/optimum-intel/pull/725 is now merged, you mind rebasing @faaany ?
Cool! Rebase done, pls have a review. Thx!
Hi @echarlaix, I manually checked the changes in #725 and fixed the bugs introduced by rebasing. All tests pass now, so I think we are good to go.
What does this PR do?
This PR refactors the current CPU llama inference code to make the code cleaner. The major changes are as follows:
- Add `_IPEXLlamaAttention` and move the attention-related OPs and attention forward code to `_IPEXLlamaAttention`
- Add `_IPEXLlamaMLP` and move the MLP-related OPs and forward code to `_IPEXLlamaMLP`
- In `_patch_llama_model`, rename `_IPEXLlamaDecoderLayerRef` to `_IPEXLlamaDecoderLayer`
- Split the forward of `_IPEXLlamaAttention` into `gemm`, `rope` and `sdpa`
Please note that this PR is based on the unmerged PR #725 by Jiqing, as can be seen in the commit history.