Closed · faaany closed this pull request 1 month ago
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
https://github.com/huggingface/optimum-intel/pull/725 is now merged, you mind rebasing @faaany ?
Cool! Rebase done, pls have a review. Thx!
Hi @echarlaix, I manually checked the changes in #725 and fixed the bugs introduced by rebasing. All tests pass now, so I think we are good to go.
What does this PR do?
This PR refactors the current CPU llama inference code to make the code cleaner. The major changes are as follows:
- Add `_IPEXLlamaAttention` and move the attention-related OPs and attention forward code to `_IPEXLlamaAttention`
- Add `_IPEXLlamaMLP` and move the MLP-related OPs and forward code to `_IPEXLlamaMLP`
- In `_patch_llama_model`, rename `_IPEXLlamaDecoderLayerRef` to `_IPEXLlamaDecoderLayer`
- Split the forward of `_IPEXLlamaAttention` into `gemm`, `rope` and `sdpa`
Please note that this PR is based on the unmerged PR #725 by Jiqing, as can be seen in the commit history.