Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc.
[NPU] dump prefill IR for further C++ solution #12402
Description
1. Why the change?
See https://github.com/analytics-zoo/nano/issues/1716#issue-2628191642. To support a pure C++ NPU solution, we need to provide a "compile" tool for users to save all needed files (IR / bin / blob).
2. User API changes
Added two parameters:
compile_full_model: if set to True, prefill-related IR / bin files will also be saved; defaults to False
save_directory: directory used to save all needed files (IR / bin / blob); defaults to None
If we just want to run inference on the Python side, the usage is unchanged.
If we want to dump the files for further C++ inference, pass both new parameters (see the sketch below).
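A minimal sketch of the intended usage, assuming the NPU AutoModelForCausalLM entry point from ipex_llm.transformers.npu_model; the model id, low-bit setting, and the other keyword arguments are illustrative, only compile_full_model and save_directory are part of this change:

```python
# Sketch only: apart from compile_full_model / save_directory, the keyword
# arguments below are illustrative and may differ from the real example scripts.
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",               # illustrative model id
    load_in_low_bit="sym_int4",             # assumed low-bit setting
    optimize_model=True,
    trust_remote_code=True,
    compile_full_model=True,                # new: also dump prefill-related IR / bin files
    save_directory="./qwen2-npu-compiled",  # new: where IR / bin / blob files are written
)
```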
3. Summary of the change
Add compile_full_model / save_directory to dump all needed files for further C++ inference support.
4. Verify correctness
Only Qwen2 is updated for now; this can be extended to other models later.
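As a quick sanity check (not part of this PR), one could list the save directory after compilation and confirm the expected IR / bin / blob files are present; the directory path and file extensions are assumptions here:

```python
# Hypothetical check: confirm the dump produced IR (.xml / .bin) and blob files;
# exact file names depend on the model and are not specified by this PR.
import os

save_dir = "./qwen2-npu-compiled"  # same illustrative directory as above
for name in sorted(os.listdir(save_dir)):
    print(name)
```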