Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc.
Apache License 2.0
6.75k stars, 1.27k forks
Support vpm and resampler module of minicpm-v on NPU #12375
Description
Update minicpm-v usage on NPU.
2. User API changes
Pass torch_dtype=torch.float32 and modules_to_not_convert=['vpm', 'resampler'] when loading the model (see the sketch below).
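For reference, a minimal loading sketch assuming the ipex-llm NPU entry point `ipex_llm.transformers.npu_model.AutoModel`; the model id and the low-bit settings are illustrative, and only `torch_dtype` and `modules_to_not_convert` reflect the API change described in this PR.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers.npu_model import AutoModel

# Illustrative model id; substitute a local MiniCPM-V checkpoint path as needed.
model_path = "openbmb/MiniCPM-V-2_6"

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.float32,                     # float32, per this PR's usage note
    trust_remote_code=True,
    load_in_low_bit="sym_int4",                    # illustrative low-bit setting
    optimize_model=True,
    modules_to_not_convert=['vpm', 'resampler'],   # keep vision tower and resampler unconverted
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```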
3. Summary of the change
Add MinicpmVPatchEmbedding and MinicpmVLayerNorm so that the vpm and resampler modules of MiniCPM-V can run on the NPU.
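As a purely illustrative sketch (not the code added by this PR), a LayerNorm wrapper of this kind typically reuses the original weights while forcing float32 computation, in line with the torch_dtype=torch.float32 requirement above; the class name below is hypothetical.

```python
import torch
import torch.nn as nn

class MiniCPMVLayerNormSketch(nn.Module):
    """Hypothetical float32 LayerNorm wrapper; not the actual MinicpmVLayerNorm."""

    def __init__(self, layer_norm: nn.LayerNorm):
        super().__init__()
        # Reuse the original affine parameters, cast to float32.
        self.register_buffer("weight", layer_norm.weight.detach().float())
        self.register_buffer("bias", layer_norm.bias.detach().float())
        self.normalized_shape = layer_norm.normalized_shape
        self.eps = layer_norm.eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute in float32 regardless of the incoming dtype.
        return torch.nn.functional.layer_norm(
            x.float(), self.normalized_shape, self.weight, self.bias, self.eps
        )
```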
4. How to test?