intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

MiniCPM-V 2.6 with llama.cpp cannot run and be accelerated on an A770 dGPU? #11982

Open · yangqing-yq opened 1 month ago

JinheTang commented 4 weeks ago

Hi @yangqing-yq, upgrading to ipex-llm[cpp]>=2.2.0b20240827 may solve this problem. Then you can run:

./llama-minicpmv-cli -m ../MiniCPM-V-2_6-gguf/ggml-model-Q4_0.gguf --mmproj ../MiniCPM-V-2_6-gguf/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg  -p "What is in the image?" -ngl 99

model page: openbmb/MiniCPM-V-2_6-gguf
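
For reference, a typical way to pull in that build follows the ipex-llm llama.cpp quickstart (a sketch only; the install command and the init-llama-cpp step are based on the quickstart and may differ for your platform):

pip install --pre --upgrade "ipex-llm[cpp]>=2.2.0b20240827"
cd /path/to/llama-cpp-workdir    # hypothetical path: the directory holding your llama.cpp binaries
init-llama-cpp                   # re-links the llama.cpp executables shipped with ipex-llm[cpp]

Note that -ngl 99 in the command above offloads all model layers to the GPU, which is what gives the Arc dGPU its acceleration.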

yangqing-yq commented 2 weeks ago

This is the result on an A750. Can you help confirm whether these values are correct, especially the TTFT of 4689 ms?! The input image is 1920x1080.

llama_print_timings:        load time =    6392.73 ms
llama_print_timings:      sample time =      43.04 ms /    73 runs   (    0.59 ms per token,  1696.29 tokens per second)
llama_print_timings: prompt eval time =    4689.01 ms /   904 tokens (    5.19 ms per token,   192.79 tokens per second)
llama_print_timings:        eval time =    1709.74 ms /    72 runs   (   23.75 ms per token,    42.11 tokens per second)
llama_print_timings:       total time =    8175.84 ms /   976 tokens

@qiuxin2012
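
(For reference, the printed numbers are internally consistent; a quick check with plain arithmetic, nothing repo-specific:)

python3 -c "print(904 / 4689.01 * 1000)"   # ~192.79, matches the reported prompt tokens per second
python3 -c "print(72 / 1709.74 * 1000)"    # ~42.11, matches the reported decode tokens per second

The TTFT here is essentially the prompt eval time: 904 prompt tokens, presumably including the encoded image tokens for the 1920x1080 input, at ~5.19 ms each gives ~4.7 s before the first generated token.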

yangqing-yq commented 1 week ago

@JinheTang @qiuxin2012

JinheTang commented 6 days ago

Hi @yangqing-yq, we tested it on our A750 machine and our results were similar to yours, so the numbers should be correct.
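
If you want to double-check that the run is actually offloaded to the dGPU, one option (a sketch, assuming a standard oneAPI installation at the default path) is:

source /opt/intel/oneapi/setvars.sh    # set up the oneAPI environment
sycl-ls                                # the Arc A750/A770 should show up as a GPU device

The llama.cpp startup log should also report the SYCL device it selected before loading the model.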