Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Add multi GPU support in the AutoModelForCausalLM.load_low_bit API. #11407
Every time I run the test, it loads the original model and converts it to lower bit.
For a 34B model on 4 Arc cards, this conversion takes a long time and loading the original model requires a huge amount of host CPU memory.
I want to save the model in low bit once and then load the saved low-bit model to run the test.
But currently only AutoModelForCausalLM.from_pretrained supports pipeline_parallel_stages=args.gpu_num;
the AutoModelForCausalLM.load_low_bit API does not support pipeline_parallel_stages=args.gpu_num. A sketch of the desired workflow follows.
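For clarity, here is a minimal sketch of the intended usage, assuming ipex-llm's Transformers-style from_pretrained / save_low_bit / load_low_bit API. The pipeline_parallel_stages argument to load_low_bit is the proposed addition (not an existing parameter), and the model paths and gpu_num value are placeholders.

```python
# Minimal sketch of the requested workflow, assuming ipex-llm's
# Transformers-style API. The pipeline_parallel_stages argument to
# load_low_bit is the PROPOSED addition; paths and gpu_num are placeholders.
from ipex_llm.transformers import AutoModelForCausalLM

gpu_num = 4  # e.g., 4 Arc cards

# One-time conversion: load the original checkpoint, convert to low bit,
# and save the converted weights so later runs can skip reconversion.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/original-34b-model",            # placeholder path
    load_in_low_bit="sym_int4",
    pipeline_parallel_stages=gpu_num,        # supported here today
)
model.save_low_bit("path/to/low-bit-model")  # placeholder path

# Desired fast path on subsequent runs: load the saved low-bit weights
# directly across the 4 cards, avoiding both the reconversion time and
# the host-memory spike from loading the original model.
model = AutoModelForCausalLM.load_low_bit(
    "path/to/low-bit-model",
    pipeline_parallel_stages=gpu_num,        # requested: not supported yet
)
```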