intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Quantized model loading method expects the model to be available locally. #11268

Open unrahul opened 4 months ago

unrahul commented 4 months ago

I use ipex-llm to quantize models and push them to the Hugging Face Hub, but load_low_bit expects the model to be available locally and can't take a repo id from the Hub.

It would be great if the model could also be loaded directly from the Hub, so end users don't have to quantize it themselves; that would make shipping the right model for the right platform much easier.

from ipex_llm.transformers import AutoModelForCausalLM
new_model = AutoModelForCausalLM.load_low_bit(model_path)  # breaks if model_path is a remote Hugging Face Hub repo id

Path: https://github.com/intel-analytics/ipex-llm/blob/70b17c87be259e2a42481a100b06062efff24bf6/python/llm/src/ipex_llm/optimize.py#L137
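In the meantime, one possible workaround is to fetch the quantized repo from the Hub yourself and then point load_low_bit at the local snapshot. This is only a sketch, not a built-in ipex-llm feature; the repo id below is an example from this thread, and it assumes huggingface_hub is installed:

from huggingface_hub import snapshot_download
from ipex_llm.transformers import AutoModelForCausalLM

local_path = snapshot_download(repo_id="unrahul/phi-2-fp4")  # downloads/caches the repo and returns a local directory
new_model = AutoModelForCausalLM.load_low_bit(local_path)    # load_low_bit works once given a local path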

jason-dai commented 4 months ago

Do you have an example Hugging Face Hub link that we can test?

unrahul commented 3 months ago

Here you go @jason-dai: https://huggingface.co/unrahul/phi-2-fp4. I have many models in all quantization formats at https://huggingface.co/unrahul, all produced with ipex-llm.
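For context, a minimal sketch of how such a low-bit checkpoint can be produced and saved with ipex-llm; the base model name and output directory are illustrative, and uploading the resulting folder to the Hub is a separate step (e.g. via huggingface_hub):

from ipex_llm.transformers import AutoModelForCausalLM

# Quantize at load time to FP4, then save the low-bit weights so they can be
# reloaded later with AutoModelForCausalLM.load_low_bit(...).
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", load_in_low_bit="fp4", trust_remote_code=True)
model.save_low_bit("./phi-2-fp4")  # this directory is what would be uploaded to the Hub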