intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Quantized model loading method expects the model to be available locally. #11268

Open unrahul opened 4 months ago

unrahul commented 4 months ago

I use ipex-llm to quantize models and push them to the Hugging Face Hub, but load_low_bit expects the model to be available locally and can't take a repo id from the Hub.

It would be great if the model could also be loaded directly from the Hub, so end users don't have to quantize it themselves; that would make shipping the right model for the right platform much easier.

from ipex_llm.transformers import AutoModelForCausalLM
new_model = AutoModelForCausalLM.load_low_bit(model_path)  # breaks if model_path is a remote Hugging Face Hub repo id

Path: https://github.com/intel-analytics/ipex-llm/blob/70b17c87be259e2a42481a100b06062efff24bf6/python/llm/src/ipex_llm/optimize.py#L137
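In the meantime, one possible workaround is to fetch the quantized repo from the Hub yourself and then point load_low_bit at the local snapshot. This is only a sketch, not a built-in ipex-llm feature; the repo id below is an example from this thread, and it assumes huggingface_hub is installed:

from huggingface_hub import snapshot_download
from ipex_llm.transformers import AutoModelForCausalLM

local_path = snapshot_download(repo_id="unrahul/phi-2-fp4")  # downloads/caches the repo and returns a local directory
new_model = AutoModelForCausalLM.load_low_bit(local_path)    # load_low_bit works once given a local path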

jason-dai commented 4 months ago

Do you have an example Hugging Face Hub link that we can test?

unrahul commented 3 months ago

Here you go @jason-dai: https://huggingface.co/unrahul/phi-2-fp4. I have many models in all quantization formats at https://huggingface.co/unrahul, all produced with ipex-llm.
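For context, a minimal sketch of how such a low-bit checkpoint can be produced and saved with ipex-llm; the base model name and output directory are illustrative, and uploading the resulting folder to the Hub is a separate step (e.g. via huggingface_hub):

from ipex_llm.transformers import AutoModelForCausalLM

# Quantize at load time to FP4, then save the low-bit weights so they can be
# reloaded later with AutoModelForCausalLM.load_low_bit(...).
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", load_in_low_bit="fp4", trust_remote_code=True)
model.save_low_bit("./phi-2-fp4")  # this directory is what would be uploaded to the Hub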