huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

GGUF interaction with Transformers using AutoModel Class #30889

Open Abdullah-kwl opened 3 months ago

Abdullah-kwl commented 3 months ago

Feature request

https://huggingface.co/docs/transformers/main/en/gguf — the documentation above shows how to load a GGUF model and provides this simple example:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
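
Once that snippet loads, generation works like it does for any other transformers model. A minimal usage sketch (not part of the documented example), assuming the model and tokenizer above were created successfully:

# Quick generation check with the model and tokenizer loaded above.
inputs = tokenizer("Tell me a story about llamas.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))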

But when I run the code it shows the error: OSError: TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack. Sometimes it also says that my transformers installation may be outdated.

Even after updating transformers it shows the same error and does not load the GGUF model. Please add support for loading GGUF models.

Motivation

It would make it possible to load GGUF models without the help of other libraries such as llama.cpp and ollama.

Your contribution

I do not have a complete implementation in mind, but I suggest starting from the method described in https://huggingface.co/docs/transformers/main/en/gguf

younesbelkada commented 3 months ago

Hi @Abdullah-kwl Thanks for the issue! Can you make sure you have the latest transformers installed? pip install -U transformers
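
A quick way to sanity-check the environment — a minimal sketch, assuming GGUF loading also relies on the separate gguf package (pip install gguf):

import importlib.metadata

# Report the installed versions; a missing package raises PackageNotFoundError.
print(importlib.metadata.version("transformers"))
print(importlib.metadata.version("gguf"))  # assumption: required for gguf_file= loading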

Abdullah-kwl commented 3 months ago

Yes, I am using the latest transformers library. In Colab I am using the same command to install transformers: pip install -U transformers

younesbelkada commented 3 months ago

Thanks @Abdullah-kwl , will try to repro and report back here

younesbelkada commented 3 months ago

Hi @Abdullah-kwl I successfully ran the code in a fresh new Google Colab env: https://colab.research.google.com/drive/1rfJZp3DsbavH6IFo-rXUDNvqwFJVIbOb?usp=sharing Note we do run a bunch of tests: https://github.com/huggingface/transformers/blob/main/tests/quantization/ggml/test_ggml.py and they all pass on our end as of today!

Abdullah-kwl commented 3 months ago

@younesbelkada Yes, I ran it again and now it is working; there was a dependency conflict with other libraries, but now it runs.

But now I am facing the problem that my session crashes after using all available RAM. I think it loads the model into RAM, whereas when I use llama-cpp-python it does not load the whole model into RAM and I can easily run inference with larger models, even bigger than 7B. Is there a way to keep my session from crashing due to RAM usage?

Try it out with this model:

model_id = "TheBloke/WestLake-7B-v2-GGUF" filename = "westlake-7b-v2.Q2_K.gguf" model = AutoModel.from_pretrained(model_id, gguf_file=filename) tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)

Devy99 commented 2 months ago

Hello, I have a general question about using GGUF models with AutoModel.

Is there any difference in the implementation compared to the ctransformers library? Until now, I used GGUF models in Python with ctransformers, and it is really fast at generating responses. Using GGUF from AutoModel, instead, significantly slows down the inference time. Any clue about this? I am not sure whether this transformers compatibility with GGUF models is intended for the same usage as ctransformers or llama.cpp, or whether I am doing something wrong. Also, is there any way to set up "context_length" and "gpu_layers" offloading from AutoTokenizer?

Thanks!
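
One likely explanation, offered as an assumption rather than a confirmed answer: ctransformers and llama.cpp execute the quantized GGUF weights directly, while transformers converts them into a regular dequantized PyTorch model, so inference speed is that of an ordinary fp16/fp32 torch model. Accordingly there is no gpu_layers knob; device placement and context length are handled the usual transformers way. A minimal sketch:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# The GGUF file is converted to a regular PyTorch model, so it is placed on
# devices like any other model rather than by offloading a number of layers.
model = AutoModelForCausalLM.from_pretrained(
    model_id, gguf_file=filename, torch_dtype=torch.float16
)
model.to("cuda")  # assumes a GPU is available; drop this line for CPU-only
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)

# The maximum context length comes from the converted model config rather
# than a load-time argument.
print(model.config.max_position_embeddings)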

amyeroberts commented 1 month ago

cc @SunMarc