run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: support gguf for embed_model #10515

Closed · sirus20x6 closed this 4 months ago

sirus20x6 commented 8 months ago

Feature Description

It appears that GGUF isn't supported for embedding models:

...................................................................................................
llama_new_context_with_model: n_ctx      = 4000
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   500.00 MiB
llama_new_context_with_model: KV self size  =  500.00 MiB, K (f16):  250.00 MiB, V (f16):  250.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    15.83 MiB
llama_new_context_with_model:        CPU compute buffer size =   301.40 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
Model metadata: {'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '32768', 'general.name': 'cognitivecomputations_dolphin-2.6-mistral-7b-dpo-laser', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '7'}
/usr/lib/python3.11/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/lib/python3.11/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
Traceback (most recent call last):
  File "/code/git/ontherag/./myrag.py", line 49, in <module>
    service_context = ServiceContext.from_defaults(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/llama_index/service_context.py", line 191, in from_defaults
    embed_model = resolve_embed_model(embed_model)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/llama_index/embeddings/utils.py", line 84, in resolve_embed_model
    embed_model = HuggingFaceEmbedding(
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/llama_index/embeddings/huggingface.py", line 82, in __init__
    model = AutoModel.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3194, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory e5-mistral-7b-instruct.

Reason

I don't know enough about LlamaIndex to say what's stopping this from working.

Value of Feature

GGUF is smaller, faster, and self-contained in a single file, with almost no perplexity loss compared to fp16.

dosubot[bot] commented 8 months ago

🤖

Hey @sirus20x6, great to see you back! Diving into another challenge, I see? Let's tackle this one together.

Thank you for your feature request. I understand that you're looking for support for gguf in embedding models in LlamaIndex.

Based on the information provided, it seems that the error you're encountering is due to the absence of a file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack in the directory e5-mistral-7b-instruct. This error is raised by the from_pretrained method of the AutoModel class in the transformers library, which is used by the HuggingFaceEmbedding class in LlamaIndex.

As for gguf support, there have been similar requests in the past. For instance, this issue requested support for the Hugging Face Inference API or GGUF file type, and this issue asked about GGUF model support. Both issues were closed with pull requests that seem to address the requested features.

However, it's not clear from the current repository documentation whether gguf is fully supported in the latest version of LlamaIndex (v0.9.45.post1). The error message you're seeing suggests that the HuggingFaceEmbedding class is still expecting one of the file types mentioned above, not a gguf file.

To resolve this issue, you might want to check the following:

  1. Ensure that the model directory e5-mistral-7b-instruct contains a gguf file.
  2. Make sure you're using the latest version of LlamaIndex, as the pull requests addressing gguf support should be included in the latest release.
  3. If you're still encountering issues, it might be helpful to provide more information about how you're using LlamaIndex and the gguf file, so we can better understand the context and provide more targeted assistance.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

logan-markewich commented 8 months ago

@sirus20x6 I think Hugging Face technically supports it; I just haven't implemented the code to detect and load GGUF files.

If you can point me towards docs for how to load with huggingface, that will probably help motivate me to implement it haha
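
(More recent transformers releases can load GGUF checkpoints by dequantizing them, via a gguf_file argument to from_pretrained. A minimal sketch, assuming such a release and placeholder repo/file names:)

# Minimal sketch, assuming a transformers version with GGUF support
# (the gguf_file keyword); the repo id and file name below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<some-org/some-model-GGUF>"   # placeholder repo id
gguf_name = "<some-model.Q8_0.gguf>"     # placeholder file inside that repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_name)
# Weights are dequantized to full precision when loaded this way
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_name)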

logan-markewich commented 8 months ago

As a workaround, you can load the model and tokenizer yourself and pass them in:

from transformers import AutoModel, AutoTokenizer
from llama_index.embeddings import HuggingFaceEmbedding

model = AutoModel.from_pretrained("<model name or path>")
tokenizer = AutoTokenizer.from_pretrained("<model name or path>")
embed_model = HuggingFaceEmbedding(model=model, tokenizer=tokenizer)
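
The resulting embed_model can then be passed where the traceback above was failing; a rough sketch, assuming the v0.9-era ServiceContext API and a hypothetical ./data directory:

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

# Reuse the embed_model built above instead of letting resolve_embed_model
# try to load a Hugging Face checkpoint on its own
documents = SimpleDirectoryReader("./data").load_data()
service_context = ServiceContext.from_defaults(embed_model=embed_model)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
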
sirus20x6 commented 8 months ago

So I have to upload the model to Hugging Face and then use that format? Oh wait, I'm not sure what to even use for the tokenizer. Is that just the tokenizer.json from the original fp16 model?

logan-markewich commented 8 months ago

You tell me hahaha

I think it probably is that tokenizer, yeah.

Not saying you have to upload the model to Hugging Face, just that it has to be loaded with Hugging Face modules in order to work with HuggingFaceEmbedding.

Really though, for GGUF you should be using Ollama or llama.cpp (both of which we have support for)
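
(For the GGUF-as-LLM route, a minimal sketch of the llama.cpp integration, assuming the v0.9-era import path and a hypothetical local .gguf path:)

from llama_index.llms import LlamaCPP

# Runs the GGUF file through llama.cpp; the path below is a placeholder
llm = LlamaCPP(
    model_path="./models/dolphin-2.6-mistral-7b-dpo-laser.Q8_0.gguf",
    temperature=0.1,
    context_window=4000,
)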

sirus20x6 commented 8 months ago

How do I use llama.cpp in LlamaIndex for embeddings?

logan-markewich commented 8 months ago

@sirus20x6 you can't at the moment (and tbh I really wouldn't recommend it either -- using an LLM for embeddings is pretty subpar; an actual model trained for embeddings, e.g. bge, is ideal)
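
(For reference, a dedicated embedding model along those lines drops in with a one-liner; a small sketch assuming the v0.9-era imports and the BAAI/bge-small-en-v1.5 checkpoint:)

from llama_index import ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding

# A compact model trained specifically for embeddings, loaded by name
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
service_context = ServiceContext.from_defaults(embed_model=embed_model)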

sirus20x6 commented 8 months ago

According to the MTEB embedding leaderboard (https://huggingface.co/spaces/mteb/leaderboard), SFR-Embedding-Mistral is the highest-ranking embedding model. The GGUF is found here: https://huggingface.co/dranger003/SFR-Embedding-Mistral-GGUF

logan-markewich commented 8 months ago

If you want to contribute an embedding class to support this, I encourage you to :)

(Imo, maybe a spicy take, a 14GB embedding model that barely beats a 1GB embedding model doesn't feel very worth it)

sirus20x6 commented 8 months ago

Well, I spent good money on 512GB of RAM for a reason.

dosubot[bot] commented 5 months ago

Hi, @sirus20x6,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you raised this issue to request GGUF support for embedding models in LlamaIndex. The lack of support was addressed with a workaround where users can load the model and tokenizer from Hugging Face and pass them into the HuggingFaceEmbedding class. There was also a discussion about using llama.cpp for embeddings, but it was discouraged in favor of models trained specifically for embeddings.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!