run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: HuggingFaceEmbedding with CLIP and other image embeddings #15519

Open alessandro-neri opened 4 weeks ago

alessandro-neri commented 4 weeks ago

Bug Description

While Sentence Transformer enables using embedding models such as CLIP, HuggingFaceEmbedding does not work.

Version

0.10.67.post1

Steps to Reproduce

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/clip-ViT-B-32")

Relevant Logs/Tracebacks


AttributeError                            Traceback (most recent call last)
Cell In[27], line 13
     10 from llama_index.embeddings.huggingface import HuggingFaceEmbedding
     12 # Initialize the embedding model
---> 13 embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/clip-ViT-B-32")
     15 # Function to download and embed image
     16 def embed_image(url):

File ~/anaconda3/envs/xrx-reasoning/lib/python3.10/site-packages/llama_index/embeddings/huggingface/base.py:169, in HuggingFaceEmbedding.__init__(self, model_name, tokenizer_name, pooling, max_length, query_instruction, text_instruction, normalize, model, tokenizer, embed_batch_size, cache_folder, trust_remote_code, device, callback_manager, parallel_process, target_devices, **model_kwargs)
    167     self._model.max_seq_length = max_length
    168 else:
--> 169     max_length = self._model.max_seq_length
    171 super().__init__(
    172     embed_batch_size=embed_batch_size,
    173     callback_manager=callback_manager,
   (...)
    178     text_instruction=text_instruction,
    179 )

File ~/anaconda3/envs/xrx-reasoning/lib/python3.10/site-packages/torch/nn/modules/module.py:1729, in Module.__getattr__(self, name)
   1727 if name in modules:
   1728     return modules[name]
-> 1729 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'SentenceTransformer' object has no attribute 'max_seq_length'

dosubot[bot] commented 4 weeks ago

Hey @alessandro-neri! I'm here to help with bugs, questions, and more while you wait for a human maintainer. Let's squash those bugs together! 🐞

The AttributeError you're encountering with the HuggingFaceEmbedding class when loading "sentence-transformers/clip-ViT-B-32" occurs because the SentenceTransformer object does not expose a max_seq_length attribute for this model. clip-ViT-B-32 is a vision-text (CLIP) model, and max_seq_length is an attribute typically defined only by text transformer modules.

Here is the relevant code snippet that attempts to set the max_seq_length:

self._model = SentenceTransformer(
    model_name,
    device=self._device,
    cache_folder=cache_folder,
    trust_remote_code=trust_remote_code,
    prompts={
        "query": query_instruction
        or get_query_instruct_for_model_name(model_name),
        "text": text_instruction
        or get_text_instruct_for_model_name(model_name),
    },
    **model_kwargs,
)
if max_length:
    self._model.max_seq_length = max_length
else:
    max_length = self._model.max_seq_length

Since the clip-ViT-B-32 model does not have a max_seq_length attribute, attempting to set or access it results in an AttributeError [1].
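The mechanism can be illustrated with a minimal, self-contained sketch (toy classes standing in for the real libraries, not the actual sentence-transformers implementation): SentenceTransformer delegates max_seq_length to its first module, and a CLIP-style module simply does not define it, so reads fail while writes succeed.

```python
class TransformerModule:
    """Stand-in for a text Transformer module, which defines max_seq_length."""
    max_seq_length = 512


class CLIPModule:
    """Stand-in for a CLIP module, which defines no max_seq_length."""


class ToySentenceTransformer:
    def __init__(self, first_module):
        self._first = first_module

    @property
    def max_seq_length(self):
        # Reading delegates to the first module -> AttributeError for CLIP
        return self._first.max_seq_length

    @max_seq_length.setter
    def max_seq_length(self, value):
        # Writing just sets the attribute on the module, so it always works
        self._first.max_seq_length = value


text_model = ToySentenceTransformer(TransformerModule())
clip_model = ToySentenceTransformer(CLIPModule())

print(text_model.max_seq_length)   # 512

try:
    clip_model.max_seq_length      # mirrors the reported traceback
except AttributeError as exc:
    print("read failed:", exc)

clip_model.max_seq_length = 77     # an explicit set still succeeds
print(clip_model.max_seq_length)   # 77
```

This is why the `if max_length:` branch in the snippet above is safe (it only sets the attribute) while the `else` branch (which reads it) raises.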

To resolve this issue, you can either:

  1. Specify the max_length explicitly: This avoids the need to access the max_seq_length attribute.
  2. Use a different model: Choose a model that supports the max_seq_length attribute.

For models like clip-ViT-B-32, which do not expose a max_seq_length attribute, the code above will raise an error. Therefore, handling such cases by specifying the max_length explicitly or using a different model is necessary [1].
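Following option 1, the workaround would look like the sketch below. This is an assumption based on the code path shown above (passing max_length takes the `if max_length:` branch, so the attribute is only ever set, never read) and is untested here, since it requires downloading the model:

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# 77 is CLIP's usual text context length; adjust for your model.
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/clip-ViT-B-32",
    max_length=77,
)
```

Note that this only avoids the constructor error; whether HuggingFaceEmbedding then embeds text correctly with a CLIP model is a separate question.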

To continue talking to Dosu, mention @dosu.

logan-markewich commented 4 weeks ago

We have a specific class for clip, probably you should be using that?

logan-markewich commented 4 weeks ago

pip install llama-index-embeddings-clip

from llama_index.embeddings.clip import ClipEmbedding

embed_model = ClipEmbedding(model_name="...")

alessandro-neri commented 4 weeks ago

Hi, yes, I guess it works, but in my opinion it adds complexity... It would be so much easier if it could be done via SentenceTransformer or HuggingFace embeddings (which are better documented). Is there any documentation about this specific class? I really couldn't find it anywhere in the docs except an empty API reference.