run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How to use VLMs from HuggingFace for Multimodal rag? #16056

Open g-hano opened 2 days ago

g-hano commented 2 days ago

Question

I want to use "Qwen/Qwen2-VL-2B-Instruct" on my multimodal rag app. I tried OllamaMultiModal from llama_index.multi_modal_llms.ollama and it works fine. I want to use huggingface models as well

dosubot[bot] commented 2 days ago

To use HuggingFace models for multimodal retrieval-augmented generation (RAG) in your application, you can leverage the existing support for HuggingFace models in the llama_index.multi_modal_llms module. Here’s a general guide on how to integrate a HuggingFace model like "Qwen/Qwen2-VL-2B-Instruct":

  1. Install Dependencies: Ensure you have the HuggingFace packages installed (the class below also uses HuggingFaceLLM, which ships in llama-index-llms-huggingface):

    pip install huggingface-hub llama-index-llms-huggingface
  2. Define Your HuggingFace Model Class: You can create a class similar to OllamaMultiModal but for HuggingFace models. Here’s a simplified example:

    from typing import Any, Dict, Sequence

    from pydantic import Field, PrivateAttr

    # Exact import paths can vary between llama-index versions; these match
    # the current llama-index-core package layout.
    from llama_index.core.base.llms.types import (
        ChatMessage,
        ChatResponse,
        ChatResponseAsyncGen,
        ChatResponseGen,
        CompletionResponse,
        CompletionResponseAsyncGen,
        CompletionResponseGen,
    )
    from llama_index.core.constants import DEFAULT_CONTEXT_WINDOW, DEFAULT_NUM_OUTPUTS
    from llama_index.core.multi_modal_llms import MultiModalLLM, MultiModalLLMMetadata
    from llama_index.core.schema import ImageNode
    from llama_index.llms.huggingface import HuggingFaceLLM

    class HuggingFaceMultiModal(MultiModalLLM):
       model_name: str = Field(description="The HuggingFace model to use.")
       temperature: float = Field(
           default=0.75,
           description="The temperature to use for sampling.",
           ge=0.0,
           le=1.0,
       )
       context_window: int = Field(
           default=DEFAULT_CONTEXT_WINDOW,
           description="The maximum number of context tokens for the model.",
           gt=0,
       )
       additional_kwargs: Dict[str, Any] = Field(
           default_factory=dict,
           description="Additional model parameters for the HuggingFace model.",
       )

       # Keep the wrapped LLM in a private attribute so it does not clash
       # with the pydantic fields declared above.
       _model: Any = PrivateAttr()

       def __init__(self, **kwargs: Any) -> None:
           super().__init__(**kwargs)
           self._model = HuggingFaceLLM(model_name=self.model_name)
    
       @classmethod
       def class_name(cls) -> str:
           return "HuggingFace_multi_modal_llm"
    
       @property
       def metadata(self) -> MultiModalLLMMetadata:
           return MultiModalLLMMetadata(
               context_window=self.context_window,
               num_output=DEFAULT_NUM_OUTPUTS,
               model_name=self.model_name,
               is_chat_model=True,
           )
    
       def complete(self, prompt: str, image_documents: Sequence[ImageNode], **kwargs: Any) -> CompletionResponse:
           # Implement the completion logic using HuggingFace model
           pass
    
       def stream_complete(self, prompt: str, image_documents: Sequence[ImageNode], **kwargs: Any) -> CompletionResponseGen:
           # Implement the streaming completion logic using HuggingFace model
           pass
    
       def chat(self, messages: Sequence[ChatMessage], **kwargs: Any) -> ChatResponse:
           # Implement the chat logic using HuggingFace model
           pass
    
       def stream_chat(self, messages: Sequence[ChatMessage], **kwargs: Any) -> ChatResponseGen:
           # Implement the streaming chat logic using HuggingFace model
           pass
    
       async def acomplete(self, prompt: str, image_documents: Sequence[ImageNode], **kwargs: Any) -> CompletionResponse:
           # Implement the async completion logic using HuggingFace model
           pass
    
       async def astream_complete(self, prompt: str, image_documents: Sequence[ImageNode], **kwargs: Any) -> CompletionResponseAsyncGen:
           # Implement the async streaming completion logic using HuggingFace model
           pass
    
       async def achat(self, messages: Sequence[ChatMessage], **kwargs: Any) -> ChatResponse:
           # Implement the async chat logic using HuggingFace model
           pass
    
       async def astream_chat(self, messages: Sequence[ChatMessage], **kwargs: Any) -> ChatResponseAsyncGen:
           # Implement the async streaming chat logic using HuggingFace model
           pass
  3. Use the HuggingFace Model: Instantiate and use your HuggingFaceMultiModal class in your application:

    model = HuggingFaceMultiModal(
       model_name="Qwen/Qwen2-VL-2B-Instruct",
       temperature=0.7,
       context_window=2048,
       additional_kwargs={"some_param": "value"}
    )
    
    # Example usage
    response = model.complete(prompt="Your prompt here", image_documents=[])
    print(response.text)

This setup lets you use HuggingFace models in much the same way as OllamaMultiModal. Make sure to implement the completion, streaming completion, chat, and streaming chat methods according to your requirements.
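
As a rough starting point for complete, here is a minimal, untested sketch that drives Qwen2-VL directly through the transformers library. It assumes a recent transformers release that ships Qwen2VLForConditionalGeneration, and the generate helper below is purely illustrative; inside HuggingFaceMultiModal.complete you would pull the file paths out of the ImageNode objects (for example via their image_path field) and wrap the returned string in a CompletionResponse.

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the processor and model once and reuse them across calls.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

def generate(prompt: str, image_paths: list, max_new_tokens: int = 256) -> str:
    """Run a single multimodal completion with Qwen2-VL (illustrative helper)."""
    images = [Image.open(path).convert("RGB") for path in image_paths]
    messages = [
        {
            "role": "user",
            "content": [{"type": "image"} for _ in images]
            + [{"type": "text", "text": prompt}],
        }
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=images, padding=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]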

For more details, you can refer to the existing implementations of HuggingFace models in the llama_index.multi_modal_llms module [1].

To continue talking to Dosu, mention @dosu.

g-hano commented 2 days ago

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(path="qdrant_mm_db")

text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

documents = SimpleDirectoryReader("imgs").load_data(show_progress=True)
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    image_embed_model="local:openai/clip-vit-large-patch14"
)

Error log:

No sentence-transformers model found with name openai/clip-vit-large-patch14. Creating a new one with MEAN pooling.
{
    "name": "AttributeError",
    "message": "'CLIPConfig' object has no attribute 'hidden_size'",
    "stack": "---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 index = MultiModalVectorStoreIndex.from_documents(
      2     documents,
      3     storage_context=storage_context,
      4     image_embed_model=\"local:openai/clip-vit-large-patch14\"
      5 )

File c:\\Users\\Cihan\\Desktop\\llamaindex\\lindex\\Lib\\site-packages\\llama_index\\core\\indices\\base.py:145, in BaseIndex.from_documents(cls, documents, storage_context, show_progress, callback_manager, transformations, service_context, **kwargs)
    136     docstore.set_document_hash(doc.get_doc_id(), doc.hash)
    138 nodes = run_transformations(
    139     documents,  # type: ignore
    140     transformations,
    141     show_progress=show_progress,
    142     **kwargs,
    143 )
--> 145 return cls(
    146     nodes=nodes,
    147     storage_context=storage_context,
    148     callback_manager=callback_manager,
    149     show_progress=show_progress,
    150     transformations=transformations,
    151     service_context=service_context,
    152     **kwargs,
    153 )

File c:\\Users\\Cihan\\Desktop\\llamaindex\\lindex\\Lib\\site-packages\\llama_index\\core\\indices\\multi_modal\\base.py:80, in MultiModalVectorStoreIndex.__init__(self, nodes, index_struct, embed_model, storage_context, use_async, store_nodes_override, show_progress, image_vector_store, image_embed_model, is_image_to_text, is_image_vector_store_empty, is_text_vector_store_empty, service_context, **kwargs)
     56 def __init__(
     57     self,
     58     nodes: Optional[Sequence[BaseNode]] = None,
   (...)
     77     **kwargs: Any,
     78 ) -> None:
     79     \"\"\"Initialize params.\"\"\"
---> 80     image_embed_model = resolve_embed_model(
     81         image_embed_model, callback_manager=kwargs.get(\"callback_manager\", None)
     82     )
     83     assert isinstance(image_embed_model, MultiModalEmbedding)
     84     self._image_embed_model = image_embed_model

File c:\\Users\\Cihan\\Desktop\\llamaindex\\lindex\\Lib\\site-packages\\llama_index\\core\\embeddings\\utils.py:110, in resolve_embed_model(embed_model, callback_manager)
    107     cache_folder = os.path.join(get_cache_dir(), \"models\")
    108     os.makedirs(cache_folder, exist_ok=True)
--> 110     embed_model = HuggingFaceEmbedding(
    111         model_name=model_name, cache_folder=cache_folder
    112     )
    113 except ImportError:
    114     raise ImportError(
    115         \"`llama-index-embeddings-huggingface` package not found, \"
    116         \"please run `pip install llama-index-embeddings-huggingface`\"
    117     )

File c:\\Users\\Cihan\\Desktop\\llamaindex\\lindex\\Lib\\site-packages\\llama_index\\embeddings\\huggingface\\base.py:150, in HuggingFaceEmbedding.__init__(self, model_name, tokenizer_name, pooling, max_length, query_instruction, text_instruction, normalize, model, tokenizer, embed_batch_size, cache_folder, trust_remote_code, device, callback_manager, parallel_process, target_devices, **model_kwargs)
    147 if model_name is None:
    148     raise ValueError(\"The `model_name` argument must be provided.\")
--> 150 model = SentenceTransformer(
    151     model_name,
    152     device=device,
    153     cache_folder=cache_folder,
    154     trust_remote_code=trust_remote_code,
    155     prompts={
    156         \"query\": query_instruction
    157         or get_query_instruct_for_model_name(model_name),
    158         \"text\": text_instruction
    159         or get_text_instruct_for_model_name(model_name),
    160     },
    161     **model_kwargs,
    162 )
    163 if max_length:
    164     model.max_seq_length = max_length

File c:\\Users\\Cihan\\Desktop\\llamaindex\\lindex\\Lib\\site-packages\\sentence_transformers\\SentenceTransformer.py:205, in SentenceTransformer.__init__(self, model_name_or_path, modules, device, prompts, default_prompt_name, cache_folder, trust_remote_code, revision, token, use_auth_token, truncate_dim)
    197         modules = self._load_sbert_model(
    198             model_name_or_path,
    199             token=token,
   (...)
    202             trust_remote_code=trust_remote_code,
    203         )
    204     else:
--> 205         modules = self._load_auto_model(
    206             model_name_or_path,
    207             token=token,
    208             cache_folder=cache_folder,
    209             revision=revision,
    210             trust_remote_code=trust_remote_code,
    211         )
    213 if modules is not None and not isinstance(modules, OrderedDict):
    214     modules = OrderedDict([(str(idx), module) for idx, module in enumerate(modules)])

File c:\\Users\\Cihan\\Desktop\\llamaindex\\lindex\\Lib\\site-packages\\sentence_transformers\\SentenceTransformer.py:1203, in SentenceTransformer._load_auto_model(self, model_name_or_path, token, cache_folder, revision, trust_remote_code)
   1192 logger.warning(
   1193     \"No sentence-transformers model found with name {}. Creating a new one with MEAN pooling.\".format(
   1194         model_name_or_path
   1195     )
   1196 )
   1197 transformer_model = Transformer(
   1198     model_name_or_path,
   1199     cache_dir=cache_folder,
   1200     model_args={\"token\": token, \"trust_remote_code\": trust_remote_code, \"revision\": revision},
   1201     tokenizer_args={\"token\": token, \"trust_remote_code\": trust_remote_code, \"revision\": revision},
   1202 )
-> 1203 pooling_model = Pooling(transformer_model.get_word_embedding_dimension(), \"mean\")
   1204 return [transformer_model, pooling_model]

File c:\\Users\\Cihan\\Desktop\\llamaindex\\lindex\\Lib\\site-packages\\sentence_transformers\\models\\Transformer.py:114, in Transformer.get_word_embedding_dimension(self)
    113 def get_word_embedding_dimension(self) -> int:
--> 114     return self.auto_model.config.hidden_size

File c:\\Users\\Cihan\\Desktop\\llamaindex\\lindex\\Lib\\site-packages\\transformers\\configuration_utils.py:202, in PretrainedConfig.__getattribute__(self, key)
    200 if key != \"attribute_map\" and key in super().__getattribute__(\"attribute_map\"):
    201     key = super().__getattribute__(\"attribute_map\")[key]
--> 202 return super().__getattribute__(key)

AttributeError: 'CLIPConfig' object has no attribute 'hidden_size'"
}
dosubot[bot] commented 2 days ago

To resolve the AttributeError: 'CLIPConfig' object has no attribute 'hidden_size' when using MultiModalVectorStoreIndex with image_embed_model="local:openai/clip-vit-large-patch14", the image embedding model must be resolved to an instance of MultiModalEmbedding. The "local:" prefix instead routes the checkpoint through the text-only HuggingFaceEmbedding path, which cannot load a CLIP model.

In the provided code, image_embed_model is resolved by the resolve_embed_model function; make sure the value you pass resolves to an appropriate MultiModalEmbedding instance rather than to a HuggingFaceEmbedding.

Here is the relevant part of the MultiModalVectorStoreIndex class constructor:

def __init__(
    self,
    nodes: Optional[Sequence[BaseNode]] = None,
    index_struct: Optional[MultiModelIndexDict] = None,
    embed_model: Optional[BaseEmbedding] = None,
    storage_context: Optional[StorageContext] = None,
    use_async: bool = False,
    store_nodes_override: bool = False,
    show_progress: bool = False,
    # Image-related kwargs
    image_vector_store: Optional[BasePydanticVectorStore] = None,
    image_embed_model: EmbedType = "clip:ViT-B/32",
    is_image_to_text: bool = False,
    is_image_vector_store_empty: bool = False,
    is_text_vector_store_empty: bool = False,
    **kwargs: Any,
) -> None:
    """Initialize params."""
    image_embed_model = resolve_embed_model(
        image_embed_model, callback_manager=kwargs.get("callback_manager", None)
    )
    assert isinstance(image_embed_model, MultiModalEmbedding)
    self._image_embed_model = image_embed_model
    self._is_image_to_text = is_image_to_text
    self._is_image_vector_store_empty = is_image_vector_store_empty
    self._is_text_vector_store_empty = is_text_vector_store_empty
    storage_context = storage_context or StorageContext.from_defaults()

    if image_vector_store is not None:
        if self.image_namespace not in storage_context.vector_stores:
            storage_context.add_vector_store(
                image_vector_store, self.image_namespace
            )
        else:
            # overwrite image_store from storage_context
            storage_context.vector_stores[self.image_namespace] = image_vector_store

    if self.image_namespace not in storage_context.vector_stores:
        storage_context.add_vector_store(SimpleVectorStore(), self.image_namespace)

    self._image_vector_store = storage_context.vector_stores[self.image_namespace]

    super().__init__(
        nodes=nodes,
        index_struct=index_struct,
        embed_model=embed_model,
        storage_context=storage_context,
        show_progress=show_progress,
        use_async=use_async,
        store_nodes_override=store_nodes_override,
        **kwargs,
    )

As the constructor shows, image_embed_model must resolve to a MultiModalEmbedding (note the assert). With the "local:" prefix it instead resolves to a text-only HuggingFaceEmbedding, whose SentenceTransformer wrapper then fails when it looks up hidden_size on CLIPConfig, which is exactly the traceback above. The CLIP checkpoint therefore needs to be loaded through a multimodal embedding class.

The ClipEmbedding class from the llama_index library provides an example of how to correctly initialize and use a CLIP model for embedding:

class ClipEmbedding(MultiModalEmbedding):
    """CLIP embedding models for encoding text and image for Multi-Modal purpose."""

    embed_batch_size: int = Field(default=DEFAULT_EMBED_BATCH_SIZE, gt=0)

    _clip: Any = PrivateAttr()
    _model: Any = PrivateAttr()
    _preprocess: Any = PrivateAttr()
    _device: Any = PrivateAttr()

    def __init__(
        self,
        *,
        embed_batch_size: int = DEFAULT_EMBED_BATCH_SIZE,
        model_name: str = DEFAULT_CLIP_MODEL,
        **kwargs: Any,
    ):
        """Initializes the ClipEmbedding class."""
        if embed_batch_size <= 0:
            raise ValueError(f"Embed batch size {embed_batch_size}  must be > 0.")

        try:
            import clip
            import torch
        except ImportError:
            raise ImportError(
                "ClipEmbedding requires `pip install git+https://github.com/openai/CLIP.git` and torch."
            )

        super().__init__(
            embed_batch_size=embed_batch_size, model_name=model_name, **kwargs
        )
        self._device = "cuda" if torch.cuda.is_available() else "cpu"
        if self.model_name not in AVAILABLE_CLIP_MODELS:
            raise ValueError(
                f"Model name {self.model_name} is not available in CLIP."
            )
        self._model, self._preprocess = clip.load(
            self.model_name, device=self._device
        )

In practice, the fix is to make sure the image embedding model resolves to this ClipEmbedding class (or another MultiModalEmbedding) rather than to a text-only HuggingFaceEmbedding; that is what removes the AttributeError [1][2][3]. One possible way to do this is sketched below.
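
For example, the original snippet could pass a CLIP-based embedding explicitly instead of the "local:" string. The following is an untested sketch that assumes the llama-index-embeddings-clip integration package and the OpenAI CLIP checkpoint names it understands ("ViT-B/32", "ViT-L/14", ...); documents and storage_context are reused from the snippet above:

# pip install llama-index-embeddings-clip torch
# (ClipEmbedding also needs: pip install git+https://github.com/openai/CLIP.git)
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.embeddings.clip import ClipEmbedding

# Use a CLIP checkpoint name known to ClipEmbedding, not the HuggingFace
# repo id "openai/clip-vit-large-patch14".
clip_embedding = ClipEmbedding(model_name="ViT-L/14")

index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    image_embed_model=clip_embedding,  # or try the shorthand string "clip:ViT-L/14"
)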

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 day ago

@g-hano Hmm, we actually don't have a multi-modal huggingface llm class yet.

If you want to use open-source VLMs, you could look into using Ollama (a minimal sketch is included below)

Contributions are also welcome for a multi-modal HuggingFace class!
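
For reference, the Ollama route that already works for the author looks roughly like this (a sketch assuming the llama-index-multi-modal-llms-ollama package and a vision-capable model such as llava pulled in Ollama):

# pip install llama-index-multi-modal-llms-ollama
# ollama pull llava
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.ollama import OllamaMultiModal

mm_llm = OllamaMultiModal(model="llava")

# Image files are loaded as ImageDocument objects by SimpleDirectoryReader.
image_documents = SimpleDirectoryReader("imgs").load_data()

response = mm_llm.complete(
    prompt="Describe what is shown in these images.",
    image_documents=image_documents,
)
print(response.text)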