run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Bug]: LlamaIndex-DSPy integration issue when using HuggingFace Embeddings #14464

Closed: tituslhy closed this issue 1 month ago

tituslhy commented 4 months ago

Bug Description

I was following the cookbook (https://github.com/stanfordnlp/dspy/blob/main/examples/llamaindex/dspy_llamaindex_rag.ipynb) but changed the LLM and embedding model to non-OpenAI models. I hit an error when compiling my DSPy training pipeline with the HuggingFaceEmbedding class, but not with any other embedding class I tried. This is the issue I opened on the DSPy repo: https://github.com/stanfordnlp/dspy/issues/1209

Version

0.10.50

Steps to Reproduce

I was following the cookbook (https://github.com/stanfordnlp/dspy/blob/main/examples/llamaindex/dspy_llamaindex_rag.ipynb) but changed the following variables:

turbo = dspy.OllamaLocal(model="llama3")

and I specified a HuggingFace model as the embedding model (LlamaIndex falls back to OpenAI embeddings as its default if none is specified):

from llama_index.core import (
    SimpleDirectoryReader, 
    VectorStoreIndex,
    Settings
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Uses the default HuggingFace embedding model when no model_name is given
Settings.embed_model = HuggingFaceEmbedding()

docs = SimpleDirectoryReader(
    input_files=["paul_graham_essay.txt"]
).load_data()

index = VectorStoreIndex.from_documents(docs)

retriever = index.as_retriever(similarity_top_k=2)

I hit the following error when running this cell:

from dspy.teleprompt import BootstrapFewShot
from llama_index.core.evaluation import SemanticSimilarityEvaluator

evaluator = SemanticSimilarityEvaluator(similarity_threshold=0.5)

def validate_context_and_answer(example, pred, trace=None):
    """Check that the predicted answer is semantically similar
    to the reference answer."""
    result = evaluator.evaluate(response=pred.answer,
                                reference=example.answer)
    return result.passing

teleprompter = BootstrapFewShot(
    max_labeled_demos=0,
    metric=validate_context_and_answer
)

# Compile the DSPy pipeline (dspy_qp and train_examples are defined
# earlier in the cookbook)
compiled_dspy_qp = teleprompter.compile(dspy_qp, trainset=train_examples)

Relevant Logs/Tracebacks

Failed to run or to evaluate example Example({'query_str': 'What did the author do growing up?', 'answer': 'The author wrote short stories and also worked on programming.'}) (input_keys={'query_str'}) with <function validate_context_and_answer at 0x7feac38a1f80> due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
dosubot[bot] commented 4 months ago

To resolve the error "'HuggingFaceEmbedding' object has no attribute '_model'" when using the HuggingFaceEmbedding class in your DSPy training pipeline, ensure that the _model attribute is correctly initialized in the HuggingFaceEmbedding class. Here is the relevant part of the code that shows how _model should be initialized within the __init__ method:

class HuggingFaceEmbedding(BaseEmbedding):
    tokenizer_name: str = Field(description="Tokenizer name from HuggingFace.")
    max_length: int = Field(
        default=DEFAULT_HUGGINGFACE_LENGTH, description="Maximum length of input.", gt=0
    )
    pooling: Pooling = Field(default=None, description="Pooling strategy.")
    normalize: bool = Field(default=True, description="Normalize embeddings or not.")
    query_instruction: Optional[str] = Field(
        description="Instruction to prepend to query text."
    )
    text_instruction: Optional[str] = Field(
        description="Instruction to prepend to text."
    )
    cache_folder: Optional[str] = Field(
        description="Cache folder for huggingface files."
    )

    _model: Any = PrivateAttr()
    _tokenizer: Any = PrivateAttr()
    _device: str = PrivateAttr()

    def __init__(
        self,
        model_name: Optional[str] = None,
        tokenizer_name: Optional[str] = None,
        pooling: Optional[str] = None,
        max_length: Optional[int] = None,
        query_instruction: Optional[str] = None,
        text_instruction: Optional[str] = None,
        normalize: bool = True,
        model: Optional[Any] = None,
        tokenizer: Optional[Any] = None,
        embed_batch_size: int = DEFAULT_EMBED_BATCH_SIZE,
        cache_folder: Optional[str] = None,
        trust_remote_code: bool = False,
        device: Optional[str] = None,
        callback_manager: Optional[CallbackManager] = None,
    ):
        try:
            from transformers import AutoModel, AutoTokenizer
        except ImportError:
            raise ImportError(
                "HuggingFaceEmbedding requires transformers to be installed.\n"
                "Please install transformers with `pip install transformers`."
            )

        self._device = device or infer_torch_device()

        cache_folder = cache_folder or get_cache_dir()
        if model is None:  # Use model_name with AutoModel
            model_name = (
                model_name
                if model_name is not None
                else DEFAULT_HUGGINGFACE_EMBEDDING_MODEL
            )
            model = AutoModel.from_pretrained(
                model_name, cache_dir=cache_folder, trust_remote_code=trust_remote_code
            )
        elif model_name is None:  # Extract model_name from model
            model_name = model.name_or_path
        self._model = model.to(self._device)

        if tokenizer is None:  # Use tokenizer_name with AutoTokenizer
            tokenizer_name = (
                model_name or tokenizer_name or DEFAULT_HUGGINGFACE_EMBEDDING_MODEL
            )
            tokenizer = AutoTokenizer.from_pretrained(
                tokenizer_name, cache_dir=cache_folder
            )
        elif tokenizer_name is None:  # Extract tokenizer_name from model
            tokenizer_name = tokenizer.name_or_path
        self._tokenizer = tokenizer

        if max_length is None:
            try:
                max_length = int(self._model.config.max_position_embeddings)
            except AttributeError as exc:
                raise ValueError(
                    "Unable to find max_length from model config. Please specify max_length."
                ) from exc

        if not pooling:
            pooling = get_pooling_mode(model_name)
        try:
            pooling = Pooling(pooling)
        except ValueError as exc:
            raise NotImplementedError(
                f"Pooling {pooling} unsupported, please pick one in"
                f" {[p.value for p in Pooling]}."
            ) from exc

        super().__init__(
            embed_batch_size=embed_batch_size,
            callback_manager=callback_manager,
            model_name=model_name,
            tokenizer_name=tokenizer_name,
            max_length=max_length,
            pooling=pooling,
            normalize=normalize,
            query_instruction=query_instruction,
            text_instruction=text_instruction,
        )

Ensure that the AutoModel initialization is correctly placed within the __init__ method and that the _model attribute is properly set. This should resolve the attribute error [1][2].
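As a quick sanity check (a minimal sketch; it assumes the default embedding model can be downloaded), you can confirm that _model is in fact assigned at construction time:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

emb = HuggingFaceEmbedding()  # loads the default model in __init__
assert hasattr(emb, "_model")  # passes: _model is assigned during __init__

If this assertion passes, the attribute is present right after initialization, which suggests it is being lost somewhere later in the pipeline rather than never being set.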

To continue talking to Dosu, mention @dosu.

theta-lin commented 4 months ago

@tituslhy Hi, I am also trying to use HuggingFaceEmbedding with DSPy. Unlike the cookbook, I simply used a vector retriever inside a DSPy module. Here is an even more minimal example that demonstrates the issue:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import dspy
from dspy.teleprompt import BootstrapFewShot

class Rag(dspy.Module):
    def __init__(self):
        super().__init__()
        reader = SimpleDirectoryReader(input_files=["paul_graham_essay.txt"])
        docs = reader.load_data()
        index = VectorStoreIndex.from_documents(docs)
        self.retriever = index.as_retriever()

    def forward(self, question):
        return dspy.Prediction(answer=str(self.retriever.retrieve(question)))

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5", trust_remote_code=True
)
Settings.llm = None  # the module only retrieves, so no LLM is needed

teleprompter = BootstrapFewShot()
train_examples = [
    dspy.Example(
        question="What did the author do growing up?",
        answer="The author wrote short stories and also worked on programming.",
    ).with_inputs("question"),
    dspy.Example(
        question="What did the author do during his time at YC?",
        answer="organizing a Summer Founders Program, funding startups, writing essays, working on a new version of Arc, creating Hacker News, and developing internal software for YC",
    ).with_inputs("question"),
]
teleprompter.compile(Rag(), trainset=train_examples)

Just download the dataset with

wget https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt -O paul_graham_essay.txt

and you should be able to run the example even without an LLM.

This gives the following output on my side:

2024-07-02T08:47:05.960695Z [error    ] Failed to run or to evaluate example Example({'question': 'What did the author do growing up?', 'answer': 'The author wrote short stories and also worked on programming.'}) (input_keys={'question'}) with None due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
2024-07-02T08:47:05.961326Z [error    ] Failed to run or to evaluate example Example({'question': 'What did the author do during his time at YC?', 'answer': 'organizing a Summer Founders Program, funding startups, writing essays, working on a new version of Arc, creating Hacker News, and developing internal software for YC'}) (input_keys={'question'}) with None due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211


While I haven't done more investigation yet, I think this should be related to another issue I opened some time ago #13956. On a related issue #14236, a developer mentioned that you cannot multiprocess with a local embedding model. Therefore, I suspect that maybe there is some kind of pickling (as multiprocessing is not used by DSPy, I think) in teleprompter.compile() that has something to do with this.