To resolve the error "'HuggingFaceEmbedding' object has no attribute '_model'" when using the HuggingFaceEmbedding class in your DSPy training pipeline, ensure that the _model attribute is correctly initialized in the HuggingFaceEmbedding class. Here is the relevant part of the code that shows how _model should be initialized within the __init__ method:
class HuggingFaceEmbedding(BaseEmbedding):
    tokenizer_name: str = Field(description="Tokenizer name from HuggingFace.")
    max_length: int = Field(
        default=DEFAULT_HUGGINGFACE_LENGTH, description="Maximum length of input.", gt=0
    )
    pooling: Pooling = Field(default=None, description="Pooling strategy.")
    normalize: bool = Field(default=True, description="Normalize embeddings or not.")
    query_instruction: Optional[str] = Field(
        description="Instruction to prepend to query text."
    )
    text_instruction: Optional[str] = Field(
        description="Instruction to prepend to text."
    )
    cache_folder: Optional[str] = Field(
        description="Cache folder for huggingface files."
    )

    _model: Any = PrivateAttr()
    _tokenizer: Any = PrivateAttr()
    _device: str = PrivateAttr()

    def __init__(
        self,
        model_name: Optional[str] = None,
        tokenizer_name: Optional[str] = None,
        pooling: Optional[str] = None,
        max_length: Optional[int] = None,
        query_instruction: Optional[str] = None,
        text_instruction: Optional[str] = None,
        normalize: bool = True,
        model: Optional[Any] = None,
        tokenizer: Optional[Any] = None,
        embed_batch_size: int = DEFAULT_EMBED_BATCH_SIZE,
        cache_folder: Optional[str] = None,
        trust_remote_code: bool = False,
        device: Optional[str] = None,
        callback_manager: Optional[CallbackManager] = None,
    ):
        try:
            from transformers import AutoModel, AutoTokenizer
        except ImportError:
            raise ImportError(
                "HuggingFaceEmbedding requires transformers to be installed.\n"
                "Please install transformers with `pip install transformers`."
            )

        self._device = device or infer_torch_device()

        cache_folder = cache_folder or get_cache_dir()

        if model is None:  # Use model_name with AutoModel
            model_name = (
                model_name
                if model_name is not None
                else DEFAULT_HUGGINGFACE_EMBEDDING_MODEL
            )
            model = AutoModel.from_pretrained(
                model_name, cache_dir=cache_folder, trust_remote_code=trust_remote_code
            )
        elif model_name is None:  # Extract model_name from model
            model_name = model.name_or_path
        self._model = model.to(self._device)

        if tokenizer is None:  # Use tokenizer_name with AutoTokenizer
            tokenizer_name = (
                model_name or tokenizer_name or DEFAULT_HUGGINGFACE_EMBEDDING_MODEL
            )
            tokenizer = AutoTokenizer.from_pretrained(
                tokenizer_name, cache_dir=cache_folder
            )
        elif tokenizer_name is None:  # Extract tokenizer_name from tokenizer
            tokenizer_name = tokenizer.name_or_path
        self._tokenizer = tokenizer

        if max_length is None:
            try:
                max_length = int(self._model.config.max_position_embeddings)
            except AttributeError as exc:
                raise ValueError(
                    "Unable to find max_length from model config. Please specify max_length."
                ) from exc

        if not pooling:
            pooling = get_pooling_mode(model_name)
        try:
            pooling = Pooling(pooling)
        except ValueError as exc:
            raise NotImplementedError(
                f"Pooling {pooling} unsupported, please pick one in"
                f" {[p.value for p in Pooling]}."
            ) from exc

        super().__init__(
            embed_batch_size=embed_batch_size,
            callback_manager=callback_manager,
            model_name=model_name,
            tokenizer_name=tokenizer_name,
            max_length=max_length,
            pooling=pooling,
            normalize=normalize,
            query_instruction=query_instruction,
            text_instruction=text_instruction,
        )
Ensure that the AutoModel initialization is correctly placed within the __init__ method and that the _model attribute is properly set. This should resolve the attribute error [1][2].
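As a quick sanity check, you can confirm that the attribute is set right after construction. This is a minimal sketch, assuming llama-index-embeddings-huggingface and transformers are installed; the model name is just an example:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# __init__ should populate the private _model attribute before returning.
emb = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

print(hasattr(emb, "_model"))                      # expected: True
print(len(emb.get_text_embedding("hello world")))  # expected: 384 for bge-small-en-v1.5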
@tituslhy Hi, I am also trying to use HuggingFaceEmbedding with DSPy. Unlike the cookbook, I simply used a vector retriever inside a DSPy module. Here is an even more minimal example that demonstrates the issue:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import dspy
from dspy.teleprompt import BootstrapFewShot
class Rag(dspy.Module):
    def __init__(self):
        super().__init__()
        reader = SimpleDirectoryReader(input_files=["paul_graham_essay.txt"])
        docs = reader.load_data()
        index = VectorStoreIndex.from_documents(docs)
        self.retriever = index.as_retriever()

    def forward(self, question):
        return dspy.Prediction(answer=str(self.retriever.retrieve(question)))


Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5", trust_remote_code=True
)
Settings.llm = None

teleprompter = BootstrapFewShot()
train_examples = [
    dspy.Example(
        question="What did the author do growing up?",
        answer="The author wrote short stories and also worked on programming.",
    ).with_inputs("question"),
    dspy.Example(
        question="What did the author do during his time at YC?",
        answer="organizing a Summer Founders Program, funding startups, writing essays, working on a new version of Arc, creating Hacker News, and developing internal software for YC",
    ).with_inputs("question"),
]
teleprompter.compile(Rag(), trainset=train_examples)
Just download the dataset with
wget https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt -O paul_graham_essay.txt
and you should be able to run the example even without an LLM.
This gives the following output on my side:
2024-07-02T08:47:05.960695Z [error ] Failed to run or to evaluate example Example({'question': 'What did the author do growing up?', 'answer': 'The author wrote short stories and also worked on programming.'}) (input_keys={'question'}) with None due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
2024-07-02T08:47:05.961326Z [error ] Failed to run or to evaluate example Example({'question': 'What did the author do during his time at YC?', 'answer': 'organizing a Summer Founders Program, funding startups, writing essays, working on a new version of Arc, creating Hacker News, and developing internal software for YC'}) (input_keys={'question'}) with None due to 'HuggingFaceEmbedding' object has no attribute '_model'. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
While I haven't investigated further yet, I think this is related to another issue I opened some time ago, #13956. On a related issue, #14236, a developer mentioned that you cannot use multiprocessing with a local embedding model. Since DSPy does not use multiprocessing (as far as I know), I suspect that some kind of pickling inside teleprompter.compile() is what triggers this.
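One way to probe this hypothesis without DSPy at all is to deep-copy the embedding object and check whether the private attribute survives. This is just a sketch of the idea; I haven't confirmed that compile() actually copies or pickles the module internally:

import copy

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

emb = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
clone = copy.deepcopy(emb)  # stand-in for whatever copying/pickling compile() might do

print(hasattr(emb, "_model"))    # True on the original object
print(hasattr(clone, "_model"))  # if this prints False, the copy dropped the private
                                 # attribute, reproducing the AttributeError in the logs above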
Bug Description
I was following the cookbook (https://github.com/stanfordnlp/dspy/blob/main/examples/llamaindex/dspy_llamaindex_rag.ipynb) but changed the LLM and the embeddings model to non-OpenAI models. I faced an error when trying to compile my DSPy training pipeline with the HuggingFaceEmbedding class, but didn't face this error with any other embeddings model. This is the GitHub issue I opened on DSPy-AI: https://github.com/stanfordnlp/dspy/issues/1209
Version
0.10.50
Steps to Reproduce
I was following the cookbook (https://github.com/stanfordnlp/dspy/blob/main/examples/llamaindex/dspy_llamaindex_rag.ipynb) but changed the following variables: I specified the embeddings model to be a HuggingFace model. LlamaIndex uses OpenAI embeddings as its default embeddings model if this specification isn't made. I faced an error message when running the compile cell; the change was roughly along the lines of the snippet below.
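A rough sketch of the change (the original cell contents weren't preserved here; the model name is taken from the minimal example earlier in the thread, not necessarily the one originally used):

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Swap LlamaIndex's default (OpenAI) embedding model for a local HuggingFace one.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")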
Relevant Logs/Tracebacks