NeevrajKB closed this issue 4 months ago
The issue you're encountering stems from using AutoTokenizer.from_pretrained, which returns a tokenizer, not an embeddings model. Tokenizers prepare text for processing by models but don't generate embeddings themselves. To generate embeddings, you need to use a model designed for this purpose, such as those available from Hugging Face's transformers library.
Since you specifically want open-source models for embeddings, you can use models from the Hugging Face transformers library that are designed for this. Here's a simplified approach:
Choose an Embeddings Model: Select a model from Hugging Face that is suitable for generating embeddings, such as sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.
Load the Model and Tokenizer: Use the AutoModel and AutoTokenizer classes to load your model and tokenizer.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Your text here", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state.mean(dim=1)  # Example of obtaining embeddings (mean over token vectors)
This approach allows you to directly use models from Hugging Face for generating embeddings, bypassing the need for a method like get_text_embedding_batch, which is not a standard method for Hugging Face models or tokenizers.
Remember, the key is to use a model that is capable of generating embeddings, as opposed to a tokenizer, which is only responsible for preparing text inputs for a model.
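If you do want a batch helper in the spirit of get_text_embedding_batch, here is a minimal sketch; the embed_batch name is just illustrative, and the attention-mask-aware mean pooling is one common pooling choice rather than the only one:

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed_batch(texts):
    # Tokenize all texts together so they share one padded batch
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool over real tokens only, ignoring padding via the attention mask
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

vectors = embed_batch(["first sentence", "a second, longer sentence"])
print(vectors.shape)  # torch.Size([2, 384]) for this MiniLM model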
@alineberry you set the embedding model to be a tokenizer. Probably you should do this
pip install llama-index-embeddings-huggingface
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="<...>", embed_batch_size=10, device="<cpu or gpu>")
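To have indexes and agents actually pick that model up (instead of the tokenizer), you would also point Settings at it; a minimal sketch, assuming the llama-index 0.10+ package layout:

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    embed_batch_size=10,
    device="cpu",  # or "cuda" if a GPU is available
)
Settings.embed_model = embed_model  # used everywhere instead of a raw tokenizer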
@logan-markewich I think you tagged me on accident
Oops I did, whoops!
@NeevrajKB
😁 Man you really made my day 😂😂😂
Question
new_embeddings = embed_model.get_text_embedding_batch(
AttributeError: 'BertTokenizerFast' object has no attribute 'get_text_embedding_batch'
This happened because I am using `from llama_index.agent.openai import OpenAIAgent
agent = OpenAIAgent.from_tools(TOOLS, verbose=True)
agent.chat_repl()`
but with the embeddings model set to:
embed_model = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
Settings.embed_model = embed_model
I am doing this because I want to use open-source models (like Groq or Hugging Face models) for the agent, but I can't find a way to do so.
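For the LLM side of the agent, one possible direction (an assumption, not something confirmed in this thread) is to set an open model on Settings.llm and use a ReAct agent, since OpenAIAgent relies on OpenAI-style function calling. A minimal sketch, assuming llama-index-llms-groq is installed, GROQ_API_KEY is set in the environment, and the model id is only a placeholder:

from llama_index.core import Settings
from llama_index.core.agent import ReActAgent
from llama_index.llms.groq import Groq

Settings.llm = Groq(model="llama3-70b-8192")  # placeholder model id
# Settings.embed_model should be a real embedding model (e.g. HuggingFaceEmbedding above),
# not a tokenizer

agent = ReActAgent.from_tools(TOOLS, verbose=True)  # TOOLS as defined in the question
agent.chat_repl()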