Closed: daje0601 closed this issue 3 months ago
🤖
Hello @daje0601! 🚀
I'm Dosu, a friendly bot here to lend a hand while we wait for a human maintainer to arrive. As a bot, I can help you troubleshoot bugs, answer your queries, and even guide you on how to contribute to this project. Let's make your LangChain experience as smooth as possible together! 💻
Don't worry, we'll get to the bottom of your issue in no time!
The error you're encountering is due to the tokenizer used in the Embeddings object not having a padding token set. This is necessary for the Embeddings object to correctly process the input texts.
To resolve this issue, you need to ensure that the tokenizer used in your Embeddings object has a padding token. If you're using a tokenizer from Hugging Face's Transformers library, you can set the padding token like this:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.pad_token = tokenizer.eos_token  # reuse the EOS token as the padding token
Then, you can use this tokenizer in your Embeddings object:
from langchain_community.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(tokenizer=tokenizer)
Now, you should be able to use the FAISS.from_documents function without encountering the ValueError.
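For reference, here is a minimal, hypothetical sketch of calling FAISS.from_documents once the padding token is in place; the import paths assume langchain 0.1.x, the file name is illustrative, and `embeddings` stands for whatever Embeddings object you configured above:

```python
# Minimal sketch (assumed 0.1.x import paths, hypothetical input file).
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("my_docs.txt", encoding="UTF-8")  # hypothetical file
documents = RecursiveCharacterTextSplitter().split_documents(loader.load())

vector = FAISS.from_documents(documents, embeddings)  # embeddings: the object configured above
results = vector.similarity_search("example query", k=2)
```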
The FAISS.from_embeddings function in the LangChain framework is a class method that constructs a FAISS (Facebook AI Similarity Search) wrapper from raw documents. It takes an iterable of tuples, where each tuple contains a text and its corresponding embedding vector, and an Embeddings object. It also optionally takes metadatas and ids for the documents.
The function is used to initialize a FAISS object with pre-computed embeddings of the documents. This is useful when you want to perform similarity searches on the documents using the FAISS library, which is designed for efficient similarity search and clustering of dense vectors.
Here is the relevant code:
@classmethod
def from_embeddings(
    cls,
    text_embeddings: Iterable[Tuple[str, List[float]]],
    embedding: Embeddings,
    metadatas: Optional[Iterable[dict]] = None,
    ids: Optional[List[str]] = None,
    **kwargs: Any,
) -> FAISS:
    """Construct FAISS wrapper from raw documents."""
    texts = [t[0] for t in text_embeddings]
    embeddings = [t[1] for t in text_embeddings]
    return cls.__from(
        texts,
        embeddings,
        embedding,
        metadatas=metadatas,
        ids=ids,
        **kwargs,
    )
In the example provided in the code, the function is used as follows:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
text_embeddings = embeddings.embed_documents(texts)
text_embedding_pairs = zip(texts, text_embeddings)
faiss = FAISS.from_embeddings(text_embedding_pairs, embeddings)
In this example, the OpenAIEmbeddings object is used to compute the embeddings of the texts, and then the from_embeddings function is used to create a FAISS object from these embeddings.
I hope this helps! If you have any other questions, feel free to ask.
Best, Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
First, I use HuggingFaceEmbeddings, which does not accept a tokenizer kwarg. Second, I tried tokenizer.pad_token = tokenizer.eos_token, but it does not work.
First, I use HuggingFaceEmbeddings, which does not use a tokenizer kwarg. Second, I try tokenizer.pad_token = tokenizer.eos_token, but it does not work.
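One possible workaround here, sketched purely as an assumption rather than a verified fix: HuggingFaceEmbeddings keeps its SentenceTransformer model on the client attribute (the same object that shows up in the traceback below via self.client.encode), and recent sentence-transformers versions expose the underlying Hugging Face tokenizer as a tokenizer property, so the pad token can be set there before building the index.

```python
# Hypothetical workaround (not verified): set the pad token on the tokenizer
# that HuggingFaceEmbeddings actually uses internally.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name=model_name)  # model_name as in your script
inner_tok = embeddings.client.tokenizer  # assumes SentenceTransformer exposes `.tokenizer`
if inner_tok.pad_token is None:
    inner_tok.pad_token = inner_tok.eos_token  # reuse an existing special token as pad

vector = FAISS.from_documents(documents, embeddings)  # documents as in your script
```

If the underlying tokenizer has no eos_token either, a new pad token would have to be added instead, as the error message itself suggests.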
Same problem:
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, cache_dir='C:/Users/Timmek/Documents/model', torch_dtype=torch.bfloat16, quantization_config=bnb_config)#.to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, cache_dir='C:/Users/Timmek/Documents/model', torch_dtype=torch.bfloat16)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=192)
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"pad_token": 0})

# Load the data and create the vector store
#loader = TextLoader("aniworld_gpt.txt", encoding="UTF-8")
loader = TextLoader("aniworld_gpt3.txt", encoding='UTF-8')
docs = loader.load()

embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'}
)

text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
vector = FAISS.from_documents(documents, embeddings)
Traceback (most recent call last):
  File "C:\Users\Timmek\Documents\NAI_bot\tg_tulpa_4_2_stability.py", line 194, in handle_message
    response = generate_response_with_model(chat_id, message_text, chat_message_counters[chat_id], message.from_user.first_name, False)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Timmek\Documents\NAI_bot\tg_tulpa_4_2_stability.py", line 88, in generate_response_with_model
    vector = FAISS.from_documents(documents, embeddings)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain_core\vectorstores.py", line 528, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain_community\vectorstores\faiss.py", line 930, in from_texts
    embeddings = embedding.embed_documents(texts)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain_community\embeddings\huggingface.py", line 93, in embed_documents
    embeddings = self.client.encode(
                 ^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 345, in encode
    features = self.tokenize(sentences_batch)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 553, in tokenize
    return self._first_module().tokenize(texts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\sentence_transformers\models\Transformer.py", line 146, in tokenize
    self.tokenizer(
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\tokenization_utils_base.py", line 2829, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\tokenization_utils_base.py", line 2915, in _call_one
    return self.batch_encode_plus(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\tokenization_utils_base.py", line 3097, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\tokenization_utils_base.py", line 2734, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
Checked other resources
Example Code
Description
I used this code and ran into this error. Why does it occur? I have looked everywhere and can't fix it, so sad ㅠㅠㅠ. Could you explain how to fix it?
System Info
langchain 0.1.0
langchain-community 0.0.12
langchain-core 0.1.10
langsmith 0.0.80
Related Components