langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Using FAISS.from_documents requires tokenizer pad_token #16036

Closed daje0601 closed 3 months ago

daje0601 commented 8 months ago

Checked other resources

Example Code

from langchain.embeddings import CacheBackedEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="mistralai/Mistral-7B-v0.1",     # Pre-trained model name or local path
)
cache_embeddings = CacheBackedEmbeddings.from_bytes_store(embeddings, cached_dir)
vectorstore = FAISS.from_documents(docs, cache_embeddings)

I'm using the code above and hit the error below. Why does this happen? I've looked everywhere and can't fix it, so sad ㅠㅠㅠ. Could you explain how to fix it?

ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 vectorstore = FAISS.from_documents(docs, cache_embeddings)
      3 retriver = vectorstore.as_retriever(reduce_k_below_max_tokens=True)
      4 # retriver = vectorstore.as_retriever()

File ~/anaconda3/envs/rag/lib/python3.11/site-packages/langchain_core/vectorstores.py:508, in VectorStore.from_documents(cls, documents, embedding, **kwargs)
    506 texts = [d.page_content for d in documents]
    507 metadatas = [d.metadata for d in documents]
--> 508 return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)

File ~/anaconda3/envs/rag/lib/python3.11/site-packages/langchain_community/vectorstores/faiss.py:913, in FAISS.from_texts(cls, texts, embedding, metadatas, ids, **kwargs)
    886 @classmethod
    887 def from_texts(
    888     cls,
   (...)
    893     **kwargs: Any,
    894 ) -> FAISS:
    895     """Construct FAISS wrapper from raw documents.
    896 
    897     This is a user friendly interface that:
   (...)
    911             faiss = FAISS.from_texts(texts, embeddings)
    912     """
...

File ~/anaconda3/envs/rag/lib/python3.11/site-packages/transformers/tokenization_utils_base.py
   (...)
   2719     and (max_length % pad_to_multiple_of != 0)
   2720 ):

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`

Description

Same as the example above: calling FAISS.from_documents with HuggingFaceEmbeddings raises the ValueError about a missing padding token. I've looked everywhere and can't fix it. Could you explain how to fix it?

System Info

langchain 0.1.0
langchain-community 0.0.12
langchain-core 0.1.10
langsmith 0.0.80

Related Components

dosubot[bot] commented 8 months ago

🤖

Hello @daje0601! 🚀

I'm Dosu, a friendly bot here to lend a hand while we wait for a human maintainer to arrive. As a bot, I can help you troubleshoot bugs, answer your queries, and even guide you on how to contribute to this project. Let's make your LangChain experience as smooth as possible together! 💻

Don't worry, we'll get to the bottom of your issue in no time!

The error you're encountering is due to the tokenizer used in the Embeddings object not having a padding token set. This is necessary for the Embeddings object to correctly process the input texts.

To resolve this issue, you need to ensure that the tokenizer used in your Embeddings object has a padding token. If you're using a tokenizer from the Hugging Face's Transformers library, you can set the padding token like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer.pad_token = tokenizer.eos_token

Then, you can use this tokenizer in your Embeddings object:

from langchain_community.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(tokenizer=tokenizer)

Now, you should be able to use the FAISS.from_documents function without encountering the ValueError.

The FAISS.from_documents function in the LangChain framework is a class method that constructs a FAISS (Facebook AI Similarity Search) wrapper from raw documents. It takes an iterable of tuples, where each tuple contains a text and its corresponding embedding vector, and an Embeddings object. It also optionally takes metadata and ids for the documents.

The function is used to initialize a FAISS object with pre-computed embeddings of the documents. This is useful when you want to perform similarity searches on the documents using the FAISS library, which is designed for efficient similarity search and clustering of dense vectors.

Here is the relevant code:

@classmethod
def from_embeddings(
    cls,
    text_embeddings: Iterable[Tuple[str, List[float]]],
    embedding: Embeddings,
    metadatas: Optional[Iterable[dict]] = None,
    ids: Optional[List[str]] = None,
    **kwargs: Any,
) -> FAISS:
    """Construct FAISS wrapper from raw documents."""
    texts = [t[0] for t in text_embeddings]
    embeddings = [t[1] for t in text_embeddings]
    return cls.__from(
        texts,
        embeddings,
        embedding,
        metadatas=metadatas,
        ids=ids,
        **kwargs,
    )

In the example provided in the code, the function is used as follows:

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
text_embeddings = embeddings.embed_documents(texts)
text_embedding_pairs = zip(texts, text_embeddings)
faiss = FAISS.from_embeddings(text_embedding_pairs, embeddings)

In this example, the OpenAIEmbeddings object is used to compute the embeddings of the texts, and then the from_embeddings function is used to create a FAISS object from these embeddings.

I hope this helps! If you have any other questions, feel free to ask.

Best, Dosu


daje0601 commented 8 months ago

First, I'm using HuggingFaceEmbeddings, which doesn't take a tokenizer kwarg. Second, I tried tokenizer.pad_token = tokenizer.eos_token, but it doesn't work.
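One possible workaround, not verified against this exact setup: HuggingFaceEmbeddings wraps a sentence-transformers model, so the pad token has to be set on that wrapped model's tokenizer rather than on a separately loaded tokenizer. A minimal sketch, assuming langchain_community's HuggingFaceEmbeddings exposes the loaded SentenceTransformer as .client:

from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="mistralai/Mistral-7B-v0.1")

# The wrapped SentenceTransformer keeps its own tokenizer; decoder-only
# checkpoints like Mistral ship without a pad_token, so reuse EOS for padding.
st_model = embeddings.client  # assumption: .client is the SentenceTransformer instance
if st_model.tokenizer.pad_token is None:
    st_model.tokenizer.pad_token = st_model.tokenizer.eos_token

With the pad token set, FAISS.from_documents(docs, embeddings) should no longer hit the padding error, though Mistral-7B-v0.1 is a causal LM checkpoint and a dedicated sentence-embedding model may give better retrieval quality anyway.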

TimmekHW commented 6 months ago

> First, I use HuggingFaceEmbeddings, which does not take a tokenizer kwarg. Second, I tried tokenizer.pad_token = tokenizer.eos_token, but it does not work.

Same problem.

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, cache_dir='C:/Users/Timmek/Documents/model', torch_dtype=torch.bfloat16, quantization_config=bnb_config)  # .to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, cache_dir='C:/Users/Timmek/Documents/model', torch_dtype=torch.bfloat16)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=192)
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"pad_token": 0})

# Load the data and build the vector store
# loader = TextLoader("aniworld_gpt.txt", encoding="UTF-8")
loader = TextLoader("aniworld_gpt3.txt", encoding='UTF-8')
docs = loader.load()
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'}
)
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
vector = FAISS.from_documents(documents, embeddings)

Traceback (most recent call last):
  File "C:\Users\Timmek\Documents\NAI_bot\tg_tulpa_4_2_stability.py", line 194, in handle_message
    response = generate_response_with_model(chat_id, message_text, chat_message_counters[chat_id], message.from_user.first_name, False)
  File "C:\Users\Timmek\Documents\NAI_bot\tg_tulpa_4_2_stability.py", line 88, in generate_response_with_model
    vector = FAISS.from_documents(documents, embeddings)
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain_core\vectorstores.py", line 528, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain_community\vectorstores\faiss.py", line 930, in from_texts
    embeddings = embedding.embed_documents(texts)
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\langchain_community\embeddings\huggingface.py", line 93, in embed_documents
    embeddings = self.client.encode(
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 345, in encode
    features = self.tokenize(sentences_batch)
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 553, in tokenize
    return self._first_module().tokenize(texts)
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\sentence_transformers\models\Transformer.py", line 146, in tokenize
    self.tokenizer(
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\tokenization_utils_base.py", line 2829, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\tokenization_utils_base.py", line 2915, in _call_one
    return self.batch_encode_plus(
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\tokenization_utils_base.py", line 3097, in batch_encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File "c:\Users\Timmek\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\tokenization_utils_base.py", line 2734, in _get_padding_truncation_strategies
    raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
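For reference, the padding failure can be reproduced outside LangChain entirely; a quick sanity check, as a sketch assuming the same Mistral checkpoint as in the original report:

from transformers import AutoTokenizer

# Decoder-only checkpoints such as Mistral ship without a pad_token,
# so any batched encode with padding=True fails the same way.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(tok.pad_token)  # None
tok(["short text", "a somewhat longer text"], padding=True)  # raises the ValueError above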