langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
88.44k stars 13.89k forks

Error when adding documents to vector_store - Azure AI Search #20283

Open adityakadrekar16 opened 2 months ago

adityakadrekar16 commented 2 months ago

Checked other resources

Example Code

```python
%pip install --upgrade --quiet azure-search-documents
%pip install --upgrade --quiet azure-identity

import os

from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings, OpenAIEmbeddings

# Option 2: use an Azure OpenAI account with a deployment of an embedding model
azure_endpoint: str = "PLACEHOLDER FOR YOUR AZURE OPENAI ENDPOINT"
azure_openai_api_key: str = "PLACEHOLDER FOR YOUR AZURE OPENAI KEY"
azure_openai_api_version: str = "2023-05-15"
azure_deployment: str = "text-embedding-ada-002"

vector_store_address: str = "YOUR_AZURE_SEARCH_ENDPOINT"
vector_store_password: str = "YOUR_AZURE_SEARCH_ADMIN_KEY"

# Option 2: Use AzureOpenAIEmbeddings with an Azure account
embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
    azure_deployment=azure_deployment,
    openai_api_version=azure_openai_api_version,
    azure_endpoint=azure_endpoint,
    api_key=azure_openai_api_key,
)

index_name: str = "langchain-vector-demo"
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.document_loaders import DirectoryLoader, PyPDFLoader

# Read the PDF file using the langchain loader
pdf_link = "test.pdf"
loader = PyPDFLoader(pdf_link, extract_images=False)
data = loader.load_and_split()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(data)

vector_store.add_documents(documents=docs)
```

Error Message and Stack Trace (if applicable)


```
AttributeError                            Traceback (most recent call last)
Cell In[12], line 2
      1 for i in range(0, len(docs)):
----> 2     vector_store.add_documents(documents=docs[i])
      3     time.sleep(5)

File ~/anaconda3/envs/rag_azure/lib/python3.10/site-packages/langchain_core/vectorstores.py:136, in VectorStore.add_documents(self, documents, **kwargs)
    127 """Run more documents through the embeddings and add to the vectorstore.
    128
    129 Args:
   (...)
    133     List[str]: List of IDs of the added texts.
    134 """
    135 # TODO: Handle the case where the user doesn't provide ids on the Collection
--> 136 texts = [doc.page_content for doc in documents]
    137 metadatas = [doc.metadata for doc in documents]
    138 return self.add_texts(texts, metadatas, **kwargs)

File ~/anaconda3/envs/rag_azure/lib/python3.10/site-packages/langchain_core/vectorstores.py:136, in <listcomp>(.0)
    135 # TODO: Handle the case where the user doesn't provide ids on the Collection
--> 136 texts = [doc.page_content for doc in documents]
    137 metadatas = [doc.metadata for doc in documents]
    138 return self.add_texts(texts, metadatas, **kwargs)

AttributeError: 'tuple' object has no attribute 'page_content'
```
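For context, a minimal sketch of the likely mechanism (an assumption on my part, not confirmed in this report): langchain's `Document` is a pydantic model, and iterating over a pydantic model yields `(field_name, value)` tuples. The traceback shows the loop passing a single `Document` (`docs[i]`) rather than a list, so the list comprehension at `vectorstores.py:136` iterates over those tuples. `StubDocument` and the local `add_documents` below are hypothetical stand-ins for the real classes:

```python
# Hypothetical stand-in for langchain's pydantic-based Document: iterating it
# yields (field_name, value) tuples, like pydantic BaseModel.__iter__.
class StubDocument:
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata

    def __iter__(self):
        # mirrors pydantic's behavior of iterating (field, value) pairs
        return iter(vars(self).items())


def add_documents(documents):
    # mirrors the failing line in langchain_core/vectorstores.py:136
    return [doc.page_content for doc in documents]


doc = StubDocument("hello", {"page": 0})

# Passing a single Document iterates its (field, value) tuples and fails:
try:
    add_documents(documents=doc)
except AttributeError as err:
    print(err)  # 'tuple' object has no attribute 'page_content'

# Wrapping it in a list works as expected:
print(add_documents(documents=[doc]))  # ['hello']
```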

Description

I am using langchain to connect to Azure AI Search, create vector stores, and add documents to them so I can build a RAG application. I tried to replicate the notebook provided by LangChain for Azure AI Search (https://python.langchain.com/docs/integrations/vectorstores/azuresearch/), but it fails with the error above.

I do see `page_content` in `docs`, so I am not sure where the problem is. `type(docs[0])` returns `langchain_core.documents.base.Document`.

Here is an example of what one element of `docs` looks like:

```
print(docs[5])
Document(page_content='Modify the likelihood of specified tokens appearing in the completion. Accepts a json object that maps tokens (specified by their token ID in the GPT tokenizer) to an associated bias value from -100 to 100. You can use this tokenizer tool (which works for both GPT-2 and GPT-3) to convert text to token IDs. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect varies per model, but values between -1 and 1 should decrease or increase likelihood of selection', metadata={'source': 'test.pdf', 'page': 3})
```
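Note that `add_documents` expects a *list* of `Document`s, while the loop in the traceback passes `docs[i]`, a single `Document`. A sketch of the likely fix, using a minimal hypothetical stand-in for the vector store (in the real code this would be `vector_store.add_documents`):

```python
# FakeStore and Doc are illustrative stand-ins, not langchain classes.
class FakeStore:
    def __init__(self):
        self.texts = []

    def add_documents(self, documents):
        # like the real method, iterates `documents` and reads .page_content
        self.texts.extend(d.page_content for d in documents)
        return list(range(len(self.texts)))


class Doc:
    def __init__(self, page_content):
        self.page_content = page_content


docs = [Doc("chunk one"), Doc("chunk two")]

# Fix 1: pass the whole list in one call
store = FakeStore()
store.add_documents(documents=docs)

# Fix 2: when adding one at a time (e.g. to throttle requests),
# wrap each document in a single-element list
store2 = FakeStore()
for d in docs:
    store2.add_documents(documents=[d])
```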

System Info

```
platform: mac
python: 3.10

langchain==0.1.15
langchain-community==0.0.32
langchain-core==0.1.41
langchain-openai==0.0.2.post1
langchain-text-splitters==0.0.1
```

adityakadrekar16 commented 2 months ago

Hello, can someone help me here?