langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

how to increase the response size of chromadb #14880

Closed: pmdkr closed this issue 7 months ago

pmdkr commented 10 months ago

I am working on a project where I have to use multiple PDF docs to answer user queries.

I have a load method that loads the PDFs from a directory:

from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def loadFiles():
    # Load every PDF in the upload folder and split it into chunks
    loader = DirectoryLoader('./static/upload/', glob="./*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
    texts = text_splitter.split_documents(documents)
    return texts

I am creating the Chroma DB with the code below:

from langchain.vectorstores import Chroma

def createDb(load, embeddings, persist_directory):
    max_input_size = 3000
    num_output = 256
    chunk_size_limit = 1000  # token window size per document
    max_chunk_overlap = 80   # overlap for each token fragment

    vectordb = Chroma.from_documents(documents=load, embedding=embeddings, persist_directory=persist_directory)
    vectordb.persist()
    return vectordb

Now I am querying Chroma:

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0, model_name="text-davinci-003"),
    retriever=vectordb.as_retriever(),
    chain_type="stuff",
    chain_type_kwargs=chain_type_kwargs,
    return_source_documents=True,
)

However, I am getting a response, but in some cases it is not the full response. My source PDF has the following contents: [screenshot: source file]

while my response shows only some parts, as shown below: [screenshot: chromadb response]

I tried increasing the chunk_overlap size as shown in createDb(), but it did not work. I expect the full response from Chroma, and the response should come from the given PDF.

I am new to this; any help would be appreciated.

dosubot[bot] commented 10 months ago

🤖

Hello @pmdkr! I'm Dosu, a friendly bot here to assist you while we await a human maintainer. I can help with solving bugs, answering queries or guiding you to become a contributor to this project. Don't hesitate to ask anything you need!

Based on the information provided, it seems like the issue you're facing might be related to how the text is being split into chunks by the RecursiveCharacterTextSplitter and how these chunks are being processed by ChromaDB.

The chunk_size parameter in the RecursiveCharacterTextSplitter function determines the maximum size of the chunks that the text is split into. If a smaller chunk_size is used, the text will be split into more, smaller chunks. Conversely, if a larger chunk_size is used, the text will be split into fewer, larger chunks. This directly affects the size of the response.
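
For illustration only (this snippet is not from the issue, and the sample text and values are made up), here is a minimal sketch showing how chunk_size changes the number and size of the chunks the splitter produces:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hypothetical sample text, only to demonstrate the effect of chunk_size
text = "LangChain splits long documents into smaller chunks. " * 200

small_chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_text(text)
large_chunks = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=50).split_text(text)

print(len(small_chunks), max(len(c) for c in small_chunks))  # more, smaller chunks
print(len(large_chunks), max(len(c) for c in large_chunks))  # fewer, larger chunks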

Here is the relevant code from the RecursiveCharacterTextSplitter function:

for d in splits:
    _len = self._length_function(d)
    if (
        total + _len + (separator_len if len(current_doc) > 0 else 0)
        > self._chunk_size
    ):
        if total > self._chunk_size:
            logger.warning(
                f"Created a chunk of size {total}, "
                f"which is longer than the specified {self._chunk_size}"
            )
        if len(current_doc) > 0:
            doc = self._join_docs(current_doc, separator)
            if doc is not None:
                docs.append(doc)
            # Keep on popping if:
            # - we have a larger chunk than in the chunk overlap
            # - or if we still have any chunks and the length is long
            while total > self._chunk_overlap or (
                total + _len + (separator_len if len(current_doc) > 0 else 0)
                > self._chunk_size
                and total > 0
            ):
                total -= self._length_function(current_doc[0]) + (
                    separator_len if len(current_doc) > 1 else 0
                )
                current_doc = current_doc[1:]
    current_doc.append(d)
    total += _len + (separator_len if len(current_doc) > 1 else 0)
doc = self._join_docs(current_doc, separator)
if doc is not None:
    docs.append(doc)
return docs

(Source: libs/langchain/langchain/text_splitter.py)

As for the chunk_size_limit and max_input_size parameters in the createDb function, I wasn't able to find specific information about their roles within the LangChain repository. However, they might be related to the maximum size of the chunks that ChromaDB can process and the maximum size of the input that it can handle, respectively.

If increasing the chunk_overlap size didn't solve the issue, you might want to try adjusting the chunk_size in the RecursiveCharacterTextSplitter function or the chunk_size_limit and max_input_size in the createDb function to see if that helps.
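
As a rough sketch of that suggestion (the 2000/100 values are placeholders to experiment with, not recommended settings), the splitter parameters from the question could be exposed so they are easy to tune:

from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def loadFiles(chunk_size=2000, chunk_overlap=100):
    # Same loader as in the question; chunk_size and chunk_overlap are
    # parameters (placeholder defaults) so they can be tuned without
    # editing the function body
    loader = DirectoryLoader('./static/upload/', glob="./*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return text_splitter.split_documents(documents)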

Please note that these are just suggestions based on the information available and might not solve the issue. If you have more information or if there are other parameters or functions that you think might be related to the issue, please let me know.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

pmdkr commented 10 months ago

@dosu-bot, I have increased the chunk_size in the RecursiveCharacterTextSplitter function and the chunk_size_limit and max_input_size in the createDb function, but it didn't work.

dosubot[bot] commented 10 months ago

🤖

Based on the context provided, it seems like the issue you're facing might be due to the limitations of the PDF parsing libraries used in the LangChain framework. These libraries include PyPDF, PDFMiner, PyMuPDF, PyPDFium2, PDFPlumber, and AmazonTextractPDF. Each of these libraries has its own limitations and may not be able to extract all the text from a PDF document, especially if the document contains complex layouts, non-standard fonts, or images with embedded text.

Here are a few suggestions to resolve this issue:

  1. Use a different PDF parsing library: If you're currently using one of the PDF parsing libraries and it's not giving you the full response, you can try using a different one. Each library has its own strengths and weaknesses, and a different library might be able to extract the text that the current one is missing.

  2. Extract text from images: If your PDF document contains images with embedded text, you can use the extract_images parameter in the PDF parsing classes to extract text from these images using RapidOCR (a minimal sketch follows after this list).

  3. Use Amazon Textract: If your PDF documents are stored on Amazon S3, you can use the AmazonTextractPDFParser class to extract text from your documents using Amazon Textract. Amazon Textract is a service that automatically extracts text and data from scanned documents, and it might be able to extract the text that the other libraries are missing.
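
For suggestion 2, a minimal sketch (the file name is a placeholder, and this assumes the installed PyPDFLoader version supports the extract_images flag and that the rapidocr-onnxruntime package is available):

from langchain_community.document_loaders import PyPDFLoader

# extract_images=True runs OCR (RapidOCR) over images embedded in the PDF,
# so text inside scanned pages or figures can also be extracted
loader = PyPDFLoader("./static/upload/your-document.pdf", extract_images=True)
documents = loader.load()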

Here's an example of how you can use the AmazonTextractPDFLoader class:

from langchain_community.document_loaders import AmazonTextractPDFLoader

# Initialize the loader with the path to your PDF document on S3
loader = AmazonTextractPDFLoader("s3://your-bucket/your-document.pdf")

# Load the documents
documents = loader.load()

# Print the text of each document
for document in documents:
    print(document.page_content)

Please note that to use Amazon Textract, you need to have an AWS account and your AWS credentials need to be configured on your machine.

You can find more information about the PDF parsing libraries and their limitations in the PDF parsers test cases in the LangChain repository.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.
