Closed: pmdkr closed this issue 7 months ago
🤖
Hello @pmdkr! I'm Dosu, a friendly bot here to assist you while we await a human maintainer. I can help with solving bugs, answering queries or guiding you to become a contributor to this project. Don't hesitate to ask anything you need!
Based on the information provided, it seems like the issue you're facing might be related to how the text is being split into chunks by the `RecursiveCharacterTextSplitter` and how those chunks are then processed by ChromaDB.

The `chunk_size` parameter of the `RecursiveCharacterTextSplitter` determines the maximum size of the chunks the text is split into. With a smaller `chunk_size`, the text is split into more, smaller chunks; with a larger `chunk_size`, into fewer, larger chunks. This directly affects the size of the response.
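As a quick illustration (the sample text and the sizes below are arbitrary, purely to show the effect):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "LangChain splits long documents into smaller chunks for retrieval. " * 50

for size in (200, 500, 1000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=50)
    chunks = splitter.split_text(text)
    print(f"chunk_size={size}: {len(chunks)} chunks, "
          f"longest chunk {max(len(c) for c in chunks)} chars")
```

Smaller `chunk_size` values produce more, shorter chunks; at query time only the top-ranked chunks are returned, which is one common reason a retrieved answer looks incomplete.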
Here is the relevant chunk-merging logic from the text splitter (the `_merge_splits` method that `RecursiveCharacterTextSplitter` inherits from `TextSplitter`):
```python
def _merge_splits(self, splits: Iterable[str], separator: str) -> List[str]:
    # Merge the small splits back into chunks of at most self._chunk_size,
    # keeping up to self._chunk_overlap characters of overlap between chunks.
    separator_len = self._length_function(separator)

    docs = []
    current_doc: List[str] = []
    total = 0
    for d in splits:
        _len = self._length_function(d)
        if (
            total + _len + (separator_len if len(current_doc) > 0 else 0)
            > self._chunk_size
        ):
            if total > self._chunk_size:
                logger.warning(
                    f"Created a chunk of size {total}, "
                    f"which is longer than the specified {self._chunk_size}"
                )
            if len(current_doc) > 0:
                doc = self._join_docs(current_doc, separator)
                if doc is not None:
                    docs.append(doc)
                # Keep on popping if:
                # - we have a larger chunk than in the chunk overlap
                # - or if we still have any chunks and the length is long
                while total > self._chunk_overlap or (
                    total + _len + (separator_len if len(current_doc) > 0 else 0)
                    > self._chunk_size
                    and total > 0
                ):
                    total -= self._length_function(current_doc[0]) + (
                        separator_len if len(current_doc) > 1 else 0
                    )
                    current_doc = current_doc[1:]
        current_doc.append(d)
        total += _len + (separator_len if len(current_doc) > 1 else 0)
    doc = self._join_docs(current_doc, separator)
    if doc is not None:
        docs.append(doc)
    return docs
```
(Source: libs/langchain/langchain/text_splitter.py)
As for the `chunk_size_limit` and `max_input_size` parameters in the `createDb` function, I wasn't able to find specific information about their roles within the LangChain repository. However, they might relate to the maximum size of the chunks that ChromaDB can process and the maximum size of the input it can handle, respectively.
If increasing the `chunk_overlap` size didn't solve the issue, you might want to try adjusting the `chunk_size` in the `RecursiveCharacterTextSplitter`, or the `chunk_size_limit` and `max_input_size` in the `createDb` function, to see if that helps.
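Since I can't see your `createDb` implementation, here is a minimal sketch of how such a function is often wired up, so the relevant knobs are visible in one place. The function body, the parameter defaults, and the choice of `OpenAIEmbeddings` are assumptions on my part, not your actual code:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def create_db(documents, chunk_size=1000, chunk_overlap=200):
    # Split the loaded documents into overlapping chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    chunks = splitter.split_documents(documents)

    # Embed the chunks and persist them in a Chroma collection
    return Chroma.from_documents(
        chunks,
        embedding=OpenAIEmbeddings(),
        persist_directory="./chroma_db",
    )

# Hypothetical usage:
# db = create_db(loader.load())
# results = db.similarity_search("your question", k=4)
```

One thing worth checking alongside the chunk settings: `similarity_search` returns only the top-`k` chunks (4 by default), so if the answer spans more text than that, raising `k` can matter as much as `chunk_size`.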
Please note that these are just suggestions based on the information available and might not solve the issue. If you have more information or if there are other parameters or functions that you think might be related to the issue, please let me know.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@dosu-bot, I have increased the chunk_size in the RecursiveCharacterTextSplitter function and the chunk_size_limit and max_input_size in the createDb function, but it didn't work.
🤖
Based on the context provided, it seems like the issue you're facing might be due to the limitations of the PDF parsing libraries used in the LangChain framework. These libraries include PyPDF, PDFMiner, PyMuPDF, PyPDFium2, PDFPlumber, and AmazonTextractPDF. Each of these libraries has its own limitations and may not be able to extract all the text from a PDF document, especially if the document contains complex layouts, non-standard fonts, or images with embedded text.
Here are a few suggestions to resolve this issue:
1. Use a different PDF parsing library: If you're currently using one of these libraries and it isn't giving you the full text, try a different one. Each has its own strengths and weaknesses, and another library might extract the text the current one is missing.
2. Extract text from images: If your PDF document contains images with embedded text, you can use the `extract_images` parameter in the PDF parsing classes to extract text from those images using RapidOCR (see the sketch after this list).
3. Use Amazon Textract: If your PDF documents are stored on Amazon S3, you can use the `AmazonTextractPDFParser` class to extract text from them with Amazon Textract, a service that automatically extracts text and data from scanned documents; it might capture text the other libraries miss.
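For suggestion 2, here is a minimal sketch using `PyPDFLoader` (the file path is a placeholder; `extract_images=True` also requires the `rapidocr-onnxruntime` package to be installed):

```python
from langchain_community.document_loaders import PyPDFLoader

# extract_images=True runs RapidOCR over images embedded in the PDF,
# so text inside scanned pages or figures ends up in page_content too.
loader = PyPDFLoader("your-document.pdf", extract_images=True)
documents = loader.load()
```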
Here's an example of how you can use the `AmazonTextractPDFLoader` class (which uses `AmazonTextractPDFParser` under the hood):
```python
from langchain_community.document_loaders import AmazonTextractPDFLoader

# Initialize the loader with the path to your PDF document on S3
loader = AmazonTextractPDFLoader("s3://your-bucket/your-document.pdf")

# Load the documents
documents = loader.load()

# Print the text of each document
for document in documents:
    print(document.page_content)
```
Please note that to use Amazon Textract, you need to have an AWS account and your AWS credentials need to be configured on your machine.
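For example, credentials can be supplied via the standard AWS environment variables (the values below are placeholders; `aws configure` or an IAM role works just as well):

```python
import os

# Placeholder credentials -- substitute your own, or rely on
# ~/.aws/credentials / IAM roles instead of environment variables.
os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
```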
You can find more information about the PDF parsing libraries and their limitations in the PDF parsers test cases in the LangChain repository.
I hope this helps! If you have any other questions or need further clarification, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
I am working on a project where I have to use multiple PDF documents to answer user queries.
I have a load method to load the PDFs from a directory.
I am creating the ChromaDB with the code below:
Now I am querying ChromaDB.
However, I am getting a response, but not the full response in some cases. My source PDF has the following contents (attachment: source file), while my response shows only some parts (attachment: chromadb response).
I tried increasing the chunk_overlap size, as shown in createdb(), but it did not work. I expect ChromaDB to return the full response, and the response should come from the given PDF.
I am new to this; any help would be appreciated.