Closed BadrinathMJ closed 1 year ago
Hello @BadrinathMJ,
I'm here to assist you with your issue. I'm an AI developed to help solve bugs, answer questions, and guide you in becoming a contributor while you're waiting for a human maintainer.
From the traceback you provided, it seems like the issue is occurring because the PDFReader class is trying to access a page label for a page that doesn't exist in the PDF document. This is likely the cause of the "IndexError: list index out of range" error you're encountering.
Here's a potential solution: you could modify the code to check if a page label exists for a page before trying to access it. Here's an example of how you could do this:
for page in range(num_pages):
    # Extract the text from the page
    page_text = pdf.pages[page].extract_text()
    # Check if a page label exists for the page
    if page < len(pdf.page_labels):
        page_label = pdf.page_labels[page]
    else:
        page_label = None
    metadata = {"page_label": page_label, "file_name": file.name}
    if extra_info is not None:
        metadata.update(extra_info)
    docs.append(Document(text=page_text, metadata=metadata))
This version checks whether a page label exists for a page before accessing it. If no label exists for that page, the page label is set to None.
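The same guard can be tried in isolation, without a PDF. The following is only a minimal sketch: `safe_page_label` is a hypothetical helper name, not part of LlamaIndex or pypdf, and the list of labels is stand-in data.

```python
from typing import Optional, Sequence


def safe_page_label(labels: Sequence[str], index: int) -> Optional[str]:
    """Return labels[index], or None when the index is out of range.

    Hypothetical helper for illustration; not a LlamaIndex or pypdf API.
    """
    if 0 <= index < len(labels):
        return labels[index]
    return None


# Stand-in data: three pages, but labels for only the first two.
labels = ["i", "ii"]
page_labels = [safe_page_label(labels, i) for i in range(3)]
print(page_labels)
```

Note that this only helps if the label list itself can be built. If the error is raised while the labels are being computed, the lookup guard alone will not prevent it.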
As for your question about changes in the SimpleDirectoryReader or PDFReader class in the LlamaIndex codebase between version 0.8.50 and the latest version 0.8.53.post3, I'm unable to find any changes that could potentially resolve this issue.
I hope this helps! If you have any other questions or need further clarification, feel free to ask.
Best, Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Seems like there's something special about your PDF and it's breaking inside of pypdf.
Maybe try a different pdf loader (e.g. PyMuPDFReader). You can find more at https://llamahub.ai/
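One way to act on this suggestion is to try the default reader first and fall back to another one when it fails. The sketch below is generic and assumes nothing about the real readers: `load_with_fallback`, `broken_loader`, and `working_loader` are hypothetical names, and the two loader functions are stand-ins for whatever PDF readers you actually use.

```python
from typing import Callable, List, Sequence, Tuple

Loader = Callable[[str], List[str]]


def load_with_fallback(
    path: str, loaders: Sequence[Tuple[str, Loader]]
) -> Tuple[str, List[str]]:
    """Try each (name, loader) pair in order and return the first success."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader(path)
        except Exception as exc:  # broad on purpose: parser failures vary
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError(f"all loaders failed for {path}: " + "; ".join(errors))


# Hypothetical stand-ins for real readers.
def broken_loader(path: str) -> List[str]:
    raise IndexError("list index out of range")


def working_loader(path: str) -> List[str]:
    return [f"text of {path}"]


used, docs = load_with_fallback(
    "judgement3.pdf",
    [("primary", broken_loader), ("fallback", working_loader)],
)
print(used, docs)
```

In practice each stand-in would be replaced by a small function that wraps one reader's load call for a single file.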
Bug Description
While loading a PDF document using SimpleDirectoryReader, it throws IndexError: list index out of range for some PDF documents. It works for some PDF documents but not for others.
from llama_index import SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import DatasetGenerator
documents = SimpleDirectoryReader(input_files=['../data/judgement3.pdf']).load_data()
# shuffle documents
import random
random.seed(42)
random.shuffle(documents)
gpt_35_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0))
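Since only some PDFs fail, it may help to load each file on its own and record which ones raise. This is a rough sketch under stated assumptions: `find_failing_files` and `fake_load` are hypothetical names, and `fake_load` stands in for a real per-file load call.

```python
from typing import Callable, Dict, Iterable, List


def find_failing_files(
    paths: Iterable[str], load_one: Callable[[str], List[str]]
) -> Dict[str, str]:
    """Load each path separately and map failing paths to their error."""
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            load_one(path)
        except Exception as exc:  # record the failure and keep going
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures


# Hypothetical stand-in: fails only for one specific file name.
def fake_load(path: str) -> List[str]:
    if path.endswith("judgement3.pdf"):
        raise IndexError("list index out of range")
    return [path]


failures = find_failing_files(["a.pdf", "judgement3.pdf"], fake_load)
print(failures)
```

The resulting list of failing files makes it easier to share a single problematic PDF when reporting the bug.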
Version
0.8.50
Steps to Reproduce
`from llama_index import SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import DatasetGenerator
documents = SimpleDirectoryReader(input_files=['../data/judgement3.pdf']).load_data()
# shuffle documents
import random
random.seed(42)
random.shuffle(documents)
gpt_35_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0))`
Relevant Logs/Tracebacks