BadrinathMJ commented 1 year ago

Bug Description

Loading PDF documents with SimpleDirectoryReader throws IndexError: list index out of range for some PDFs. Other PDF documents load without any problem.

from llama_index import SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import DatasetGenerator

documents = SimpleDirectoryReader(input_files=['../data/judgement3.pdf']).load_data()

# shuffle documents
import random

random.seed(42)
random.shuffle(documents)

gpt_35_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0))

Version

0.8.50

Steps to Reproduce

Install the necessary dependencies: !pip install llama-index pypdf sentence-transformers ragas
Run the following code snippet against several different PDF documents. It throws the IndexError for some of them.

`from llama_index import SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import DatasetGenerator

documents = SimpleDirectoryReader(input_files=['../data/judgement3.pdf']).load_data()

# shuffle documents
import random

random.seed(42)
random.shuffle(documents)

gpt_35_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0))`
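Since only some PDFs fail, it can help to load each file on its own and record which ones raise. The following is a minimal sketch, not part of the original report: the helper `find_failing_files`, the stand-in `fake_loader`, and the path `../data/judgement1.pdf` are hypothetical names used for illustration only.

```python
from typing import Callable, Dict, Iterable, List


def find_failing_files(
    paths: Iterable[str],
    load_one: Callable[[str], object],
) -> Dict[str, str]:
    """Try to load each file separately and record the ones that raise.

    Returns a mapping from path to an "ExceptionType: message" string.
    """
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            load_one(path)
        except Exception as exc:  # broad on purpose: this is only for triage
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in loader for demonstration only (hypothetical). In the setup from
# this issue it could be swapped for something like:
#   lambda p: SimpleDirectoryReader(input_files=[p]).load_data()
def fake_loader(path: str) -> List[str]:
    if "judgement3" in path:
        raise IndexError("list index out of range")
    return ["page text"]


failures = find_failing_files(
    ["../data/judgement1.pdf", "../data/judgement3.pdf"],  # first path is made up
    fake_loader,
)
print(failures)
```

Passing the loader in as a callable keeps the loop independent of any particular reader, so the same triage can be repeated with a different PDF loader.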

Relevant Logs/Tracebacks

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
d:\Data_Llama2\FineTuning_with_OpenAI\fine_tuning_gpt3.5_turbo1.ipynb Cell 4 line 7
      2 from llama_index.llms import OpenAI
      4 from llama_index.evaluation import DatasetGenerator
----> 7 documents = SimpleDirectoryReader(input_files = ['D:\Data_Llama2\FineTuning_with_OpenAI\data\judgement3.pdf']).load_data()
      9 #shffle documents
     10 import random

File d:\Users\Mayur1\envs\llama\lib\site-packages\llama_index\readers\file\base.py:204, in SimpleDirectoryReader.load_data(self)
    202     self.file_extractor[file_suffix] = reader_cls()
    203 reader = self.file_extractor[file_suffix]
--> 204 docs = reader.load_data(input_file, extra_info=metadata)
    206 # iterate over docs if needed
    207 if self.filename_as_id:

File d:\Users\Mayur1\envs\llama\lib\site-packages\llama_index\readers\file\docs_reader.py:39, in PDFReader.load_data(self, file, extra_info)
     36 docs = []
     37 for page in range(num_pages):
     38     # Extract the text from the page
---> 39     page_text = pdf.pages[page].extract_text()
     40     page_label = pdf.page_labels[page]
     42     metadata = {"page_label": page_label, "file_name": file.name}

File d:\Users\Mayur1\envs\llama\lib\site-packages\pypdf\_page.py:2268, in PageObject.extract_text(self, Tj_sep, TJ_sep, orientations, space_width, visitor_operand_before, visitor_operand_after, visitor_text, *args)
...
--> 480     if words[3] != b"put":
    481         continue
    482     try:

IndexError: list index out of range

dosubot[bot] commented 1 year ago

🤖

Hello @BadrinathMJ,

I'm here to assist you with your issue. I'm an AI developed to help solve bugs, answer questions, and guide you in becoming a contributor while you're waiting for a human maintainer.

From the traceback you provided, it seems like the issue is occurring because the PDFReader class is trying to access a page label for a page that doesn't exist in the PDF document. This is likely the cause of the "IndexError: list index out of range" error you're encountering.

Here's a potential solution: you could modify the code to check if a page label exists for a page before trying to access it. Here's an example of how you could do this:

for page in range(num_pages):
    # Extract the text from the page
    page_text = pdf.pages[page].extract_text()

    # Check if a page label exists for the page
    if page < len(pdf.page_labels):
        page_label = pdf.page_labels[page]
    else:
        page_label = None

    metadata = {"page_label": page_label, "file_name": file.name}
    if extra_info is not None:
        metadata.update(extra_info)

    docs.append(Document(text=page_text, metadata=metadata))

This version checks whether a page label exists before accessing it and falls back to None when it does not.

As for your question about changes in the SimpleDirectoryReader or PDFReader class in the LlamaIndex codebase between version 0.8.50 and the latest version 0.8.53.post3, I'm unable to find any changes that could potentially resolve this issue.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

Best, Dosu

Sources

llama_index/readers/file/docs_reader.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Disiok commented 1 year ago

Seems like there's something special about your PDF and it's breaking inside of pypdf. Maybe try a different pdf loader (e.g. PyMuPDFReader). You can find more at https://llamahub.ai/
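The traceback in the report points at the `extract_text()` call and ends inside pypdf, so another possible workaround is to catch the error per page and keep the pages that do parse. The following is only a sketch under assumptions: the helper `extract_pages_safely` and the stub page classes are hypothetical, and whether any page of the affected PDF would parse this way is not known from the thread. With pypdf, the `pages` argument could be `pypdf.PdfReader(path).pages`.

```python
from typing import Iterable, List, Optional, Tuple


def extract_pages_safely(
    pages: Iterable[object],
) -> Tuple[List[Optional[str]], List[int]]:
    """Call extract_text() on each page, tolerating IndexError per page.

    Returns (texts, failed), where texts[i] is None for a page that raised
    and failed lists the zero-based indices of those pages.
    """
    texts: List[Optional[str]] = []
    failed: List[int] = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except IndexError:
            # Keep a placeholder so page positions stay aligned.
            texts.append(None)
            failed.append(index)
    return texts, failed


# Stub pages for demonstration only (hypothetical, not pypdf classes).
class GoodPage:
    def extract_text(self) -> str:
        return "hello"


class BrokenPage:
    def extract_text(self) -> str:
        raise IndexError("list index out of range")


texts, failed = extract_pages_safely([GoodPage(), BrokenPage(), GoodPage()])
print(texts, failed)
```

Keeping a `None` placeholder rather than dropping the page preserves page numbering, which matters if page labels are attached to the extracted text afterwards.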

run-llama / llama_index

[Bug]: IndexError: list index out of range #8520
