run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: SimpleDirectoryReader failed to load 1 file (but other 100 files are fine) #13438

Closed (130jd closed this issue 1 month ago)

130jd commented 1 month ago

Question

I'm using SimpleDirectoryReader(filename_as_id=True).load_data() on 350 PDF documents. It works fine on all except 1 PDF, which throws this error.

Failed to load file C:\Users\MyUsername\OneDrive\Documents\FolderName\FolderName\**[AuthorFirstName-AuthorLastName] CompanyName, Inc..PDF** with error: RetryError[<Future at 0x#########d0 state=finished raised IndexError>]. Skipping...

I've checked the document's content and its filename. I thought characters like [ ] , .. might cause issues, but those appear in other PDFs that are read just fine.

Any suggestions for what else I could investigate about this 1 PDF file that would explain it?

Thank you

dosubot[bot] commented 1 month ago

To address the issue with SimpleDirectoryReader failing to load one specific PDF file, consider the following targeted steps:

  1. Validate the PDF: Ensure the file isn't corrupted by opening it with a PDF reader or using a tool like pdftotext to check its readability.
  2. File Accessibility: Verify the file's existence and accessibility at the given path with os.path.exists() and os.access().
  3. File Metadata: Compare the metadata of the problematic file with those successfully loaded, focusing on attributes like file size and modification dates.
  4. Encoding and Error Handling: Adjust the encoding and errors parameters in SimpleDirectoryReader to accommodate any unique characters or errors in the PDF.
  5. Custom File Extractor Logic: If using a custom file_extractor for .pdf files, ensure it's correctly processing the file.
  6. Debugging: Set raise_on_error to True so the underlying exception is raised with its full traceback instead of being skipped, and add logging around the load_file method to trace the file's processing.

By following these steps, you should be able to pinpoint and resolve the issue with the specific PDF file.
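
Steps 2 and 6 above can be sketched roughly as follows. This is a minimal illustration, not part of LlamaIndex: the helper names describe_file and load_single_file are made up for this example, and the second helper assumes llama-index is installed and that the reader accepts the input_files, filename_as_id, and raise_on_error arguments mentioned in this thread.

```python
import os


def describe_file(path):
    """Collect basic facts about a file that might explain a load failure."""
    exists = os.path.exists(path)
    return {
        "exists": exists,
        "readable": exists and os.access(path, os.R_OK),
        "size_bytes": os.path.getsize(path) if exists else None,
    }


def load_single_file(path):
    """Load one file on its own so the underlying exception is raised, not skipped."""
    # Imported lazily so describe_file() works without llama-index installed.
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(
        input_files=[path],
        filename_as_id=True,
        raise_on_error=True,  # surface the real traceback instead of "Skipping..."
    )
    return reader.load_data()
```

Running describe_file on the problem PDF and on one that loads correctly gives a quick side-by-side comparison; load_single_file then isolates the failing file from the rest of the directory.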

Sources:

- [llama-index-core/llama_index/core/readers/file/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/readers/file/base.py)
- [docs/docs/module_guides/loading/simpledirectoryreader.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/module_guides/loading/simpledirectoryreader.md)
- [docs/docs/examples/data_connectors/simple_directory_reader.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/simple_directory_reader.ipynb)
- [llama-index-legacy/llama_index/legacy/readers/file/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/readers/file/base.py)
- [llama-index-core/tests/test_utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/tests/test_utils.py)
- [llama-index-core/llama_index/core/query_engine/retry_source_query_engine.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/query_engine/retry_source_query_engine.py)
- [llama-index-integrations/llms/llama-index-llms-cohere/llama_index/llms/cohere/utils.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-cohere/llama_index/llms/cohere/utils.py)

**About Dosu** This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

130jd commented 1 month ago

Thanks @dosu

I tried your suggestion and set raise_on_error to True

This was the error message; does it help you diagnose what in my PDF document caused the error?

Traceback (most recent call last):
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 470, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\llama_index\readers\file\docs\base.py", line 88, in load_data
    page_text = pdf.pages[page].extract_text()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_page.py", line 2083, in extract_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_page.py", line 1593, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_cmap.py", line 33, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_cmap.py", line 58, in build_char_map_from_dict
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_cmap.py", line 227, in parse_to_unicode
    return type1_alternative(ft, map_dict, space_code, int_entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_cmap.py", line 498, in type1_alternative
    i = int(words[1])
            ~~~~~^^^
IndexError: list index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\llama_index\core\readers\file\base.py", line 416, in load_file
    docs = reader.load_data(input_file, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 330, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 467, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 368, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 411, in exc_check
    raise retry_exc from fut.exception()

dosubot[bot] commented 1 month ago

The error comes from the pypdf library, not from SimpleDirectoryReader itself. The traceback ends in type1_alternative in pypdf's _cmap.py, which runs while pypdf builds a character map for one of the PDF's fonts during text extraction. The failing statement, int(words[1]), reads the second token of a split line, so the IndexError means a line in that font's data has fewer tokens than the parser expects. This points to an unusual or malformed font definition in this one PDF rather than a problem with its filename or path.
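
The failure mode can be reproduced in isolation. The sketch below is not pypdf's code; it only mimics the split-then-index pattern visible in the last traceback frame, and the function names and sample input lines are invented for illustration.

```python
def parse_code_point(line):
    """Mimic the failing pattern: split a line and convert its second token."""
    words = line.split()
    return int(words[1])  # raises IndexError when the line has fewer than two tokens


def parse_code_point_safely(line):
    """Same parse, but return None for lines that lack a second numeric token."""
    words = line.split()
    if len(words) < 2:
        return None
    try:
        return int(words[1])
    except ValueError:
        return None
```

A line with a single token triggers the same "list index out of range" message seen in the traceback, while the guarded variant skips such lines.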

To address this issue, consider the following steps:

  1. Ensure the PDF is not empty or encrypted: Verify that the document does not contain unreadable pages or encrypted content that pypdf cannot process.
  2. Update Dependencies: Make sure you're using the latest versions of llama-index, llama-index-readers-file, and pypdf, as newer releases may contain fixes for handling unusual fonts.
  3. Preprocess the PDF: Try re-saving the PDF with a tool like Ghostscript so its fonts and structure are rewritten, or extract the text directly with pdftotext, before attempting text extraction again.
  4. Use Alternative Libraries: If the problem persists, explore other PDF processing libraries such as PyMuPDF (fitz) or pdfminer.six, which may handle the specific issues in your document better. PyPDF2 is the deprecated predecessor of pypdf, so switching to it is unlikely to help.
  5. Debugging and Manual Handling: Add detailed logging to identify the exact failure point or consider manually extracting the text for critical documents.

These steps are aimed at resolving the specific character mapping or font handling issue indicated by the error message. For further assistance, you might also explore the documentation or community forums for pypdf and SimpleDirectoryReader for insights into similar issues and their resolutions.
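
Steps 1 and 5 can be combined into a small per-page scan. This is a sketch under assumptions: extract_pages and scan_pdf are hypothetical helper names, and scan_pdf assumes pypdf is installed. It extracts each page separately so that one bad page is reported instead of aborting the whole file.

```python
def extract_pages(pages):
    """Extract text page by page, recording failures instead of aborting."""
    texts, failures = {}, {}
    for number, page in enumerate(pages, start=1):
        try:
            texts[number] = page.extract_text()
        except Exception as exc:  # broad on purpose: survive any parser error
            failures[number] = f"{type(exc).__name__}: {exc}"
    return texts, failures


def scan_pdf(path):
    """Open a PDF with pypdf and report which pages fail to extract."""
    from pypdf import PdfReader  # lazy import; requires pypdf to be installed

    reader = PdfReader(path)
    if reader.is_encrypted:
        raise ValueError(f"{path} is encrypted; decrypt it before extracting text")
    return extract_pages(reader.pages)
```

The failures dictionary maps 1-based page numbers to the error raised on that page, which narrows down where the problematic font is used and lets the remaining pages still be indexed.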

Sources:

- [[Bug]: IndexError: list index out of range](https://github.com/run-llama/llama_index/discussions/8571)
- [llama-index-legacy/llama_index/legacy/readers/file/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/readers/file/base.py)
- [llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/pymu_pdf/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/pymu_pdf/base.py)
- [llama-index-integrations/readers/llama-index-readers-file/tests/test_docs.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/tests/test_docs.py)
- [llama-index-core/llama_index/core/readers/file/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/readers/file/base.py)
- [llama-index-legacy/llama_index/legacy/readers/file/docs_reader.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/readers/file/docs_reader.py)
- [llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/docs/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/docs/base.py)
- [llama-index-integrations/readers/llama-index-readers-preprocess/llama_index/readers/preprocess/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-preprocess/llama_index/readers/preprocess/base.py)