Closed 130jd closed 1 month ago
To address the issue with SimpleDirectoryReader failing to load one specific PDF file, consider the following targeted steps:
1. Check that the PDF itself is intact, for example by running it through a tool such as pdftotext to check its readability.
2. Verify the file path and permissions with os.path.exists() and os.access().
3. Adjust the encoding and errors parameters in SimpleDirectoryReader to accommodate any unique characters or errors in the PDF.
4. If you use a custom file_extractor for .pdf files, ensure it's correctly processing the file.
5. Set raise_on_error to True to get detailed error messages, and add logging to the load_file method to trace the file's processing.
By following these steps, you should be able to pinpoint and resolve the issue with the specific PDF file.
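The path and permission checks mentioned above (os.path.exists() and os.access()) can be sketched roughly as follows. The helper names check_pdf_access and _has_pdf_header are hypothetical, and the header check is an extra assumption based on the fact that a PDF file normally starts with the bytes "%PDF-". It uses only the standard library and does not parse the PDF.

```python
import os


def _has_pdf_header(path):
    """Return True if the file starts with the usual PDF magic bytes."""
    try:
        with open(path, "rb") as f:
            return f.read(5) == b"%PDF-"
    except OSError:
        # Covers unreadable files and directories passed by mistake.
        return False


def check_pdf_access(path):
    """Collect basic facts about a file before handing it to a PDF reader.

    Hypothetical helper for illustration; not part of llama_index or pypdf.
    """
    exists = os.path.exists(path)
    return {
        "exists": exists,
        "is_file": os.path.isfile(path),
        "readable": exists and os.access(path, os.R_OK),
        "size_bytes": os.path.getsize(path) if exists else None,
        "has_pdf_header": _has_pdf_header(path) if exists else False,
    }
```

If every value looks normal for the failing file, the path and permissions are unlikely to be the cause, and the problem is more likely inside the PDF content itself.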
To continue the conversation, mention @dosu.
Thanks @dosu
I tried your suggestion and set raise_on_error to True. This was the error message; does it help you diagnose what in my PDF document caused the error?
Traceback (most recent call last):
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 470, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\llama_index\readers\file\docs\base.py", line 88, in load_data
page_text = pdf.pages[page].extract_text()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_page.py", line 2083, in extract_text
return self._extract_text(
^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_page.py", line 1593, in _extract_text
cmaps[f] = build_char_map(f, space_width, obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_cmap.py", line 33, in build_char_map
font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_cmap.py", line 58, in build_char_map_from_dict
map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_cmap.py", line 227, in parse_to_unicode
return type1_alternative(ft, map_dict, space_code, int_entry)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\pypdf\_cmap.py", line 498, in type1_alternative
i = int(words[1])
~~~~~^^^
IndexError: list index out of range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\llama_index\core\readers\file\base.py", line 416, in load_file
docs = reader.load_data(input_file, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 330, in wrapped_f
return self(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 467, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 368, in iter
result = action(retry_state)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\MyUsername\anaconda3\envs\openai-env\Lib\site-packages\tenacity\__init__.py", line 411, in exc_check
raise retry_exc from fut.exception()
The error you're encountering with SimpleDirectoryReader when processing this specific PDF document is due to an IndexError raised inside the pypdf library, in the character mapping (font) code that runs during text extraction. The traceback ends in type1_alternative in pypdf's _cmap.py, at i = int(words[1]), which means the code indexed past the end of a list, likely because the PDF contains font data in a structure that pypdf does not expect or support.
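The last frame of the traceback fails at i = int(words[1]). As a minimal illustration of that failure mode (this is not pypdf's actual code, and parse_index and its input format are hypothetical), indexing the second field of a split line raises IndexError whenever the line has fewer than two fields. A guarded version looks like this:

```python
def parse_index(line):
    """Return the integer in the second whitespace-separated field, or None.

    Hypothetical helper: it only illustrates the words[1] pattern from the
    traceback and makes no claim about how pypdf parses font data.
    """
    words = line.split()
    if len(words) < 2:
        # Without this guard, words[1] raises
        # "IndexError: list index out of range".
        return None
    try:
        return int(words[1])
    except ValueError:
        # The second field exists but is not an integer.
        return None
```

In this sketch, a line with a single token returns None instead of raising, which is the kind of input an unguarded words[1] cannot handle.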
To address this issue, consider the following steps:
1. Check whether the PDF contains fonts or other content that pypdf cannot process.
2. Update SimpleDirectoryReader and its dependencies, as updates may contain fixes for handling complex PDFs.
These steps are aimed at resolving the specific character mapping or font handling issue indicated by the error message. For further assistance, you might also explore the documentation or community forums for pypdf and SimpleDirectoryReader for insights into similar issues and their resolutions.
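One way to narrow this down is to call extract_text() page by page outside of SimpleDirectoryReader and record which pages raise. The sketch below assumes pypdf is installed; find_failing_pages and scan_pdf are hypothetical names, and the first helper only assumes each page object has an extract_text() method, as pypdf page objects do.

```python
def find_failing_pages(pages):
    """Call extract_text() on each page and collect the ones that raise.

    Returns a list of (page_number, exception_type_name, message) tuples,
    with page numbers starting at 1.
    """
    failures = []
    for number, page in enumerate(pages, start=1):
        try:
            page.extract_text()
        except Exception as exc:  # keep going so every bad page is reported
            failures.append((number, type(exc).__name__, str(exc)))
    return failures


def scan_pdf(path):
    """Run the check against a real file; requires pypdf to be installed."""
    # Imported lazily so find_failing_pages itself has no dependency.
    from pypdf import PdfReader

    return find_failing_pages(PdfReader(path).pages)
```

Calling scan_pdf on the failing file should show whether the IndexError comes from a single page (and so probably a single embedded font) or from every page, which is useful detail to include in a bug report to pypdf.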
Question
I'm using SimpleDirectoryReader(filename_as_id=True).load_data() on 350 PDF documents. It works fine on all except 1 PDF, which throws this error:
Failed to load file C:\Users\MyUsername\OneDrive\Documents\FolderName\FolderName\**[AuthorFirstName-AuthorLastName] CompanyName, Inc..PDF** with error: RetryError[<Future at 0x#########d0 state=finished raised IndexError>]. Skipping...
I've checked the content of the document and the filename (I thought characters like [ ] , .. might cause issues, but those appear in other PDFs that get read just fine).
Any suggestions for what else I could investigate about this 1 PDF file that would explain it?
Thank you