py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

KeyError: '/Root' #2410

Closed akshaysatyam2 closed 10 months ago

akshaysatyam2 commented 10 months ago

I was trying to read a few PDF files. Most of them worked properly, but one file gave the following error. Exception handling didn't work for me.

Environment

Python 3.10.0 pypdf2==3.0.1

Code

This is a minimal, complete example that shows the issue:

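    from PyPDF2 import PdfReader
    from PyPDF2.errors import PdfReadError


    def read_pdf(file_path):
        """Return the text of all pages, or None if the PDF cannot be read."""
        try:
            with open(file_path, 'rb') as file:
                reader = PdfReader(file)
                text = ''
                for page in reader.pages:
                    text += page.extract_text()
                return text
        except PdfReadError:
            # Only PdfReadError is handled here; the KeyError below is not caught.
            return None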

Traceback

This is the complete traceback I see:


incorrect startxref pointer(3)
Traceback (most recent call last):
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 144, in <module>
    process_resume(file)
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 85, in process_resume
    text = read_file(file_path)
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 38, in read_file
    return read_pdf(file_path)
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 50, in read_pdf
    for page in reader.pages:
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\PyPDF2\_page.py", line 2081, in __iter__
    for i in range(len(self)):
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\PyPDF2\_page.py", line 2063, in __len__
    return self.length_function()
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\PyPDF2\_reader.py", line 448, in _get_num_pages
    self._flatten()
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\PyPDF2\_reader.py", line 1101, in _flatten
    catalog = self.trailer[TK.ROOT].get_object()
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\PyPDF2\generic\_data_structures.py", line 266, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/Root'

stefan6419846 commented 10 months ago

PyPDF2 is deprecated, please move to pypdf. If this still is an issue with the latest pypdf version, please provide a corresponding example file as well.

akshaysatyam2 commented 10 months ago

Yes, the issue still persists. I am now using pypdf==3.17.4, with the rest of the configuration as mentioned above.

Getting

invalid pdf header: b'-----'
incorrect startxref pointer(3)
Traceback (most recent call last):
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 144, in <module>
    process_resume(file)
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 85, in process_resume
    text = read_file(file_path)
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 38, in read_file
    return read_pdf(file_path)
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 50, in read_pdf
    for page in reader.pages:
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\_page.py", line 2572, in __iter__
    for i in range(len(self)):
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\_page.py", line 2503, in __len__
    return self.length_function()
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\_reader.py", line 462, in _get_num_pages
    self._flatten()
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\_reader.py", line 1234, in _flatten
    catalog = self.trailer[TK.ROOT].get_object()
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\generic\_data_structures.py", line 333, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/Root'

stefan6419846 commented 10 months ago

Then please provide the corresponding PDF file for further analysis.

akshaysatyam2 commented 10 months ago

sent via mail.

stefan6419846 commented 10 months ago

These are not bugs in pypdf, but data quality issues on your side. Of the four files you sent, two work correctly, while the other two are multipart messages that contain a PDF file. You have to unpack the corresponding PDF parts yourself before feeding them into pypdf.

You can verify this yourself by opening the files in a text editor, for example, which shows something like this:

------WebKitFormBoundarygyvwfd7TBFBliSTO
Content-Disposition: form-data; name="file"; filename="blob"
Content-Type: application/pdf

%PDF-1.7
[...]
%%EOF

------WebKitFormBoundarygyvwfd7TBFBliSTO
Content-Disposition: form-data; name="persist"

true
[...]
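
The unpacking step could be sketched roughly as follows. This is only a minimal illustration, not part of pypdf: the helper name `extract_embedded_pdf` is hypothetical, and it assumes the file contains exactly one PDF part, which it locates by slicing from the first `%PDF-` header through the last `%%EOF` marker instead of parsing the multipart boundaries properly.

```python
PDF_HEADER = b"%PDF-"
PDF_EOF = b"%%EOF"


def extract_embedded_pdf(raw: bytes) -> bytes:
    """Return the bytes from the first %PDF- header through the last %%EOF.

    Hypothetical helper: assumes a single PDF part wrapped in other data,
    such as a saved multipart/form-data body.
    """
    start = raw.find(PDF_HEADER)
    end = raw.rfind(PDF_EOF)
    if start == -1 or end == -1 or end < start:
        raise ValueError("no embedded PDF found")
    # Keep the %%EOF marker itself, drop everything after it.
    return raw[start:end + len(PDF_EOF)]


def read_and_extract(file_path: str) -> bytes:
    """Read a file from disk and return only its embedded PDF bytes."""
    with open(file_path, "rb") as handle:
        return extract_embedded_pdf(handle.read())
```

The returned bytes can then be wrapped in `io.BytesIO` and passed to `PdfReader`, which accepts a file-like stream. If a file may hold several PDF parts, a real multipart parser would be the safer choice.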