Closed akshaysatyam2 closed 10 months ago
PyPDF2 is deprecated, please move to pypdf. If this is still an issue with the latest pypdf version, please provide a corresponding example file as well.
Yes, the issue still persists. I am now using pypdf==3.17.4 with the configuration mentioned above.
Getting:
invalid pdf header: b'-----'
incorrect startxref pointer(3)
Traceback (most recent call last):
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 144, in <module>
    process_resume(file)
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 85, in process_resume
    text = read_file(file_path)
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 38, in read_file
    return read_pdf(file_path)
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\blah.py", line 50, in read_pdf
    for page in reader.pages:
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\_page.py", line 2572, in __iter__
    for i in range(len(self)):
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\_page.py", line 2503, in __len__
    return self.length_function()
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\_reader.py", line 462, in _get_num_pages
    self._flatten()
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\_reader.py", line 1234, in _flatten
    catalog = self.trailer[TK.ROOT].get_object()
  File "C:\Users\AkshayKumar\Documents\Techdome\ResumeReader\venv\lib\site-packages\pypdf\generic\_data_structures.py", line 333, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/Root'
Then please provide the corresponding PDF file for further analysis.
Sent via mail.
These are not bugs in pypdf, but data quality issues on your side. Of the four files you sent, two work correctly, while the other two are multipart messages that contain a PDF file. You have to unpack the corresponding PDF parts yourself before feeding them into pypdf.
You can verify this yourself by opening the files in a text editor, for example, which shows something like this:
------WebKitFormBoundarygyvwfd7TBFBliSTO
Content-Disposition: form-data; name="file"; filename="blob"
Content-Type: application/pdf
%PDF-1.7
[...]
%%EOF
------WebKitFormBoundarygyvwfd7TBFBliSTO
Content-Disposition: form-data; name="persist"
true
[...]
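If re-saving the files correctly at the upload step is not an option, the unpacking can be approximated in a few lines. The following is only a rough sketch, not something pypdf provides: the helper names `looks_like_pdf` and `extract_embedded_pdf` are made up for illustration, the sample bytes are a tiny stand-in for the real files, and it assumes each file wraps exactly one PDF and that no later form part contains the text `%%EOF`.

```python
PDF_MAGIC = b"%PDF-"
PDF_EOF = b"%%EOF"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data begins with the PDF header marker."""
    return data.lstrip().startswith(PDF_MAGIC)


def extract_embedded_pdf(data: bytes) -> bytes:
    """Cut out the span from the first '%PDF-' to the last '%%EOF'.

    Hypothetical helper: assumes a single PDF wrapped in a multipart body.
    """
    start = data.find(PDF_MAGIC)
    end = data.rfind(PDF_EOF)
    if start == -1 or end == -1 or end < start:
        raise ValueError("no embedded PDF found")
    return data[start:end + len(PDF_EOF)] + b"\n"


# Made-up stand-in for one of the affected files (not a complete, valid PDF).
sample = (
    b"------WebKitFormBoundarygyvwfd7TBFBliSTO\r\n"
    b'Content-Disposition: form-data; name="file"; filename="blob"\r\n'
    b"Content-Type: application/pdf\r\n"
    b"\r\n"
    b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n%%EOF\r\n"
    b"------WebKitFormBoundarygyvwfd7TBFBliSTO\r\n"
    b'Content-Disposition: form-data; name="persist"\r\n'
    b"\r\n"
    b"true\r\n"
)

pdf_bytes = extract_embedded_pdf(sample)
print(looks_like_pdf(sample))     # False: the file starts with the boundary line
print(looks_like_pdf(pdf_bytes))  # True: the extracted part starts with %PDF-
```

The extracted bytes can then be wrapped in `io.BytesIO` and passed to `pypdf.PdfReader`, which accepts a file-like stream as well as a path. Running `looks_like_pdf` on every input first is a cheap way to detect which files need this treatment at all.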
I was trying to read a few PDF files. Most of them worked properly, but one file gave the following error. Exception handling didn't work for me.
Environment
Python 3.10.0 pypdf2==3.0.1
Code
This is a minimal, complete example that shows the issue:
Traceback
This is the complete traceback I see: