plangrid / pdf-annotate

Pure-python library for adding annotations to PDFs
MIT License

Missing page #63

Open thomassajot opened 3 years ago

thomassajot commented 3 years ago

Hello, surprisingly, some pages go missing when using pdf_annotate. Example PDF, with 212 pages: https://www.hkexgroup.com/-/media/HKEX-Group-Site/ssd/Investor-Relations/Regulatory-Reports/documents/2016/160321ar_e.pdf?la=en

When running the following code, the new file is missing 2 pages: the second page and the second-to-last page. Any idea why?

from pdf_annotate import PdfAnnotator
pdf_file = 'file_path_to.pdf'
copy_file = 'copy_file_path_to.pdf'
annotator = PdfAnnotator(pdf_file)
annotator.write(copy_file)

mjbryant commented 3 years ago

This is likely due to pdfrw, the underlying library that pdf_annotate uses to read, edit, and write PDFs. You could try reading in and writing back out that file using just pdfrw and see if the pages are missing.
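A minimal sketch of that check, assuming pdfrw is installed. The function names and file paths here are hypothetical and not part of pdfrw or pdf_annotate; the expected page count has to come from another source, such as a PDF viewer.

```python
def roundtrip_page_counts(src_path, dst_path):
    """Read a PDF with pdfrw, write it back out, and count pages at each step."""
    # Imported lazily so locate_page_loss() below works without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    pages_read = len(reader.pages)

    # Copy every page pdfrw found into a new file, with no pdf_annotate involved.
    writer = PdfWriter()
    writer.addpages(reader.pages)
    writer.write(dst_path)

    # Re-read the copy to see how many pages survived the write.
    pages_written = len(PdfReader(dst_path).pages)
    return pages_read, pages_written


def locate_page_loss(expected, pages_read, pages_written):
    """Report which step dropped pages: 'read', 'write', or 'none'."""
    if pages_read != expected:
        return "read"
    if pages_written != pages_read:
        return "write"
    return "none"
```

For the file in this report, one would call `roundtrip_page_counts('file_path_to.pdf', 'copy_file_path_to.pdf')` and pass the result to `locate_page_loss(212, ...)`. A result of "read" would suggest the pages are already lost when pdfrw parses the file, before pdf_annotate does anything with it.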

jerrian commented 3 years ago

I found a similar problem, and it comes from pdfrw's PdfReader, as shown below (test.pdf actually has 19 pages):

>>> from pdfrw import PdfReader
>>> from PyPDF2 import PdfFileReader
>>> filename = './test.pdf'
>>> pdf_reader = PdfReader(filename)
>>> len(pdf_reader.pages)
2
>>> pdf_file_reader = PdfFileReader(open(filename, 'rb'))
>>> pdf_file_reader.getNumPages()
19
>>> from PyPDF3 import PdfFileReader
>>> pdf_file_reader = PdfFileReader(open(filename, 'rb'))
>>> pdf_file_reader.getNumPages()
19

I raised this issue on the pdfrw repo and am still waiting for an answer, but I'm not sure I will get one, because there have been no changes there since 2018.
Can't use preexisting streams like pyPdf while initializing PdfReader

Could you allow PdfAnnotator to use PdfFileReader and PdfFileWriter from PyPDF3, or change it to use them? PyPDF3 is a fork of PyPDF2 and is still actively maintained.