pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

Can't use preexisting streams like pyPdf while initializing PdfReader #223

Open jerrian opened 3 years ago

jerrian commented 3 years ago

When I tried to get the page count of "test.pdf" with pdfrw's PdfReader, it reported 2 pages, but the file actually has 19. When I tried again with PdfFileReader from PyPDF2, it returned the correct count.

>>> from pdfrw import PdfReader
>>> from PyPDF2 import PdfFileReader
>>> filename = './test.pdf'
>>> pdf_reader = PdfReader(filename)
>>> len(pdf_reader.pages)
2
>>> pdf_file_reader = PdfFileReader(open(filename, 'rb'))
>>> pdf_file_reader.getNumPages()
19

I don't know why PdfReader reports the wrong page count, so I tried initializing PdfReader with a preexisting stream instead, which the following comment in the pdfrw source code says is allowed.

# Allow reading preexisting streams like pyPdf
if hasattr(fname, 'read'):
    fdata = fname.read()
else:
    try:
        f = open(fname, 'rb')
        fdata = f.read()
        f.close()

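But that also failed. I passed the PdfFileReader object itself, and the PdfFileReader classes in both pyPdf and PyPDF2 define read() with a required stream argument, as shown below, so the fname.read() call raises a TypeError.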

>>> pdf_reader2 = PdfReader(pdf_file_reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/pdf_test/venv/lib/python3.7/site-packages/pdfrw/pdfreader.py", line 565, in __init__
    fdata = fname.read()
TypeError: read() missing 1 required positional argument: 'stream'

# pyPdf
def read(self, stream):
    # start at the end:
    stream.seek(-1, 2)

# PyPDF2
def read(self, stream):
    debug = False
    if debug: print(">>read", stream)
    # start at the end:
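
For illustration, here is a minimal sketch of a more defensive version of the `hasattr(fname, 'read')` check. It is not pdfrw's code. `FakePdfFileReader` and `load_pdf_bytes` are hypothetical names invented for this example, and the sketch assumes that a reader object keeps its underlying file in a `stream` attribute, which may not hold for every pyPdf or PyPDF2 version.

```python
import io


class FakePdfFileReader:
    """Hypothetical stand-in for a pyPdf/PyPDF2-style reader object."""

    def __init__(self, stream):
        # Assumption: the reader keeps a reference to the file it was given.
        self.stream = stream

    def read(self, stream):
        # Like the methods quoted above, this read() requires a stream
        # argument, so calling read() with no arguments raises TypeError.
        stream.seek(-1, 2)


def load_pdf_bytes(source):
    """Return raw bytes from a path, a file-like object, or a reader wrapping one."""
    # If the object wraps a file-like object in `.stream`, unwrap it first.
    inner = getattr(source, "stream", None)
    if inner is not None and hasattr(inner, "read"):
        source = inner
    if hasattr(source, "read"):
        # Rewind when possible, since a reader may have moved the position.
        if hasattr(source, "seek"):
            source.seek(0)
        return source.read()
    # Otherwise treat the argument as a filename.
    with open(source, "rb") as f:
        return f.read()
```

With the current pdfrw code, the simpler workaround is to pass the open file object (or the filename) to PdfReader rather than the PdfFileReader instance, since a plain file object has a no-argument `read()`.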

Could you update the source code so that PdfReader works with those streams? I'm also attaching "test.pdf" so you can examine why the page count is wrong.

test.pdf