Merging Blank PDF - Githubissues

msopko81 commented 7 years ago

I am attempting to merge PDFs and resize the pages. When I add a blank page, I get an error. Here is the code that I have:

from pdfrw import PdfReader, PdfWriter, PageMerge

pdfs = [r'C:\Users\msopko\Desktop\tmp.pdf']
save_path = r'C:\Users\msopko\Desktop\new_tmp.pdf'

writer = PdfWriter()
for pdf in pdfs:
    reader = PdfReader(pdf)
    reader.uncompress()

    for i, page in enumerate(reader.pages):
        pp = PageMerge().add(page)

writer.write(save_path)

Here is the error I am seeing:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-70-d823ab082006> in <module>()
     10
     11     for i, page in enumerate(reader.pages):
---> 12         pp = PageMerge().add(page)
     13
     14 writer.write(save_path)

C:\Users\msopko\Envs\kml_internal\lib\site-packages\pdfrw\pagemerge.py in add(self, obj, prepend, **kw)
    169             obj = RectXObj(obj, **kw)
    170         elif obj.Type == PdfName.Page:
--> 171             obj = RectXObj(obj)
    172         if prepend:
    173             self.insert(0, obj)

C:\Users\msopko\Envs\kml_internal\lib\site-packages\pdfrw\pagemerge.py in __init__(self, page, viewinfo, **kw)
     50             viewinfo = ViewInfo(**kw)
     51         viewinfo.cacheable = False
---> 52         base = pagexobj(page, viewinfo)
     53         self.update(base)
     54         self.indirect = True

C:\Users\msopko\Envs\kml_internal\lib\site-packages\pdfrw\buildxobj.py in pagexobj(page, viewinfo, allow_compressed)
    285     mbox, bbox = getrects(inheritable, viewinfo, rotation)
    286     rotation += get_rotation(viewinfo.rotate)
--> 287     contents = _build_cache(page.Contents, allow_compressed)
    288     return _cache_xobj(contents, resources, mbox, bbox, rotation,
    289                        viewinfo.cacheable)

C:\Users\msopko\Envs\kml_internal\lib\site-packages\pdfrw\buildxobj.py in _build_cache(contents, allow_compressed)
    187     # assume that's not a problem until we encounter them...
    188
--> 189     xobj_copy = PdfDict(array[0])
    190     xobj_copy.private.xobj_cachedict = {}
    191     private.xobj_copy = xobj_copy

TypeError: 'NoneType' object is not subscriptable

I believe the error is occurring because the page is blank and does not have any contents to be copied over. Here is the example PDF that I am using: tmp.pdf

tisimst commented 7 years ago

I am running into this error as well. Here is a very simple blank example file: blank-page-1.pdf

So the question is: how can a page without a /Contents entry still get processed? Following the traceback above, page.Contents is None, so array (inside the _build_cache function) is None and cannot be indexed, which is what raises the TypeError. What should happen so that processing can proceed?

pmaupin commented 7 years ago

According to the PDF reference, content streams are mandatory for pages:

Each page of a document is represented by one or more content streams. Content streams are also used to package sequences of instructions as self-contained graphical elements, such as forms (see Section 4.9, “Form XObjects”), patterns (Section 4.6, “Patterns”), certain fonts (Section 5.5.4, “Type 3 Fonts”), and annotation appearances (Section 8.4.4, “Appearance Streams”).

The spec allows (and pdfrw already supports) empty content streams. But that's different than a missing content stream. One is zero-length data, and the other is, frankly, a structural file problem. If you are generating these PDFs with something else, complain to the producer software's author that they should stop making broken PDFs.

Obviously, though, there are so many PDFs out there that are broken in this fashion that the major viewers have thrown in the towel and decided to support them. I might accept a patch for this, if and only if it was comprehensive and didn't slow down processing for everybody else who is not working with PDFs generated by non-compliant writers.

In any case, though, you can quite easily fix the PDFs in your own pdfrw client code. For example:

from pdfrw import PdfDict

for mypage in allmypages:
    # A missing /Contents entry is replaced with an empty content stream
    if not mypage.Contents:
        mypage.Contents = PdfDict(stream='')
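
A fuller sketch of the same workaround, applied to the merge loop from the original report. The names patch_missing_contents, make_empty_stream, and merge_with_blank_pages are hypothetical and only for illustration; they are not part of pdfrw. The pdfrw calls themselves (PdfReader, PdfWriter.addpage, PageMerge().add(...).render(), PdfDict(stream='')) are assumed to behave as in the snippets in this thread and in the library's example scripts.

```python
def patch_missing_contents(pages, make_empty_stream):
    """Give every page that lacks /Contents an empty content stream.

    `pages` is any iterable of page-like objects with a `Contents`
    attribute; `make_empty_stream` is a zero-argument factory
    (hypothetical parameter) that returns the replacement object.
    Returns the number of pages that were patched.
    """
    patched = 0
    for page in pages:
        if not page.Contents:
            page.Contents = make_empty_stream()
            patched += 1
    return patched


def merge_with_blank_pages(paths, save_path):
    """Merge the PDFs in `paths` into `save_path`, tolerating blank pages."""
    # pdfrw is imported here so the helper above can be used without it.
    from pdfrw import PageMerge, PdfDict, PdfReader, PdfWriter

    writer = PdfWriter()
    for path in paths:
        reader = PdfReader(path)
        # Patch before PageMerge sees the pages, so that the content
        # lookup in buildxobj never receives None.
        patch_missing_contents(reader.pages, lambda: PdfDict(stream=''))
        for page in reader.pages:
            # render() turns the merged page back into a page object
            # that the writer can accept.
            writer.addpage(PageMerge().add(page).render())
    writer.write(save_path)
```

Keeping the check in a separate helper means the per-page cost is a single attribute test, and pages that already have a content stream are left untouched.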

tisimst commented 7 years ago

That's really good to know. I will contact the company for the software and submit a bug report. And you're right, most viewers that I've tried still handle the problematic files just fine. Interestingly, when I run the file through Ghostscript's pdf2pdf, it adds a content stream (not sure what it actually is since it's compressed), so it's no longer a problem.

In any case, what you've mentioned is perfect for my needs with the recent addition of more decryption/decompression functionality. For any files that were problems before that I needed to run through pdf2pdf or the like, I no longer need to if I read in the files with decrypt=True. So, thanks for helping that get into the core! This library is amazingly helpful!

By the way, why is decrypt=False by default? Most PDFs are encrypted/compressed in some way, so why not make it True by default? Just curious.

msopko81 commented 7 years ago

Thank you for the explanation.

tisimst commented 7 years ago

Can the above fix be incorporated into buildxobj.py? I've just tested it out on all the files I have with blank pages and missing /Contents streams, and it works beautifully. It would be very nice not to have to do this externally, but I will if needed. Thanks again!

pmaupin commented 7 years ago

@tisimst It could probably be incorporated in buildxobj with minimal speed impact, so it might be worthwhile investigating that, but I'm busy now (but patches are welcome :)

As far as why decrypt is False by default, there are a lot of useful things you can do with PDFs without decrypting them. See, for instance, all the example scripts...

tisimst commented 7 years ago

Well, there are a bunch of things that NOT having this feature made challenging for me. Notably, when a document's security settings only allow you to print, many of the example scripts fail. With decrypt=True, however, they all work wonderfully. You're right, though, that many files work just fine without decrypting them. I can't tell you how amazing the speed of this library is (even with decrypt=True). It makes batch-processing entire folders of files an absolutely joyful experience. Thanks for creating it!

As for the patch, I'll see what I can do. Should be easy to incorporate.

pmaupin / pdfrw

Merging Blank PDF #101