Closed msopko81 closed 7 years ago
I am running into this error as well. Here is a very simple blank example file: blank-page-1.pdf
So, the question is, how can a file without a /Contents property still get processed? Going off the error trail above, page.Contents is None, so array (inside the _build_cache command) is None and therefore has no length, which causes errors all over the place. What should happen so things can proceed?
According to the PDF reference, content streams are mandatory for pages:
Each page of a document is represented by one or more content streams. Content streams are also used to package sequences of instructions as self-contained graphical elements, such as forms (see Section 4.9, “Form XObjects”), patterns (Section 4.6, “Patterns”), certain fonts (Section 5.5.4, “Type 3 Fonts”), and annotation appearances (Section 8.4.4, “Appearance Streams”).
The spec allows (and pdfrw already supports) empty content streams. But that's different than a missing content stream. One is zero-length data, and the other is, frankly, a structural file problem. If you are generating these PDFs with something else, complain to the producer software's author that they should stop making broken PDFs.
Obviously, though, there are so many PDFs out there that are broken in this fashion that the major viewers have thrown in the towel and decided to support them. I might accept a patch for this, if and only if it was comprehensive and didn't slow down processing for everybody else who is not working with PDFs generated by non-compliant writers.
In any case, though, you can quite easily fix the PDFs in your own pdfrw client code. For example:
    from pdfrw import PdfDict

    for mypage in allmypages:
        if not mypage.Contents:
            mypage.Contents = PdfDict(stream='')
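For anyone who wants the same workaround as a complete script, here is a minimal sketch. The names add_missing_contents, make_empty_stream and repair_file are made up for illustration and are not part of pdfrw; only PdfReader, PdfWriter and PdfDict come from the library. It assumes the only problem with the file is pages that lack /Contents.

```python
def add_missing_contents(pages, make_empty_stream):
    """Give every page without /Contents an empty content stream.

    Hypothetical helper: `pages` is any iterable of page objects with a
    `Contents` attribute, `make_empty_stream` builds the replacement.
    Returns the number of pages that were patched.
    """
    fixed = 0
    for page in pages:
        if not page.Contents:
            page.Contents = make_empty_stream()
            fixed += 1
    return fixed


def repair_file(src, dst):
    """Read src, patch pages that lack /Contents, and write the result to dst."""
    # Imported here so the helper above can be used without pdfrw installed.
    from pdfrw import PdfDict, PdfReader, PdfWriter

    reader = PdfReader(src)
    fixed = add_missing_contents(reader.pages, lambda: PdfDict(stream=''))
    PdfWriter(dst, trailer=reader).write()
    return fixed
```

Calling repair_file('blank-page-1.pdf', 'fixed.pdf') should then write a copy in which every page has a (possibly empty) content stream; the output file name is just an example.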
That's really good to know. I will contact the company for the software and submit a bug report. And you're right, most viewers that I've tried still handle the problematic files just fine. Interestingly, when I run the file through Ghostscript's pdf2pdf, it adds a content stream (not sure what it actually is since it's compressed), so it's no longer a problem.
In any case, what you've mentioned is perfect for my needs with the recent addition of more decryption/decompression functionality. For any files that were problems before that I needed to run through pdf2pdf or the like, I no longer need to if I read in the files with decrypt=True. So, thanks for helping that get into the core! This library is amazingly helpful!
By the way, why is decrypt=False by default? Most PDFs are encrypted/compressed in some way, so why not make it True by default? Just curious.
Thank you for the explanation.
Can the above fix be incorporated into buildxobj.py? I've just tested it out on all the files I have with blank pages + missing /Contents streams and it works beautifully. It would be very nice to not have to do this externally, but will do if needed. Thanks, again!
@tisimst It could probably be incorporated in buildxobj with minimal speed impact, so it might be worthwhile investigating that, but I'm busy now (but patches are welcome :)
As far as why decrypt is False by default, there are a lot of useful things you can do with PDFs without decrypting them. See, for instance, all the example scripts...
Well, there's a bunch of things that NOT having this feature made challenging for me. Notably, when a document's security settings only allow you to print, many of the example scripts fail. With decrypt=True, however, they all work wonderfully. You're right, however, that many files work just fine without decrypting them. I can't tell you how amazing the speed of this library is (even with decrypt=True). It makes batch-processing entire folders of files an absolutely joyful experience. Thanks for creating it!
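For reference, opting in to decryption is a single keyword argument on PdfReader. A small sketch, where open_pdf and its reader parameter are hypothetical conveniences (the reader hook only exists so the default can be swapped out), not pdfrw API:

```python
def open_pdf(path, decrypt=True, reader=None):
    """Open a PDF, asking the reader to decrypt it unless told otherwise."""
    if reader is None:
        # Lazy import: pdfrw is only needed when no custom reader is passed.
        from pdfrw import PdfReader
        reader = PdfReader
    return reader(path, decrypt=decrypt)
```

This simply flips the library default for your own scripts, so that print-only or otherwise restricted files are read with decrypt=True unless you say otherwise.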
As for the patch, I'll see what I can do. Should be easy to incorporate.
I am attempting to merge PDFs and resize the pages. When I add a blank page, I get an error. Here is the code that I have:
Here is the error I am seeing:
I believe the error is occurring because the page is blank and does not have any contents to be copied over. Here is the example PDF that I am using: tmp.pdf