mfenniak / pyPdf

Pure-Python PDF Library; this repository is no longer maintained, please see https://github.com/knowah/PyPDF2/ instead.

Microsoft Reporting Service workaround #23

Open ghost opened 13 years ago

ghost commented 13 years ago

hey folks :)

on some files generated by Microsoft Reporting Service i get one of the following errors using this script:


from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("infile.pdf", "rb"))

output.addPage(input1.getPage(0))

outputStream = file("outfile.pdf", "wb")

output.write(outputStream)

Traceback (most recent call last):
  File "/backup/print/municipality stara zagora/110228/Aitos_1/test.py", line 20, in <module>
    output.write(outputStream)
  .....
  File "/usr/local/lib/python2.6/site-packages/pyPdf/generic.py", line 232, in readFromStream
    return NumberObject(name)
ValueError: invalid literal for int() with base 10: ''

or using another approach (loading pages in array and then saving them):

Traceback (most recent call last):
  File "/backup/print/municipality stara zagora/110228/municipality stara zagora pdf combine 110228 start.py", line 60, in <module>
    outpdf.write(outfile)
  .....
  File "/usr/local/lib/python2.6/site-packages/pyPdf/pdf.py", line 545, in getObject
    self.stream.seek(start, 0)
ValueError: I/O operation on closed file

where the file is (of course) not closed

i worked around it by re-saving the file with pdftk like this:


from pyPdf import PdfFileWriter, PdfFileReader

import shlex, subprocess
pdftkcommand = 'pdftk infile.pdf cat output fixed_infile.pdf'
args = shlex.split(pdftkcommand)
subprocess.call(args)

output = PdfFileWriter()
input1 = PdfFileReader(file("fixed_infile.pdf", "rb"))

output.addPage(input1.getPage(0))

outputStream = file("outfile.pdf", "wb")

output.write(outputStream)

but this only works with the latest pdftk version (1.44); 1.41 produces a blank pdf. i guess this is what the pdftk guys fixed:

1.43 - September 30, 2010
Fixed a stream parsing bug that was causing page content to disappear after merge of PDFs generated by Microsoft Reporting Services PDF Rendering Extension 10.0.0.0.

unfortunately i can't provide the broken file as contents are confidential

hope this helps :)

georgi

ghost commented 13 years ago

i don't know why the formatting broke - i copy-pasted pure text :( also i can provide the full traceback if needed

johnwhitington commented 10 years ago

I just put a workaround into CamlPDF to fix the same problem.

The malformation is that files produced by Microsoft Reporting Services put a space character immediately after the 'stream' keyword (before the CR / LF).

The solution is, after reading the stream keyword, to consume all whitespace-characters-other-than-cr-or-lf before looking for the newline as normal.
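
A minimal Python sketch of that approach, for illustration only. `skip_stream_eol` is a hypothetical helper name, not a function in pyPdf or CamlPDF, and it assumes the caller has just read the 'stream' keyword from a seekable binary file object:

```python
import io

# Whitespace that may be skipped after 'stream': space, tab, form feed, NUL.
# CR and LF are deliberately excluded because they form the real end-of-line.
_SKIPPABLE = (b" ", b"\t", b"\x0c", b"\x00")


def skip_stream_eol(stream):
    """Leave `stream` positioned at the first byte of the stream data.

    Hypothetical helper: assumes the b"stream" keyword was just consumed.
    """
    # Consume stray whitespace other than CR / LF (the Reporting Services quirk).
    ch = stream.read(1)
    while ch in _SKIPPABLE:
        ch = stream.read(1)

    if ch == b"\r":
        # Normal case is CR LF; if no LF follows, put the byte back.
        nxt = stream.read(1)
        if nxt and nxt != b"\n":
            stream.seek(-1, 1)
    elif ch and ch != b"\n":
        # No end-of-line at all: put the byte back so no data is lost.
        stream.seek(-1, 1)


# Usage: a well-formed marker and the malformed one end up at the same place.
good = io.BytesIO(b"\r\nDATA")
bad = io.BytesIO(b" \r\nDATA")
skip_stream_eol(good)
skip_stream_eol(bad)
print(good.read() == bad.read())
```

The trade-off is that a stream whose data genuinely begins with a space right after a bare 'stream' keyword (no newline at all) would lose that byte, but such a file is already malformed.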