Open wrkhenddher opened 7 years ago
Hello, Henddher.
Compression and decompression filters have mostly been added as people find issues and offer patches. My use cases for this library do not normally require decompression, so I haven't studied that very much.
So patches or pull requests would be great, but I can't just glance at your PDF object and tell you anything about how it was compressed.
Thanks, Pat
Thank you @pmaupin
I will continue to investigate and share what I find here, in the hope that a second pair of eyes and a good understanding of the PDF spec will help.
The main thing that threw me off was Predictor 12 in the object declaration, while scanline 32 carries filter code 4 (Paeth).
It seems that this is perfectly valid:
For LZWDecode and FlateDecode, a Predictor value greater than or equal to 10 merely indicates that a PNG predictor is in use; the specific predictor function used is explicitly encoded in the incoming data. The value of Predictor supplied by the decoding filter need not match the value used when the data was encoded if they are both greater than or equal to 10.
Here is the Predictor in the object declaration I am dealing with:
/Filter/FlateDecode/DecodeParms<</Predictor 12/Columns 6>>
And here is the output of the snippet I attached:
0 row_index: 0 offset: 0
1 row_index: 7 offset: 0
2 row_index: 14 offset: 0
3 row_index: 21 offset: 2
4 row_index: 28 offset: 2
5 row_index: 35 offset: 0
6 row_index: 42 offset: 0
7 row_index: 49 offset: 0
8 row_index: 56 offset: 0
9 row_index: 63 offset: 0
10 row_index: 70 offset: 0
11 row_index: 77 offset: 2
12 row_index: 84 offset: 2
13 row_index: 91 offset: 2
14 row_index: 98 offset: 2
15 row_index: 105 offset: 2
16 row_index: 112 offset: 2
17 row_index: 119 offset: 2
18 row_index: 126 offset: 1
19 row_index: 133 offset: 2
20 row_index: 140 offset: 0
21 row_index: 147 offset: 0
22 row_index: 154 offset: 0
23 row_index: 161 offset: 2
24 row_index: 168 offset: 2
25 row_index: 175 offset: 2
26 row_index: 182 offset: 2
27 row_index: 189 offset: 2
28 row_index: 196 offset: 2
29 row_index: 203 offset: 2
30 row_index: 210 offset: 2
31 row_index: 217 offset: 0
32 row_index: 224 offset: 4 <<<
33 row_index: 231 offset: 2
34 row_index: 238 offset: 2
35 row_index: 245 offset: 2
36 row_index: 252 offset: 2
37 row_index: 259 offset: 2
38 row_index: 266 offset: 2
39 row_index: 273 offset: 2
40 row_index: 280 offset: 2
41 row_index: 287 offset: 0
42 row_index: 294 offset: 0
43 row_index: 301 offset: 2
44 row_index: 308 offset: 2
45 row_index: 315 offset: 2
46 row_index: 322 offset: 2
47 row_index: 329 offset: 2
48 row_index: 336 offset: 2
49 row_index: 343 offset: 0
50 row_index: 350 offset: 0
51 row_index: 357 offset: 0
52 row_index: 364 offset: 2
53 row_index: 371 offset: 2
54 row_index: 378 offset: 2
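A listing like the one above could be produced by a sketch along these lines. It assumes Python 3, one byte per column (the PDF defaults of 1 color and 8 bits per component), and that the value printed under "offset" is the per-row PNG filter tag, which is what the scanline 32 remark suggests. The name `list_row_tags` and the sample bytes are hypothetical and not taken from the attached snippet.

```python
def list_row_tags(data, columns):
    """Yield (row_number, byte_offset, tag) for PNG-predicted data.

    Assumes one byte per column, so every row occupies 1 tag byte
    followed by `columns` data bytes.
    """
    stride = columns + 1
    for row, offset in enumerate(range(0, len(data), stride)):
        yield row, offset, data[offset]


# Hypothetical sample: three rows of 6 columns tagged Up (2), None (0), Paeth (4).
sample = bytes([2] + [0] * 6 + [0] + [1] * 6 + [4] + [9] * 6)

for row, offset, tag in list_row_tags(sample, columns=6):
    # Mirrors the layout of the listing: row number, byte offset, tag.
    print(row, "row_index:", offset, "offset:", tag)
```

With Columns 6 the stride is 7 bytes, which matches the 0, 7, 14, ... progression in the listing.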
According to the PDF 1.7 spec, section 3.3 (LZW and Flate Predictor Functions):
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
The second supported group of predictor functions, the PNG group, consists of the filters of the World Wide Web Consortium’s Portable Network Graphics recommendation, documented in Internet RFC 2083, PNG (Portable Network Graphics) Specification (see the Bibliography). The term predictors is used here instead of filters to avoid confusion. There are five basic PNG predictor algorithms (and a sixth that chooses the optimum predictor function separately for each row):
PREDICTOR | DESCRIPTION |
---|---|
None | No prediction |
Sub | Predicts the same as the sample to the left |
Up | Predicts the same as the sample above |
Average | Predicts the average of the sample to the left and the sample above |
Paeth | A nonlinear function of the sample above, the sample to the left, and the sample to the upper left |
The predictor algorithm to be used, if any, is indicated by the Predictor filter parameter (see Table 3.7), which can have any of the values listed in Table 3.8.
TABLE 3.8 Predictor values
VALUE | MEANING |
---|---|
1 | No prediction (the default value) |
2 | TIFF Predictor 2 |
10 | PNG prediction (on encoding, PNG None on all rows) |
11 | PNG prediction (on encoding, PNG Sub on all rows) |
12 | PNG prediction (on encoding, PNG Up on all rows) |
13 | PNG prediction (on encoding, PNG Average on all rows) |
14 | PNG prediction (on encoding, PNG Paeth on all rows) |
15 | PNG prediction (on encoding, PNG optimum) |
For LZWDecode and FlateDecode, a Predictor value greater than or equal to 10 merely indicates that a PNG predictor is in use; the specific predictor function used is explicitly encoded in the incoming data. The value of Predictor supplied by the decoding filter need not match the value used when the data was encoded if they are both greater than or equal to 10.
The two groups of predictor functions have some commonalities. Both make the following assumptions:
• Data is presented in order, from the top row to the bottom row and, within a row, from left to right.
• A row occupies a whole number of bytes, rounded up if necessary.
• Samples and their components are packed into bytes from high-order to low-order bits.
• All color components of samples outside the image (which are necessary for predictions near the boundaries) are 0.
The predictor function groups also differ in significant ways:
• The postprediction data for each PNG-predicted row begins with an explicit algorithm tag; therefore, different rows can be predicted with different algorithms to improve compression. TIFF Predictor 2 has no such identifier; the same algorithm applies to all rows.
• The TIFF function group predicts each color component from the prior instance of that component, taking into account the number of bits per component and components per sample. In contrast, the PNG function group predicts each byte of data as a function of the corresponding byte of one or more previous image samples, regardless of whether there are multiple color components in a byte or whether a single color component spans multiple bytes. This can yield significantly better speed at the cost of somewhat worse compression.
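To make the quoted rules concrete, here is a minimal sketch of reversing PNG prediction row by row, dispatching on each row's tag byte rather than on the declared Predictor value. It is not pdfrw's implementation. It assumes Python 3 and that `columns` equals the number of bytes per row (one byte per sample, so `bpp=1`); the names `paeth` and `png_unpredict` are hypothetical.

```python
def paeth(a, b, c):
    """Paeth predictor: a = left, b = above, c = upper left."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def png_unpredict(data, columns, bpp=1):
    """Reverse PNG prediction; each row is a tag byte plus `columns` bytes."""
    stride = columns + 1
    if len(data) % stride:
        raise ValueError("data length is not a whole number of rows")
    prev = bytearray(columns)  # samples outside the image count as 0
    out = bytearray()
    for start in range(0, len(data), stride):
        tag = data[start]
        row = bytearray(data[start + 1:start + stride])
        for i in range(columns):
            a = row[i - bpp] if i >= bpp else 0   # left, already decoded
            b = prev[i]                           # above
            c = prev[i - bpp] if i >= bpp else 0  # upper left
            if tag == 0:
                pred = 0
            elif tag == 1:
                pred = a
            elif tag == 2:
                pred = b
            elif tag == 3:
                pred = (a + b) // 2
            elif tag == 4:
                pred = paeth(a, b, c)
            else:
                raise ValueError("unknown PNG filter tag %d" % tag)
            row[i] = (row[i] + pred) & 0xFF
        out += row
        prev = row
    return bytes(out)
```

Because the tag is read from the data itself, a stream declared with Predictor 12 can still contain a Paeth row without any special handling.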
Hello @pmaupin
I implemented 3 PNG filters: Up (2), Average (3) and Paeth (4). I am not sure, but I think the existing Sub filter implementation might have been used for Up as well, which might not produce the correct results. I would like to find a set of sample "bitmaps" that I can use to test all filters. Of course, if you have such a set available (with input and expected output), that would be perfect.
I ran it against my objectstream.base64 and it processed all scanlines.
I will push a PR so you can take a look.
Thanks for this library of yours!
This is the reference I used for implementing the filters: http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html
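In the absence of ready-made sample bitmaps, one option is to generate test vectors: take known raw rows, apply the PNG filters in the encoding direction, and use the raw rows as the expected output of the decoder under test. The sketch below assumes Python 3 and one byte per sample; `predict`, `png_filter` and `make_case` are hypothetical names. Vectors produced this way only check the decoder against this encoder, so they complement rather than replace data produced by an independent tool.

```python
import random


def predict(tag, a, b, c):
    """PNG prediction from left (a), above (b) and upper left (c)."""
    if tag == 0:
        return 0
    if tag == 1:
        return a
    if tag == 2:
        return b
    if tag == 3:
        return (a + b) // 2
    if tag == 4:
        p = a + b - c
        pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
        return a if pa <= pb and pa <= pc else (b if pb <= pc else c)
    raise ValueError("unknown PNG filter tag %d" % tag)


def png_filter(rows, tags):
    """Encode equal-length raw rows, one filter tag per row."""
    out = bytearray()
    prev = bytes(len(rows[0]))  # row above the first row is all zeros
    for raw, tag in zip(rows, tags):
        out.append(tag)
        for i, value in enumerate(raw):
            a = raw[i - 1] if i else 0
            c = prev[i - 1] if i else 0
            out.append((value - predict(tag, a, prev[i], c)) & 0xFF)
        prev = raw
    return bytes(out)


def make_case(columns=6, seed=0):
    """Return (expected_rows, tags, encoded) covering all five filters."""
    rng = random.Random(seed)
    tags = [0, 1, 2, 3, 4, 4, 2]
    rows = [bytes(rng.randrange(256) for _ in range(columns)) for _ in tags]
    return rows, tags, png_filter(rows, tags)
```

A test would then feed `encoded` to the decoder and compare the result with `b"".join(expected_rows)`.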
Thanks for looking into this, but sorry -- I don't know that I have any bitmaps. I might, if any of the files in https://github.com/pmaupin/static_pdfs have a bitmap, but I haven't looked.
Pull requests are certainly welcome -- even if it's not perfect, it might be closer, right?
Thanks @pmaupin
Indeed! I hope this addition will move it closer.
I'll try the static PDFs and see what happens.
@pmaupin
I am having issues running the test-cases ...
Ran 437 tests in 62.324s
FAILED (SKIP=56, errors=49, failures=4)
I did this:
$ cd pdfrw/tests
$ git clone https://github.com/pmaupin/static_pdfs
$ nosetests *
Am I doing this correctly?
Also, after "running" them, expected.txt changes as well.
diff --git a/tests/expected.txt b/tests/expected.txt
index b1b7cca..247179b 100644
--- a/tests/expected.txt
+++ b/tests/expected.txt
@@ -223,3 +223,5 @@ compress/d6fd9567078b48c86710e9c49173781f.pdf cbc8922b8bea08928463b287767ec229
compress/e9ab02aa769f4c040a6fa52f00d6e3f0.pdf e893e407b3c2366d4ca822ce80b45c2c
compress/ec00d5825f47b9d0faa953b1709163c3.pdf 9ba3db0dedec74c3d2a6f033f1b22a81
compress/ed81787b83cc317c9f049643b853bea3.pdf 2ceda401f68a44a3fb1da4e0f9dfc578
+compress/9f98322c243fe67726d56ccfa8e0885b.pdf !f6fcfba92c923f6abf9d374553013493
+decompress/9f98322c243fe67726d56ccfa8e0885b.pdf !4897109976e903a3b3d77af4c5e47d87
I "stashed" my changes and verified that the baseline also produces many errors.
For one thing, I don't use nose. I only mention this because expected.txt should not normally change on its own. Both unittest and py.test work for me.
But the primary problem is probably that the tests exercise the installed version of pdfrw rather than the copy you are working on. This is because, in regression, I want to make sure that pdfrw installs properly and that the installed version is what gets tested.
So for testing at my workstation, I go into the tests directory and do ln -s ../pdfrw
Sorry about not documenting that properly.
Thanks, Pat
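A quick way to see which copy of a package the tests would pick up is to ask Python where it would import the module from. This is a small sketch, assuming Python 3.4 or later for `importlib.util.find_spec`; the helper name `module_location` is hypothetical.

```python
import importlib.util
import os


def module_location(name):
    """Return the directory a top-level module would load from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(os.path.abspath(spec.origin))


# Run from the tests directory: a path under site-packages means the
# installed copy is being tested, while a path under the working tree
# means the symlinked checkout is in use. Prints None if pdfrw is absent.
print(module_location("pdfrw"))
```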
Thank you @pmaupin
That did it!
(python -m unittest didn't work for me but pytest did)
$ pwd
/Users/henddher/Documents/python_workspace/pdfrw_playground/pdfrw/tests
$ ln -s ../pdfrw
$ ls -l pdfrw
lrwxr-xr-x 1 henddher staff 8 Oct 4 16:17 pdfrw -> ../pdfrw
$ pytest
=================================================================================================================================== test session starts ===================================================================================================================================
platform darwin -- Python 2.7.11, pytest-3.2.3, py-1.4.34, pluggy-0.4.0
rootdir: /Users/henddher/Documents/python_workspace/pdfrw_playground/pdfrw, inifile:
collected 215 items
test_examples.py .......FFFF...
test_pdfdict.py .
test_pdfreader_init.py ..
test_pdfstring.py ..........
test_roundtrip.py .....s..................s.....s.....sss....s........s..................s.....s.....sss....s........s..................s.....s.....sss....s........s..................s.....s.....sss....s...
======================================================================================================================================== FAILURES =========================================================================================================================================
...
==================================================================================================================== 4 failed, 183 passed, 28 skipped in 32.50 seconds ====================================================================================================================
I am getting 4 failures running against the master branch, though (8774f15b1189657e5c30079b4d658284660ceadc)
Thoughts?
Here is one of the failures:
test_examples.py .......F
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
self = <tests.test_examples.TestOnePdf testMethod=test_rl1_4up>
    def test_rl1_4up(self):
        if sys.version_info < (2, 7):
            return
        self.do_test('rl1/4up b1c400de699af29ea3f1983bb26870ab',
>                    scrub=True)
../../../test_examples.py:171:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../test_examples.py:99: in do_test
PdfWriter(dstf).addpages(PdfReader(scrub).pages).write()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {}, fname = '4up.b1c400de699af29ea3f1983bb26870ab.pdf', fdata = None, decompress = False, decrypt = False, password = '', disable_gc = True, verbose = True
    def __init__(self, fname=None, fdata=None, decompress=False,
                 decrypt=False, password='', disable_gc=True, verbose=True):
        self.private.verbose = verbose
        # Runs a lot faster with GC off.
        disable_gc = disable_gc and gc.isenabled()
        if disable_gc:
            gc.disable()
        try:
            if fname is not None:
                assert fdata is None
                # Allow reading preexisting streams like pyPdf
                if hasattr(fname, 'read'):
                    fdata = fname.read()
                else:
                    try:
                        f = open(fname, 'rb')
                        fdata = f.read()
                        f.close()
                    except IOError:
                        raise PdfParseError('Could not read PDF file %s' %
>                                           fname)
E                       PdfParseError: Could not read PDF file 4up.b1c400de699af29ea3f1983bb26870ab.pdf
../../../pdfrw/pdfreader.py:573: PdfParseError
It seems like the four that are failing (off master) are related to static pdf b1c400de699af29ea3f1983bb26870ab.pdf
That particular one is a reportlab test (see the rl1 in the testname). If the others are as well, it may indicate an unsupported version of reportlab or a missing version of reportlab on your computer.
Thanks, Pat
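Since a missing reportlab surfaced above only as an unrelated-looking "Could not read PDF file" error, one option is to make the dependency explicit so the affected tests are reported as skipped. The sketch below is not how pdfrw's test suite is written; it assumes Python 3, and `have_module` and `ReportlabExamples` are hypothetical names.

```python
import importlib.util
import unittest


def have_module(name):
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


@unittest.skipUnless(have_module("reportlab"), "reportlab is not installed")
class ReportlabExamples(unittest.TestCase):
    """Placeholder for tests that need reportlab to generate their input."""

    def test_reportlab_importable(self):
        import reportlab
        self.assertTrue(reportlab.__name__)
```

With this guard, a run on a machine without reportlab shows the skip reason instead of a parse error.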
That was it!
I didn't know reportlab was required. I pip installed it and now the master branch passes all tests.
I will add these notes to the README.md test part.
Thanks again!
Back in my branch, I've got 10 failures - which is good because I did change code :P
# Failures
TestOnePdf.test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf
TestOnePdf.test_compress_35df0b8cff4afec0c08f08c6a5bc9857.pdf
TestOnePdf.test_compress_9f98322c243fe67726d56ccfa8e0885b.pdf
TestOnePdf.test_decompress_2ac7c68e26a8ef797aead15e4875cc6d.pdf
TestOnePdf.test_decompress_35df0b8cff4afec0c08f08c6a5bc9857.pdf
TestOnePdf.test_decompress_9f98322c243fe67726d56ccfa8e0885b.pdf
TestOnePdf.test_repaginate_2ac7c68e26a8ef797aead15e4875cc6d.pdf
TestOnePdf.test_repaginate_35df0b8cff4afec0c08f08c6a5bc9857.pdf
TestOnePdf.test_simple_2ac7c68e26a8ef797aead15e4875cc6d.pdf
TestOnePdf.test_simple_35df0b8cff4afec0c08f08c6a5bc9857.pdf
Will keep you posted!
$ pytest
=================================================================================================================================== test session starts ===================================================================================================================================
platform darwin -- Python 2.7.11, pytest-3.2.3, py-1.4.34, pluggy-0.4.0
rootdir: /Users/henddher/Documents/python_workspace/pdfrw_playground/pdfrw, inifile:
collected 215 items
test_examples.py ..............
test_pdfdict.py .
test_pdfreader_init.py ..
test_pdfstring.py ..........
test_roundtrip.py .....s.....F...F........s.....s..F..sss....s........s.....F...F........s.....s..F..sss....s........s.....F...F........s.....s.....sss....s........s.....F...F........s.....s.....sss....s...
======================================================================================================================================== FAILURES =========================================================================================================================================
______________________________________________________________________________________________________________ TestOnePdf.test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf ______________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>
    def loadindirect(self, key, PdfDict=PdfDict,
                     isinstance=isinstance):
        result = self.indirect_objects.get(key)
        if not isinstance(result, PdfIndirect):
            return result
        source = self.source
        offset = int(self.source.obj_offsets.get(key, '0'))
        if not offset:
            source.warning("Did not find PDF object %s", key)
            return None
        # Read the object header and validate it
        objnum, gennum = key
        source.floc = offset
        objid = source.multiple(3)
        ok = len(objid) == 3
        ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
        ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
        ok = ok and objid[2] == 'obj'
        if not ok:
            source.floc = offset
>           source.next()
E           StopIteration
../../pdfrw/pdfreader.py:201: StopIteration
______________________________________________________________________________________________________________ TestOnePdf.test_compress_35df0b8cff4afec0c08f08c6a5bc9857.pdf ______________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_compress_35df0b8cff4afec0c08f08c6a5bc9857.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
return self.get(PdfName(name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'/Length': '92', '/Size': '36', '/DecodeParms': {'/Columns': '4', '/Predictor...Type': '/Catalog', '/Outlines': (7, 0), '/Metadata': (3, 0)}, '/Info': (12, 0)}, key = '/Info', dictget = <method 'get' of 'dict' objects>, isinstance = <built-in function isinstance>
PdfIndirect = <class 'tests.pdfrw.objects.pdfindirect.PdfIndirect'>
    def get(self, key, dictget=dict.get, isinstance=isinstance,
            PdfIndirect=PdfIndirect):
        ''' Get a value out of the dictionary,
            after resolving any indirect objects.
        '''
        value = dictget(self, key)
        if isinstance(value, PdfIndirect):
            # We used to use self[key] here, but that does an
            # unwanted check on the type of the key (github issue #98).
            # Python will keep the old key object in the dictionary,
            # so that check is not necessary.
            value = value.real_value()
            if value is not None:
                dict.__setitem__(self, key, value)
            else:
>               del self[name]
E               NameError: global name 'name' is not defined
../../pdfrw/objects/pdfdict.py:147: NameError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
[WARNING] tokens.py:221 Did not find PDF object (12, 0) (line=27, col=1, token='endobj')
______________________________________________________________________________________________________________ TestOnePdf.test_compress_9f98322c243fe67726d56ccfa8e0885b.pdf ______________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_compress_9f98322c243fe67726d56ccfa8e0885b.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:98: in roundtrip
self.assertEqual(hash, expects)
E AssertionError: 'f6fcfba92c923f6abf9d374553013493' != '3167fa11a3f1f4a06f90294b21e101b7'
E - f6fcfba92c923f6abf9d374553013493
E + 3167fa11a3f1f4a06f90294b21e101b7
_____________________________________________________________________________________________________________ TestOnePdf.test_decompress_2ac7c68e26a8ef797aead15e4875cc6d.pdf _____________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_decompress_2ac7c68e26a8ef797aead15e4875cc6d.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>
    def loadindirect(self, key, PdfDict=PdfDict,
                     isinstance=isinstance):
        result = self.indirect_objects.get(key)
        if not isinstance(result, PdfIndirect):
            return result
        source = self.source
        offset = int(self.source.obj_offsets.get(key, '0'))
        if not offset:
            source.warning("Did not find PDF object %s", key)
            return None
        # Read the object header and validate it
        objnum, gennum = key
        source.floc = offset
        objid = source.multiple(3)
        ok = len(objid) == 3
        ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
        ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
        ok = ok and objid[2] == 'obj'
        if not ok:
            source.floc = offset
>           source.next()
E           StopIteration
../../pdfrw/pdfreader.py:201: StopIteration
_____________________________________________________________________________________________________________ TestOnePdf.test_decompress_35df0b8cff4afec0c08f08c6a5bc9857.pdf _____________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_decompress_35df0b8cff4afec0c08f08c6a5bc9857.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
return self.get(PdfName(name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'/Length': '92', '/Size': '36', '/DecodeParms': {'/Columns': '4', '/Predictor...Type': '/Catalog', '/Outlines': (7, 0), '/Metadata': (3, 0)}, '/Info': (12, 0)}, key = '/Info', dictget = <method 'get' of 'dict' objects>, isinstance = <built-in function isinstance>
PdfIndirect = <class 'tests.pdfrw.objects.pdfindirect.PdfIndirect'>
    def get(self, key, dictget=dict.get, isinstance=isinstance,
            PdfIndirect=PdfIndirect):
        ''' Get a value out of the dictionary,
            after resolving any indirect objects.
        '''
        value = dictget(self, key)
        if isinstance(value, PdfIndirect):
            # We used to use self[key] here, but that does an
            # unwanted check on the type of the key (github issue #98).
            # Python will keep the old key object in the dictionary,
            # so that check is not necessary.
            value = value.real_value()
            if value is not None:
                dict.__setitem__(self, key, value)
            else:
>               del self[name]
E               NameError: global name 'name' is not defined
../../pdfrw/objects/pdfdict.py:147: NameError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
[WARNING] tokens.py:221 Did not find PDF object (12, 0) (line=27, col=1, token='endobj')
_____________________________________________________________________________________________________________ TestOnePdf.test_decompress_9f98322c243fe67726d56ccfa8e0885b.pdf _____________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_decompress_9f98322c243fe67726d56ccfa8e0885b.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:98: in roundtrip
self.assertEqual(hash, expects)
E AssertionError: '4897109976e903a3b3d77af4c5e47d87' != '4b53b63b0779b81d8f9569e66ca3d8ee'
E - 4897109976e903a3b3d77af4c5e47d87
E + 4b53b63b0779b81d8f9569e66ca3d8ee
_____________________________________________________________________________________________________________ TestOnePdf.test_repaginate_2ac7c68e26a8ef797aead15e4875cc6d.pdf _____________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_repaginate_2ac7c68e26a8ef797aead15e4875cc6d.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>
    def loadindirect(self, key, PdfDict=PdfDict,
                     isinstance=isinstance):
        result = self.indirect_objects.get(key)
        if not isinstance(result, PdfIndirect):
            return result
        source = self.source
        offset = int(self.source.obj_offsets.get(key, '0'))
        if not offset:
            source.warning("Did not find PDF object %s", key)
            return None
        # Read the object header and validate it
        objnum, gennum = key
        source.floc = offset
        objid = source.multiple(3)
        ok = len(objid) == 3
        ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
        ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
        ok = ok and objid[2] == 'obj'
        if not ok:
            source.floc = offset
>           source.next()
E           StopIteration
../../pdfrw/pdfreader.py:201: StopIteration
_____________________________________________________________________________________________________________ TestOnePdf.test_repaginate_35df0b8cff4afec0c08f08c6a5bc9857.pdf _____________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_repaginate_35df0b8cff4afec0c08f08c6a5bc9857.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
return self.get(PdfName(name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'/Length': '92', '/Size': '36', '/DecodeParms': {'/Columns': '4', '/Predictor...Type': '/Catalog', '/Outlines': (7, 0), '/Metadata': (3, 0)}, '/Info': (12, 0)}, key = '/Info', dictget = <method 'get' of 'dict' objects>, isinstance = <built-in function isinstance>
PdfIndirect = <class 'tests.pdfrw.objects.pdfindirect.PdfIndirect'>
    def get(self, key, dictget=dict.get, isinstance=isinstance,
            PdfIndirect=PdfIndirect):
        ''' Get a value out of the dictionary,
            after resolving any indirect objects.
        '''
        value = dictget(self, key)
        if isinstance(value, PdfIndirect):
            # We used to use self[key] here, but that does an
            # unwanted check on the type of the key (github issue #98).
            # Python will keep the old key object in the dictionary,
            # so that check is not necessary.
            value = value.real_value()
            if value is not None:
                dict.__setitem__(self, key, value)
            else:
>               del self[name]
E               NameError: global name 'name' is not defined
../../pdfrw/objects/pdfdict.py:147: NameError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
[WARNING] tokens.py:221 Did not find PDF object (12, 0) (line=27, col=1, token='endobj')
_______________________________________________________________________________________________________________ TestOnePdf.test_simple_2ac7c68e26a8ef797aead15e4875cc6d.pdf _______________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_simple_2ac7c68e26a8ef797aead15e4875cc6d.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>
def loadindirect(self, key, PdfDict=PdfDict,
isinstance=isinstance):
result = self.indirect_objects.get(key)
if not isinstance(result, PdfIndirect):
return result
source = self.source
offset = int(self.source.obj_offsets.get(key, '0'))
if not offset:
source.warning("Did not find PDF object %s", key)
return None
# Read the object header and validate it
objnum, gennum = key
source.floc = offset
objid = source.multiple(3)
ok = len(objid) == 3
ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
ok = ok and objid[2] == 'obj'
if not ok:
source.floc = offset
> source.next()
E StopIteration
../../pdfrw/pdfreader.py:201: StopIteration
_______________________________________________________________________________________________________________ TestOnePdf.test_simple_35df0b8cff4afec0c08f08c6a5bc9857.pdf _______________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_simple_35df0b8cff4afec0c08f08c6a5bc9857.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
return self.get(PdfName(name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'/Length': '92', '/Size': '36', '/DecodeParms': {'/Columns': '4', '/Predictor...Type': '/Catalog', '/Outlines': (7, 0), '/Metadata': (3, 0)}, '/Info': (12, 0)}, key = '/Info', dictget = <method 'get' of 'dict' objects>, isinstance = <built-in function isinstance>
PdfIndirect = <class 'tests.pdfrw.objects.pdfindirect.PdfIndirect'>
def get(self, key, dictget=dict.get, isinstance=isinstance,
PdfIndirect=PdfIndirect):
''' Get a value out of the dictionary,
after resolving any indirect objects.
'''
value = dictget(self, key)
if isinstance(value, PdfIndirect):
# We used to use self[key] here, but that does an
# unwanted check on the type of the key (github issue #98).
# Python will keep the old key object in the dictionary,
# so that check is not necessary.
value = value.real_value()
if value is not None:
dict.__setitem__(self, key, value)
else:
> del self[name]
E NameError: global name 'name' is not defined
../../pdfrw/objects/pdfdict.py:147: NameError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
[WARNING] tokens.py:221 Did not find PDF object (12, 0) (line=27, col=1, token='endobj')
=================================================================================================================== 10 failed, 177 passed, 28 skipped in 32.64 seconds ====================================================================================================================
No dice :(
I tried running all test_roundtrip cases with a breakpoint at pdfrw/uncompress:32
diff --git a/tests/test_roundtrip.py b/tests/test_roundtrip.py
index a8349a6..c878101 100755
--- a/tests/test_roundtrip.py
+++ b/tests/test_roundtrip.py
@@ -109,6 +109,7 @@ class TestOnePdf(unittest.TestCase):
def build_tests():
+ import pdb; pdb.set_trace()
def test_closure(*args, **kw):
def test(self):
self.roundtrip(*args, **kw)
pdfrw/uncompress.py uncompress() function is never called.
I must be missing something when it comes to the test cases and the mismatched md5s.
Help!
BTW - PR is up but it's not ready for code-review because of the failing tests.
1) Obviously I screwed up the fix for #98 -- there shouldn't be an error right there.
2) If I recall correctly, some of the tests use subprocess, so the process where uncompress() is called might not be the one that you set the breakpoint in.
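To illustrate the subprocess point, here is a minimal sketch (not taken from the pdfrw test suite; `child_pid` is a made-up helper) showing that code launched through `subprocess` runs in a different process, so a `pdb.set_trace()` placed in the parent never fires there:

```python
import os
import subprocess
import sys


def child_pid():
    """Run a tiny Python program in a child process and return its PID."""
    out = subprocess.check_output(
        [sys.executable, "-c", "import os; print(os.getpid())"])
    return int(out.decode().strip())


parent = os.getpid()
child = child_pid()
# The two PIDs differ: a breakpoint set in the parent is not hit by the child.
print("parent pid: %d, child pid: %d" % (parent, child))
```

To debug code that only runs in the child, the breakpoint has to go into a module the child itself imports.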
Bingo. You are correct about the subprocess. I added the breakpoint to uncompress and I am about to pdb it :)
I'll look at #98 changes. I assume you mean they are affecting the del self[name] NameError: global name 'name' is not defined ones.
Hi @pmaupin
I haven't had much luck with regression.
I don't think #98 is wrong. From what I can tell, it looks ok to me.
What did you mean by right there? where?
> there shouldn't be an error right there.
Never mind :) name doesn't exist in that context :P
(I committed that fix)
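For anyone reading along, this is a minimal sketch of what the traceback points at. `Indirect` and `ResolvingDict` are hypothetical stand-ins for `PdfIndirect` and `PdfDict`, and using `key` in the `del` is my assumption about the obvious fix, since `key` is the parameter that `get()` actually receives:

```python
class Indirect(object):
    """Hypothetical stand-in for PdfIndirect: resolves to a stored value."""

    def __init__(self, value):
        self._value = value

    def real_value(self):
        return self._value


class ResolvingDict(dict):
    """Hypothetical stand-in for PdfDict, reduced to the get() logic."""

    def get(self, key):
        value = dict.get(self, key)
        if isinstance(value, Indirect):
            value = value.real_value()
            if value is not None:
                # Cache the resolved object in place of the indirect one.
                dict.__setitem__(self, key, value)
            else:
                # The traceback shows `del self[name]`, but `name` is not
                # defined in get(); `key` is (assumed fix).
                del self[key]
        return value


d = ResolvingDict()
d['/Info'] = Indirect(None)          # object that could not be found
d['/Root'] = Indirect({'/Type': '/Catalog'})
print(d.get('/Info'), '/Info' in d)  # missing object is dropped from the dict
print(d.get('/Root'))
```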
I guess that the fact that all tests passed using master and then some fail using my branch puts the blame on my code. No?
Some failures are MD5 mismatches but there are actual failures processing the PDFs.
The weird thing is why they fail "somewhere" else. The only reason I can think of is that I am overwriting the PDF data during flate_png.
I will try to test flate_png alone against some sample data. I will compare results with yours. That will tell what's happening - hopefully. I am still in doubt about your 2 filters (1 and 2) doing Up filtering.
Hi @pmaupin
Today, I came back to this.
I hope you can take a quick look and shed some light.
It turned out that all failing tests are throwing StopIteration from loadindirect.
It seems like the output from uncompress (after calling flate_png) produces data arrays that are larger and, since uncompress is being called with leave_raw = True, the original obj.stream is being replaced with the uncompressed, larger flate_png data array.
I wonder: if obj.stream is replaced with a longer array, could that throw off some offset calculation so that higher layers in the library read bytes at the wrong location?
### uncompress.py
l0 = len(data)
if 10 <= predictor <= 15:
    data, error = flate_png(data, predictor, columns, colors, bpc)
    l1 = len(data)
    try:
        assert error == None
        assert l1 == l0, "Mismatch len %d vs %d" % (l0, l1)
    except Exception as e:
        print "Exception", e
        pass
elif predictor != 1:
    error = ('Unsupported flatedecode predictor %s' %
             repr(predictor))
if error is None:
    assert not dco.unconsumed_tail
    if dco.unused_data.strip():
        error = ('Unconsumed compression data: %s' %
                 repr(dco.unused_data[:20]))
if error is None:
    obj.Filter = None
    obj.stream = data if leave_raw else convert_load(data)
Here is the output of one of the test_roundtrip tests:
______________________________________________________________________________________________________________ TestOnePdf.test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf ______________________________________________________________________________________________________________
self = <tests.test_roundtrip.TestOnePdf testMethod=test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf>
def test(self):
> self.roundtrip(*args, **kw)
../../test_roundtrip.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>
def loadindirect(self, key, PdfDict=PdfDict,
isinstance=isinstance):
result = self.indirect_objects.get(key)
if not isinstance(result, PdfIndirect):
return result
source = self.source
offset = int(self.source.obj_offsets.get(key, '0'))
if not offset:
source.warning("Did not find PDF object %s", key)
return None
# Read the object header and validate it
objnum, gennum = key
source.floc = offset
objid = source.multiple(3)
ok = len(objid) == 3
ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
ok = ok and objid[2] == 'obj'
if not ok:
source.floc = offset
> source.next()
E StopIteration
../../pdfrw/pdfreader.py:201: StopIteration
---------------------------------------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------------------------------------
leave_raw True
Exception Mismatch len 65 vs 52
leave_raw True
Exception Mismatch len 28 vs 21
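One possible reading of the two "Mismatch len" lines: with a PNG predictor every encoded row carries one extra filter-type byte, so the decoded data should be shorter than the input rather than the same length. The sketch below is only arithmetic; `png_predictor_sizes` is a made-up helper, and Columns of 4 and 3 with colors=1 and bpc=8 are assumptions that happen to fit the numbers, not values confirmed for these streams.

```python
def png_predictor_sizes(encoded_len, columns, colors=1, bpc=8):
    """Return (rows, decoded_len) for PNG-predicted data of encoded_len bytes."""
    rowbytes = (columns * colors * bpc + 7) // 8  # payload bytes per row
    stride = rowbytes + 1                         # plus one filter-type byte
    if encoded_len % stride:
        raise ValueError("length %d is not a multiple of stride %d"
                         % (encoded_len, stride))
    rows = encoded_len // stride
    return rows, rows * rowbytes


print(png_predictor_sizes(65, columns=4))  # (13, 52)
print(png_predictor_sizes(28, columns=3))  # (7, 21)
```

If that reading is right, 65 vs 52 and 28 vs 21 are the expected sizes, and the `l1 == l0` assertion is what does not hold.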
That seems unlikely.
If I were debugging this, I would probably ask something like pdftk to create an uncompressed copy of the PDF, verify that the uncompressed copy is correct (displays correctly, etc.), and would then verify that my decompression matched that of the other tool.
Thanks @pmaupin
I figured out an issue with the new implementation of flate_png. When reconstructing the pixels, I was using the unmodified (still filtered) previous scanline instead of the reconstructed one.
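A minimal sketch of that point using the Up filter (type 2), written from the PNG spec rather than copied from the PR; `unfilter_up` is a name made up for illustration. The byte above must come from the already reconstructed previous row:

```python
def unfilter_up(rows):
    """Undo the PNG Up filter for a list of filtered rows.

    Each row is a list of ints with the leading filter-type byte removed.
    """
    prev = [0] * len(rows[0]) if rows else []
    out = []
    for filt in rows:
        # Recon(x) = Filt(x) + Recon(b), where b is the byte directly above
        # in the reconstructed previous row (all zeros for the first row).
        recon = [(f + b) & 0xFF for f, b in zip(filt, prev)]
        out.append(recon)
        prev = recon  # carry the reconstructed row, not the filtered one
    return out


filtered = [[10, 20], [1, 1], [1, 1]]
print(unfilter_up(filtered))  # [[10, 20], [11, 21], [12, 22]]
```

Using the filtered row as `prev` instead would give `[2, 2]` for the last row, which is the kind of error described above.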
With that, I am now passing almost all cases except 3:
These 2 existing test cases are failing ("The Time Machine" pdf):
____ TestOnePdf.test_compress_9f98322c243fe67726d56ccfa8e0885b.pdf ____
____ TestOnePdf.test_decompress_9f98322c243fe67726d56ccfa8e0885b.pdf ____
When I look at the generated PDFs, I can see that the DP logo in the first page is rendered incorrectly (some scanlines bleed yellow horizontally). That looks like an incorrect sub filter ...
The strange thing is that when I compare the output of flate_png_orig with flate_png, it seems like the original was not applying the inverse of the filter correctly (assuming I am understanding the spec correctly and interpreting the code correctly).
Take a look at my stand alone flate_png test for Sub filter (filter 1):
TestFlatePNG.test_flate_png_filter_1
# nc: Number of columns
# nr: Number of rows
# bpc: bits per component
# ncolors: Number of colors
e.g. RGBA: 8bit per channel, 4 channels => 32bits => pixel size = 4bytes
# nc=2, nr=3, bpc=8, ncolors=4
ERROR:root:Data: array('B', [0, 0, 4, 8, 12, 16, 20, 24, 28, 1, 8, 12, 16, 20, 24, 28, 32, 36, 1, 16, 20, 24, 28, 32, 36, 40, 44])
ERROR:root: 0 0
ERROR:root: 4 4
ERROR:root: 8 8
ERROR:root: 12 12
ERROR:root: 16 16
ERROR:root: 20 20
ERROR:root: 24 24
ERROR:root: 28 28
ERROR:root: 9 8 <<
ERROR:root: 21 12
ERROR:root: 37 16
ERROR:root: 57 20
...
Type | Name | Filter Function | Reconstruction Function |
---|---|---|---|
1 | Sub | Filt(x) = Orig(x) - Orig(a) | Recon(x) = Filt(x) + Recon(a) |
According to the spec, https://www.w3.org/TR/2003/REC-PNG-20031110/#9FtIntro, the reconstruction is the addition of the current filtered value and the corresponding byte of the pixel to the left. Given that this is the 1st pixel of the scanline, all bytes of the pixel to its left are supposed to be 0.
See <<: 9 was produced by adding the 1st byte of the first pixel of the second scanline (data[10]) to the filter byte (data[9]).
But I think it should have been data[10] = data[10] + 0, which is 8.
I think it all stems from pixel size. In this case, it is supposed to be 4 bytes but the original flate was always subtracting 1 from the current index which essentially meant that the pixel size was always 1 byte.
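To make the pixel-size point concrete, here is a minimal Sub reconstruction following the PNG spec formula Recon(x) = Filt(x) + Recon(a). It is a sketch, not the code from the PR, and `unfilter_sub` is a made-up name. It is run on the second scanline of the test data above, once with 4 bytes per pixel and once with 1:

```python
def unfilter_sub(filt, bpp):
    """Undo the PNG Sub filter (type 1) on one row, given bytes per pixel."""
    recon = []
    for i, f in enumerate(filt):
        # Byte `a` is the corresponding byte of the pixel to the left;
        # it is taken as 0 for the first pixel of the scanline.
        a = recon[i - bpp] if i >= bpp else 0
        recon.append((f + a) & 0xFF)
    return recon


row = [8, 12, 16, 20, 24, 28, 32, 36]  # second scanline, filter byte removed
print(unfilter_sub(row, bpp=4))  # [8, 12, 16, 20, 32, 40, 48, 56]
print(unfilter_sub(row, bpp=1))  # [8, 20, 36, 56, 80, 108, 140, 176]
```

The bpp=4 result matches the right-hand column of the test output. The left-hand column (9, 21, 37, ...) is the bpp=1 result with the filter byte (1) also added in, which is consistent with the explanation above.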
Either I am misunderstanding the specification or I am not understanding what columns, colors and bpc are in the flate_png function.
# uncompress.py
def flate_png(data, predictor=1, columns=1, colors=1, bpc=8):
Isn't columns the number of pixels in a single scanline?
Isn't colors the number of channels?
Isn't bpc the number of bits per channel?
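Assuming the answer to all three questions is yes (that is how I read the Columns, Colors and BitsPerComponent entries of DecodeParms), the per-row layout can be sketched as below. `row_layout` is a hypothetical helper, not a pdfrw function, and the rounding follows the PNG spec's definition of bytes per pixel:

```python
def row_layout(columns, colors=1, bpc=8):
    """Return (bpp, rowbytes) for one scanline, excluding the filter byte."""
    bits_per_pixel = colors * bpc
    # Bytes per complete pixel, rounded up to at least 1 (PNG filter rule).
    bpp = max(1, (bits_per_pixel + 7) // 8)
    # Data bytes per scanline, rounded up to a whole byte.
    rowbytes = (columns * bits_per_pixel + 7) // 8
    return bpp, rowbytes


# The test case above: nc=2, ncolors=4, bpc=8 (RGBA, 8 bits per channel).
print(row_layout(columns=2, colors=4, bpc=8))  # (4, 8)
```

With 8 data bytes plus 1 filter byte per row, 3 rows come to 27 bytes, which matches the length of the Data array in the test output.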
Here is the full output:
The left is the output from flate_png_orig and the right is the new flate_png, byte by byte.
$ pytest -k test_flate
=================================================================================================================================== test session starts ===================================================================================================================================
platform darwin -- Python 2.7.11, pytest-3.2.3, py-1.4.34, pluggy-0.4.0
rootdir: /Users/henddher/Documents/python_workspace/pdfrw_playground/pdfrw, inifile:
collected 221 items
test_flate_png.py ..F...
======================================================================================================================================== FAILURES =========================================================================================================================================
__________________________________________________________________________________________________________________________ TestFlatePNG.test_flate_png_filter_1 ___________________________________________________________________________________________________________________________
self = <tests.test_flate_png.TestFlatePNG testMethod=test_flate_png_filter_1>
def test_flate_png_filter_1(self):
# Sub filter
data, nc, nr, bpc, ncolors = create_data(nc=2, nr=3, bpc=8, ncolors=4, filter_type=1)
d1, error1 = flate_png_orig(data, 12, nc, ncolors, bpc)
data, nc, nr, bpc, ncolors = create_data(nc=2, nr=3, bpc=8, ncolors=4, filter_type=1)
d2, error2 = flate_png(data, 12, nc, ncolors, bpc)
print_data(d1, d2)
> assert d1 == d2
E AssertionError: assert '\x00\x04\x08...y\x9d\xc5\xf1' == '\x00\x04\x08\...4\x18\x1c08@H'
E - \x00\x04\x08\x0c\x10\x14\x18\x1c\t\x15%9Qm\x8d\xb1\x11%=Yy\x9d\xc5\xf1
E + \x00\x04\x08\x0c\x10\x14\x18\x1c\x08\x0c\x10\x14 (08\x10\x14\x18\x1c08@H
test_flate_png.py:79: AssertionError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
ERROR:root:Data: array('B', [0, 0, 4, 8, 12, 16, 20, 24, 28, 1, 8, 12, 16, 20, 24, 28, 32, 36, 1, 16, 20, 24, 28, 32, 36, 40, 44])
ERROR:root:Data: array('B', [0, 0, 4, 8, 12, 16, 20, 24, 28, 1, 8, 12, 16, 20, 24, 28, 32, 36, 1, 16, 20, 24, 28, 32, 36, 40, 44])
ERROR:root: 0 0
ERROR:root: 4 4
ERROR:root: 8 8
ERROR:root: 12 12
ERROR:root: 16 16
ERROR:root: 20 20
ERROR:root: 24 24
ERROR:root: 28 28
ERROR:root: 9 8
ERROR:root: 21 12
ERROR:root: 37 16
ERROR:root: 57 20
ERROR:root: 81 32
ERROR:root: 109 40
ERROR:root: 141 48
ERROR:root: 177 56
ERROR:root: 17 16
ERROR:root: 37 20
ERROR:root: 61 24
ERROR:root: 89 28
ERROR:root: 121 48
ERROR:root: 157 56
ERROR:root: 197 64
ERROR:root: 241 72
================================================================================================================================== 215 tests deselected ===================================================================================================================================
=================================================================================================================== 1 failed, 5 passed, 215 deselected in 0.18 seconds ====================================================================================================================
What do you think?
What am I missing?
Hi @pmaupin
I am starting to suspect that the aforementioned PDF might be incorrect. It's the only PDF that looks bad after the new flate_png implementation.
This is the current output I am obtaining:
Instead of:
The reason I am arriving at this conclusion is that I added tests for each filter implementation and they all produce the expected output.
The test-cases were produced by running the built-in tests of the official PNG lib (libpng-1.6.34). I modified the code of the library to output each scanline (row of pixels) before and after reverse filtering.
/* pngrutil.c */
void /* PRIVATE */
png_read_filter_row(png_structrp pp, png_row_infop row_info, png_bytep row,
png_const_bytep prev_row, int filter)
{
/* OPTIMIZATION: DO NOT MODIFY THIS FUNCTION, instead #define
* PNG_FILTER_OPTIMIZATIONS to a function that overrides the generic
* implementations. See png_init_filter_functions above.
*/
printf("width = %d\n", row_info->width);
printf("bit_depth = %d\n", row_info->bit_depth);
printf("channels = %d\n", row_info->channels);
printf("color_type = %d\n", row_info->color_type);
printf("pixel_depth = %d\n", row_info->pixel_depth);
printf("rowbytes = %zu\n", row_info->rowbytes);
printf("filter = %d\n", filter);
printf("data = [ ");
for (int i = 0; i < row_info->rowbytes; i++) {
printf("0x%.2x,", row[i] & 0xff);
}
printf(" ]\n");
if (filter > PNG_FILTER_VALUE_NONE && filter < PNG_FILTER_VALUE_LAST)
{
if (pp->read_filter[0] == NULL)
png_init_filter_functions(pp);
pp->read_filter[filter-1](row_info, row, prev_row);
}
else
png_debug1(1, "no filter %d", filter);
printf("expected = [ ");
for (int i = 0; i < row_info->rowbytes; i++) {
printf("0x%.2x,", row[i] & 0xff);
}
printf(" ]\n");
}
I took sections of the output with reasonable bit depths and number of channels to compose the test_flate_png_alt_* tests in test_flate_png.py.
I am pretty certain that the new filters work correctly - as they are verified against the tests in the PNG lib.
The only outstanding issue is the difference in length of the data coming out of flate_png.
I suspect the conversion that happens at the end, since that depends on the Python version.
https://github.com/pmaupin/pdfrw/pull/114/files#diff-4a788690953b8096e62f68e4f7e69471R185
Thoughts? Advice?
Hi @pmaupin
🎉 🎉 Good news 🎉 🎉
I refactored my implementation of filters (much simpler now) and added numerous tests and it all looks fine now. Including the questionable aforementioned PDF.
As you can see now, the produced PDF looks correct, even though its checksum differs (which makes sense since the original sub filter seemed to be incorrect).
I created an IPython Notebook (./tests/Render Bitmap.py) that helped me tons in visualizing the rasters produced by the flate_png_impl function.
IMHO, the only thing left to do is to remove the original flate_png_orig function and update the checksum of the aforementioned PDF.
If you can think of other PDFs (already available in your pool of static_pdf) that perhaps failed before due to filtering, we could turn those ON so they are added to the tests.
Please take a look at the whole thing - I am pretty certain it's all good functionality-wise 😎
PR #114
That's awesome! I don't have time to look right at the moment, but this should be a good addition for the next release. And you're right; it is probably time to see if other PDFs now work.
Thanks, Pat
After modifying test_roundtrip.py, I was finally able to let the encrypted PDFs be tested. It turned out that they didn't produce good output: they all came up as "blank" PDFs, although their page counts were correct. In summary, I ended up reverting expected.txt to skip all the encrypted PDFs.
At this point, I think I am pretty much done with all the changes I first intended 🎉 😄
Take a look at the PR and let me know.
Hi @pmaupin
I just noticed that PR #114 was merged but this issue was never marked as Closed.
Also, there has not been any release since 9/2017, so the FlateDecode for PNGs has never been available to anyone (unless they do pip install -e git+https://github.com/pmaupin/pdfrw.git@6c892160e7e976b243db0c12c3e56ed8c78afc5a#egg=pdfrw).
When do you think you'll be able to release a new version?
Yeah, I think we got distracted by you not being able to do the test repo.
Thanks for bugging me! I'll bump it to the top of the list.
Pat
We are encountering this error too. Has there been any progress?
I am also running into this issue with one of my projects.
I'm running into this issue as well. Pdfrw fails when running against a PDF which includes our company logo. If I remove this sheet it runs as expected. I'd rather have an updated PDF library than modify the PDF.
We switched to a commercial pdf library, and have had no problems. It is unfortunate that this defect is not being addressed, because it makes it impossible to trust the library or developer with production code.
Having the same issue of "unsupported PNG filter 4" on a couple PDFs from our suppliers.
There is no workaround; we switched to C# and used a commercial product from Foxit... :(
Doug Bower VP Product Bid Retriever 216-612-4870 Let us retriever your project!
Did you try latest master? AFAIK, a PR addressing it was merged but there was never a new release
@Henddher - Yes I did end up pulling the latest master and that solved it. I'm a little confused as to why a new release wasn't created, but yes, that did solve the issue. I specifically needed the filter.py and utils.py (for anyone reading this later).
Glad that worked!
I really don’t know why @pmaupin never did a new release. It’s been many years since I fixed this issue.
Perhaps someone is willing to push a new release from an alternative fork to PyPI? I don’t know.
(Back then) there was the alternative of installing straight from a commit in the repo using pip, but I don’t recall the syntax :( pip -e or something like that
Thanks for letting me know there was a fix! I didn't see that.
Doug
Doug Bower VP Product Bid Retriever 216-612-4870 Let us retriever your project!
Still an issue in 2024
Hello @pmaupin
I am investigating an issue we frequently encounter with certain PDFs.
Basically, the PDF has an obj with stream that appears to have at least one scanline using the paeth filter.
This is a snippet of the PDF:
I have isolated the error to pdfrw.uncompress.flate_png. I attached sample code and data to reproduce.
Notice the passed in predictor value is 12. Yet, when processing scanline
I could contribute by adding the filter if such is applicable in this case.
Do you think the sample code and data actually demonstrate that the issue is the filter type 4 (i.e. paeth)?
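For reference, filter type 4 (Paeth) as defined in the PNG specification can be sketched as follows. This is a minimal illustration of the spec's reconstruction rule, not pdfrw code; `paeth_predictor` and `unfilter_paeth` are names chosen here for the example.

```python
def paeth_predictor(a, b, c):
    """Pick whichever of left (a), above (b), upper-left (c) is nearest a+b-c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def unfilter_paeth(filt, prev, bpp):
    """Undo the Paeth filter on one row; prev is the reconstructed row above."""
    recon = []
    for i, f in enumerate(filt):
        a = recon[i - bpp] if i >= bpp else 0  # left
        b = prev[i]                            # above
        c = prev[i - bpp] if i >= bpp else 0   # upper left
        recon.append((f + paeth_predictor(a, b, c)) & 0xFF)
    return recon


print(unfilter_paeth([1, 1, 1], prev=[10, 20, 30], bpp=1))  # [11, 21, 31]
```

For the first scanline, `prev` would be a row of zeros of the same length.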
flate_png.py.txt
objstream.base64.txt
I added a print statement after calculating the offset for each row and this is what I get: