pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

Unsupported PNG filter 4 #113

Open wrkhenddher opened 7 years ago

wrkhenddher commented 7 years ago

Hello @pmaupin

I am investigating an issue we frequently encounter with certain PDFs.

Basically, the PDF has an object with a stream that appears to have at least one scanline using the Paeth filter.

This is a snippet of the PDF:

53 0 obj<</Size 55/Info 1 0 R/ID[<831e2b61230407b512d5c1d0c7c32cdd><ed25aaf71ebfd91394dad2315ecffae8>]/Root 2 0 R/Type/XRef/Index[0 55]/W[1 3 2]/Filter/FlateDecode/DecodeParms<</Predictor 12/Columns 6>>/Length 139>>
stream
x<9c><9d>P1^NÂ@^L³Ó"<84>DAíÀÀB§ö^AH¼^L¶Îí3x^L^_á^A<95><98>+<84>®ÉqÙ)·8>ÇV^RØ^K^A^B\´"°^GD¡^E%^@g$ML{Å^VÞ^Tâgfìí,<97><98>Æß<81><94>U×<99><9b><83>§lÁlópV,ÎL^@®+Ý(ç4^A<87>ÿSv÷Æg9-ñ<95>õè¾&^]ù«éu<8f>×<9e>Öùy^B3íÛ^Tñ
endstream
endobj

I have isolated the error to pdfrw.uncompress.flate_png. I have attached sample code and data to reproduce it.

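from __future__ import print_function  # keeps the prints valid on Python 2 and 3

import base64

from pdfrw.uncompress import flate_png

# Load the attached base64-encoded sample data.
with open('objstream.base64', 'rb') as f:
    data = base64.b64decode(f.read())

print(len(data))
predictor, columns, colors, bpc = (12, 6, 1, 8)
print(predictor, columns, colors, bpc)
# The second return value carries the error message, if any.
d2, error = flate_png(data, predictor, columns, colors, bpc)
print(error)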

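Notice that the predictor value passed in is 12. Yet, when processing the scanline at row_index 224, the filter code read from the data is 4, and decoding stops with "Unsupported PNG filter 4" (see the output further down).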

I could contribute by adding the filter, if that is what is needed in this case.

Do you think the sample code and data actually demonstrate that the issue is the filter type 4 (i.e. paeth)?

flate_png.py.txt

objstream.base64.txt

I added a print statement after calculating the offset for each row and this is what I get:

row_index: 0 offset: 0
row_index: 7 offset: 0
row_index: 14 offset: 0
row_index: 21 offset: 2
row_index: 28 offset: 2
row_index: 35 offset: 0
row_index: 42 offset: 0
row_index: 49 offset: 0
row_index: 56 offset: 0
row_index: 63 offset: 0
row_index: 70 offset: 0
row_index: 77 offset: 2
row_index: 84 offset: 2
row_index: 91 offset: 2
row_index: 98 offset: 2
row_index: 105 offset: 2
row_index: 112 offset: 2
row_index: 119 offset: 2
row_index: 126 offset: 1
row_index: 133 offset: 2
row_index: 140 offset: 0
row_index: 147 offset: 0
row_index: 154 offset: 0
row_index: 161 offset: 2
row_index: 168 offset: 2
row_index: 175 offset: 2
row_index: 182 offset: 2
row_index: 189 offset: 2
row_index: 196 offset: 2
row_index: 203 offset: 2
row_index: 210 offset: 2
row_index: 217 offset: 0
row_index: 224 offset: 4
Unsupported PNG filter 4

pmaupin commented 7 years ago

Hello, Henddher.

Compression and decompression filters have mostly been added as people find issues and offer patches. My use cases for this library do not normally require decompression, so I haven't studied that very much.

So patches or pull requests would be great, but I can't just glance at your PDF object and tell you anything about how it was compressed.

Thanks, Pat

wrkhenddher commented 7 years ago

Thank you @pmaupin

I will continue to investigate and share what I find here, in the hope that a second pair of eyes and a good understanding of the PDF spec will help.

What initially threw me off was that the declared predictor is 12 (Up), yet scanline 32 carries filter code 4 (Paeth).

It seems that this is perfectly valid:

For LZWDecode and FlateDecode, a Predictor value greater than or equal to 10 merely indicates that a PNG predictor is in use; the specific predictor function used is explicitly encoded in the incoming data. The value of Predictor supplied by the decoding filter need not match the value used when the data was encoded if they are both greater than or equal to 10.

Here is the Predictor in the object declaration I am dealing with:

/Filter/FlateDecode/DecodeParms<</Predictor 12/Columns 6>>

And the output of the snippet I attached.

0 row_index: 0 offset: 0
1 row_index: 7 offset: 0
2 row_index: 14 offset: 0
3 row_index: 21 offset: 2
4 row_index: 28 offset: 2
5 row_index: 35 offset: 0
6 row_index: 42 offset: 0
7 row_index: 49 offset: 0
8 row_index: 56 offset: 0
9 row_index: 63 offset: 0
10 row_index: 70 offset: 0
11 row_index: 77 offset: 2
12 row_index: 84 offset: 2
13 row_index: 91 offset: 2
14 row_index: 98 offset: 2
15 row_index: 105 offset: 2
16 row_index: 112 offset: 2
17 row_index: 119 offset: 2
18 row_index: 126 offset: 1
19 row_index: 133 offset: 2
20 row_index: 140 offset: 0
21 row_index: 147 offset: 0
22 row_index: 154 offset: 0
23 row_index: 161 offset: 2
24 row_index: 168 offset: 2
25 row_index: 175 offset: 2
26 row_index: 182 offset: 2
27 row_index: 189 offset: 2
28 row_index: 196 offset: 2
29 row_index: 203 offset: 2
30 row_index: 210 offset: 2
31 row_index: 217 offset: 0
32 row_index: 224 offset: 4      <<<
33 row_index: 231 offset: 2
34 row_index: 238 offset: 2
35 row_index: 245 offset: 2
36 row_index: 252 offset: 2
37 row_index: 259 offset: 2
38 row_index: 266 offset: 2
39 row_index: 273 offset: 2
40 row_index: 280 offset: 2
41 row_index: 287 offset: 0
42 row_index: 294 offset: 0
43 row_index: 301 offset: 2
44 row_index: 308 offset: 2
45 row_index: 315 offset: 2
46 row_index: 322 offset: 2
47 row_index: 329 offset: 2
48 row_index: 336 offset: 2
49 row_index: 343 offset: 0
50 row_index: 350 offset: 0
51 row_index: 357 offset: 0
52 row_index: 364 offset: 2
53 row_index: 371 offset: 2
54 row_index: 378 offset: 2
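
A listing like the one above can be produced with a few lines of Python. The following is only a sketch, not the code from the attachment: the function name `png_row_tags` and the sample bytes are made up for illustration, and it assumes the input has already been inflated. It also assumes that the value the log labels "offset" is the per-row algorithm tag, and that row_index advances by 7 because each row holds one tag byte plus Columns (6) data bytes.

```python
def png_row_tags(data, columns, colors=1, bpc=8):
    """Return (row_number, byte_offset, tag) for each PNG-predicted row."""
    # Bytes of sample data per row, rounded up to a whole byte.
    row_bytes = (columns * colors * bpc + 7) // 8
    stride = row_bytes + 1  # one leading tag byte per row
    data = bytearray(data)  # indexing yields ints on Python 2 and 3
    return [(n, start, data[start])
            for n, start in enumerate(range(0, len(data), stride))]


# Three hypothetical rows for Columns=6 with tags 0 (None), 2 (Up), 4 (Paeth).
sample = bytearray([0] + [1] * 6 + [2] + [0] * 6 + [4] + [9] * 6)
rows = png_row_tags(sample, columns=6)
for n, start, tag in rows:
    print(n, "row_index:", start, "offset:", tag)
```

Scanning the tags first makes it easy to see which filter types a given stream actually uses before touching the decoder.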

According to the PDF 1.7 spec, section 3.3 (LZW and Flate Predictor Functions):

http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

The second supported group of predictor functions, the PNG group, consists of the filters of the World Wide Web Consortium’s Portable Network Graphics recommendation, documented in Internet RFC 2083, PNG (Portable Network Graphics) Specification (see the Bibliography). The term predictors is used here instead of filters to avoid confusion. There are five basic PNG predictor algorithms (and a sixth that chooses the optimum predictor function separately for each row):

None     No prediction
Sub      Predicts the same as the sample to the left
Up       Predicts the same as the sample above
Average  Predicts the average of the sample to the left and the sample above
Paeth    A nonlinear function of the sample above, the sample to the left, and the sample to the upper left

The predictor algorithm to be used, if any, is indicated by the Predictor filter parameter (see Table 3.7), which can have any of the values listed in Table 3.8.

TABLE 3.8 Predictor values

VALUE MEANING
1 No prediction (the default value)
2 TIFF Predictor 2
10 PNG prediction (on encoding, PNG None on all rows)
11 PNG prediction (on encoding, PNG Sub on all rows)
12 PNG prediction (on encoding, PNG Up on all rows)
13 PNG prediction (on encoding, PNG Average on all rows)
14 PNG prediction (on encoding, PNG Paeth on all rows)
15 PNG prediction (on encoding, PNG optimum)

For LZWDecode and FlateDecode, a Predictor value greater than or equal to 10 merely indicates that a PNG predictor is in use; the specific predictor function used is explicitly encoded in the incoming data. The value of Predictor supplied by the decoding filter need not match the value used when the data was encoded if they are both greater than or equal to 10.

The two groups of predictor functions have some commonalities. Both make the following assumptions:

• Data is presented in order, from the top row to the bottom row and, within a row, from left to right.
• A row occupies a whole number of bytes, rounded up if necessary.
• Samples and their components are packed into bytes from high-order to low-order bits.
• All color components of samples outside the image (which are necessary for predictions near the boundaries) are 0.

The predictor function groups also differ in significant ways:

• The postprediction data for each PNG-predicted row begins with an explicit algorithm tag; therefore, different rows can be predicted with different algorithms to improve compression. TIFF Predictor 2 has no such identifier; the same algorithm applies to all rows.
• The TIFF function group predicts each color component from the prior instance of that component, taking into account the number of bits per component and components per sample. In contrast, the PNG function group predicts each byte of data as a function of the corresponding byte of one or more previous image samples, regardless of whether there are multiple color components in a byte or whether a single color component spans multiple bytes. This can yield significantly better speed at the cost of somewhat worse compression.
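
To make the quoted rules concrete, here is a minimal sketch of undoing PNG prediction row by row. It is not pdfrw's flate_png and not the code from the eventual PR; the names `paeth` and `png_unpredict` are invented here. It assumes the data has already been inflated, dispatches on each row's tag byte rather than on the declared Predictor value, and treats samples outside the image as 0.

```python
def paeth(a, b, c):
    """Paeth predictor: a = left, b = above, c = upper left."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    return b if pb <= pc else c


def png_unpredict(data, columns, colors=1, bpc=8):
    """Undo PNG prediction on already-inflated data; returns a bytearray."""
    row_bytes = (columns * colors * bpc + 7) // 8
    bpp = (colors * bpc + 7) // 8      # byte distance to the "left" sample
    stride = row_bytes + 1             # tag byte + row data
    data = bytearray(data)
    out = bytearray()
    prev = bytearray(row_bytes)        # the row above the first row is all 0
    for start in range(0, len(data), stride):
        tag = data[start]
        row = data[start + 1:start + stride]
        for i in range(len(row)):
            a = row[i - bpp] if i >= bpp else 0    # left, already decoded
            b = prev[i]                            # above
            c = prev[i - bpp] if i >= bpp else 0   # upper left
            if tag == 0:
                pred = 0
            elif tag == 1:
                pred = a
            elif tag == 2:
                pred = b
            elif tag == 3:
                pred = (a + b) // 2
            elif tag == 4:
                pred = paeth(a, b, c)
            else:
                raise ValueError("Unsupported PNG filter %d" % tag)
            row[i] = (row[i] + pred) & 0xFF
        out += row
        prev = row
    return out


# Four hypothetical 3-byte rows tagged Sub, Up, Paeth and Average.
encoded = bytearray([1, 10, 5, 5, 2, 1, 1, 1, 4, 1, 1, 1, 3, 0, 0, 0])
decoded = png_unpredict(encoded, columns=3)
print(list(decoded))
```

The per-byte loop favours readability over speed; a production decoder would likely special-case each tag outside the inner loop.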

wrkhenddher commented 7 years ago

Hello @pmaupin

I implemented three PNG filters: Up (2), Average (3) and Paeth (4). I am not sure, but I think the existing Sub filter implementation may also have been used for Up, which might not produce correct results. I would like to find a set of sample "bitmaps" that I can use to test all filters. Of course, if you have such a set available (with input and expected output), that would be perfect.

I ran it against my objstream.base64 and it processed all scanlines.

I will push a PR so you can take a look.

Thanks for this library of yours!

This is the reference I used for implementing the filters: http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html
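
In the absence of ready-made sample bitmaps, one option is to generate test vectors by applying each filter type to known raw data, so that input and expected output are both available. The sketch below is an assumption about how that could look, not part of pdfrw: `png_predict` and `paeth` are hypothetical helpers following the filter definitions in the PNG reference linked above.

```python
def paeth(a, b, c):
    """Paeth predictor: a = left, b = above, c = upper left."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    return b if pb <= pc else c


def png_predict(raw, columns, tag, colors=1, bpc=8):
    """Apply one PNG filter type (0..4) to every row of raw sample data."""
    if tag not in (0, 1, 2, 3, 4):
        raise ValueError("Unsupported PNG filter %d" % tag)
    row_bytes = (columns * colors * bpc + 7) // 8
    bpp = (colors * bpc + 7) // 8
    raw = bytearray(raw)
    out = bytearray()
    prev = bytearray(row_bytes)        # the row above the first row is all 0
    for start in range(0, len(raw), row_bytes):
        row = raw[start:start + row_bytes]
        out.append(tag)                # every row starts with its tag byte
        for i in range(len(row)):
            a = row[i - bpp] if i >= bpp else 0    # left (raw value)
            b = prev[i]                            # above (raw value)
            c = prev[i - bpp] if i >= bpp else 0   # upper left (raw value)
            pred = (0, a, b, (a + b) // 2, paeth(a, b, c))[tag]
            out.append((row[i] - pred) & 0xFF)
        prev = row
    return out


# Two raw rows of three bytes each; one fixture per filter type.
raw = bytearray([10, 15, 20, 11, 16, 21])
fixtures = dict((tag, png_predict(raw, 3, tag)) for tag in range(5))
for tag in sorted(fixtures):
    print(tag, list(fixtures[tag]))
```

Each fixture, fed to a decoder with Columns 3, should come back as the original `raw` bytes, which gives a round-trip check for every filter type.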

pmaupin commented 7 years ago

Thanks for looking into this, but sorry -- I don't know that I have any bitmaps. I might, if any of the files in https://github.com/pmaupin/static_pdfs have a bitmap, but I haven't looked.

Pull requests are certainly welcome -- even if it's not perfect, it might be closer, right?

wrkhenddher commented 7 years ago

Thanks @pmaupin

Indeed! I hope this addition will move it closer.

I'll try the static PDFs and see what happens.

wrkhenddher commented 7 years ago

@pmaupin

I am having issues running the test-cases ...

Ran 437 tests in 62.324s

FAILED (SKIP=56, errors=49, failures=4)

I did this:

$ cd pdfrw/tests
$ git clone https://github.com/pmaupin/static_pdfs
$ nosetests *

Am I doing this correctly?

Also, after "running" them, expected.txt changes as well.

diff --git a/tests/expected.txt b/tests/expected.txt
index b1b7cca..247179b 100644
--- a/tests/expected.txt
+++ b/tests/expected.txt
@@ -223,3 +223,5 @@ compress/d6fd9567078b48c86710e9c49173781f.pdf cbc8922b8bea08928463b287767ec229
 compress/e9ab02aa769f4c040a6fa52f00d6e3f0.pdf e893e407b3c2366d4ca822ce80b45c2c
 compress/ec00d5825f47b9d0faa953b1709163c3.pdf 9ba3db0dedec74c3d2a6f033f1b22a81
 compress/ed81787b83cc317c9f049643b853bea3.pdf 2ceda401f68a44a3fb1da4e0f9dfc578
+compress/9f98322c243fe67726d56ccfa8e0885b.pdf !f6fcfba92c923f6abf9d374553013493
+decompress/9f98322c243fe67726d56ccfa8e0885b.pdf !4897109976e903a3b3d77af4c5e47d87

I "stashed" my changes and verified that the baseline also produces many errors.

pmaupin commented 7 years ago

For one thing I don't use nose. I only mention this because expected.txt should not normally change on its own. Both unittest and py.test work for me.

But the primary problem is probably that the tests test the installed version of pdfrw rather than what you are working on. This is because, in regression, I want to make sure that pdfrw installs properly and the installed version is what is tested.

So for testing at my workstation, I go into the tests directory and do ln -s ../pdfrw
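
A quick way to confirm which copy of a package the tests will pick up is to look at the file it is imported from. This is a generic sketch rather than anything in the pdfrw test suite: the helper names `module_location` and `is_under` are hypothetical, and it is demonstrated on a standard-library module so that it runs anywhere. Substitute "pdfrw" and the path of your checkout.

```python
import importlib
import os


def module_location(name):
    """Return the absolute path of the file a module is imported from."""
    module = importlib.import_module(name)
    return os.path.abspath(module.__file__)


def is_under(path, directory):
    """True if path lies inside directory, after resolving symlinks."""
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    return path.startswith(directory.rstrip(os.sep) + os.sep)


# Demonstrated with a stdlib module; use "pdfrw" in a real checkout.
location = module_location("json")
print(location)
print(is_under(location, os.path.dirname(os.path.dirname(location))))
```

Because `is_under` resolves symlinks, a `tests/pdfrw` link created with ln -s reports the working copy it points to, not the link itself.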

Sorry about not documenting that properly.

Thanks, Pat

wrkhenddher commented 7 years ago

Thank you @pmaupin

That did it!

(python -m unittest didn't work for me but pytest did)

$ pwd
/Users/henddher/Documents/python_workspace/pdfrw_playground/pdfrw/tests

$ ln -s ../pdfrw
$ ls -l pdfrw
lrwxr-xr-x  1 henddher  staff  8 Oct  4 16:17 pdfrw -> ../pdfrw

$ pytest 
=================================================================================================================================== test session starts ===================================================================================================================================
platform darwin -- Python 2.7.11, pytest-3.2.3, py-1.4.34, pluggy-0.4.0
rootdir: /Users/henddher/Documents/python_workspace/pdfrw_playground/pdfrw, inifile:
collected 215 items                                                                                                                                                                                                                                                                        

test_examples.py .......FFFF...
test_pdfdict.py .
test_pdfreader_init.py ..
test_pdfstring.py ..........
test_roundtrip.py .....s..................s.....s.....sss....s........s..................s.....s.....sss....s........s..................s.....s.....sss....s........s..................s.....s.....sss....s...

======================================================================================================================================== FAILURES =========================================================================================================================================

...

==================================================================================================================== 4 failed, 183 passed, 28 skipped in 32.50 seconds ====================================================================================================================

I am getting 4 failures running against master branch though (8774f15b1189657e5c30079b4d658284660ceadc)

Thoughts?

wrkhenddher commented 7 years ago

Here is one of the failures:

test_examples.py .......F
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

self = <tests.test_examples.TestOnePdf testMethod=test_rl1_4up>

    def test_rl1_4up(self):
        if sys.version_info < (2, 7):
            return
        self.do_test('rl1/4up     b1c400de699af29ea3f1983bb26870ab',
>                    scrub=True)

../../../test_examples.py:171: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../test_examples.py:99: in do_test
    PdfWriter(dstf).addpages(PdfReader(scrub).pages).write()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {}, fname = '4up.b1c400de699af29ea3f1983bb26870ab.pdf', fdata = None, decompress = False, decrypt = False, password = '', disable_gc = True, verbose = True

    def __init__(self, fname=None, fdata=None, decompress=False,
                 decrypt=False, password='', disable_gc=True, verbose=True):
        self.private.verbose = verbose

        # Runs a lot faster with GC off.
        disable_gc = disable_gc and gc.isenabled()
        if disable_gc:
            gc.disable()

        try:
            if fname is not None:
                assert fdata is None
                # Allow reading preexisting streams like pyPdf
                if hasattr(fname, 'read'):
                    fdata = fname.read()
                else:
                    try:
                        f = open(fname, 'rb')
                        fdata = f.read()
                        f.close()
                    except IOError:
                        raise PdfParseError('Could not read PDF file %s' %
>                                           fname)
E                                           PdfParseError: Could not read PDF file 4up.b1c400de699af29ea3f1983bb26870ab.pdf

../../../pdfrw/pdfreader.py:573: PdfParseError

It seems like the four that are failing (off master) are related to static pdf b1c400de699af29ea3f1983bb26870ab.pdf

pmaupin commented 7 years ago

That particular one is a reportlab test (see the rl1 in the testname). If the others are as well, it may indicate an unsupported version of reportlab or a missing version of reportlab on your computer.
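
To check for a missing optional dependency before running the suite, something like the following can help. It is a sketch under assumptions: `missing_modules` is a hypothetical helper, it only tests whether a module can be found (not whether its version is supported), and it relies on importlib.util.find_spec, which requires Python 3.4 or later rather than the Python 2.7 used elsewhere in this thread.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# "zlib" is in the standard library; "reportlab" is a third-party package.
print(missing_modules(["reportlab", "zlib"]))
```

An empty list means every named module is importable; any name printed would need to be installed first.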

Thanks, Pat

wrkhenddher commented 7 years ago

That was it!

I didn't know reportlab was required. I installed it with pip and now the master branch passes all tests.

I will add these notes to the testing section of README.md.

Thanks again!

wrkhenddher commented 7 years ago

Back in my branch, I've got 10 failures - which is good because I did change code :P

# Failures
TestOnePdf.test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf
TestOnePdf.test_compress_35df0b8cff4afec0c08f08c6a5bc9857.pdf
TestOnePdf.test_compress_9f98322c243fe67726d56ccfa8e0885b.pdf
TestOnePdf.test_decompress_2ac7c68e26a8ef797aead15e4875cc6d.pdf
TestOnePdf.test_decompress_35df0b8cff4afec0c08f08c6a5bc9857.pdf
TestOnePdf.test_decompress_9f98322c243fe67726d56ccfa8e0885b.pdf
TestOnePdf.test_repaginate_2ac7c68e26a8ef797aead15e4875cc6d.pdf
TestOnePdf.test_repaginate_35df0b8cff4afec0c08f08c6a5bc9857.pdf
TestOnePdf.test_simple_2ac7c68e26a8ef797aead15e4875cc6d.pdf
TestOnePdf.test_simple_35df0b8cff4afec0c08f08c6a5bc9857.pdf

Will keep you posted!

$ pytest
=================================================================================================================================== test session starts ===================================================================================================================================
platform darwin -- Python 2.7.11, pytest-3.2.3, py-1.4.34, pluggy-0.4.0
rootdir: /Users/henddher/Documents/python_workspace/pdfrw_playground/pdfrw, inifile:
collected 215 items                                                                                                                                                                                                                                                                        

test_examples.py ..............
test_pdfdict.py .
test_pdfreader_init.py ..
test_pdfstring.py ..........
test_roundtrip.py .....s.....F...F........s.....s..F..sss....s........s.....F...F........s.....s..F..sss....s........s.....F...F........s.....s.....sss....s........s.....F...F........s.....s.....sss....s...

======================================================================================================================================== FAILURES =========================================================================================================================================
______________________________________________________________________________________________________________ TestOnePdf.test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf ______________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
    verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
    self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
    return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
    value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
    value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>

    def loadindirect(self, key, PdfDict=PdfDict,
                     isinstance=isinstance):
        result = self.indirect_objects.get(key)
        if not isinstance(result, PdfIndirect):
            return result
        source = self.source
        offset = int(self.source.obj_offsets.get(key, '0'))
        if not offset:
            source.warning("Did not find PDF object %s", key)
            return None

        # Read the object header and validate it
        objnum, gennum = key
        source.floc = offset
        objid = source.multiple(3)
        ok = len(objid) == 3
        ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
        ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
        ok = ok and objid[2] == 'obj'
        if not ok:
            source.floc = offset
>           source.next()
E           StopIteration

../../pdfrw/pdfreader.py:201: StopIteration
______________________________________________________________________________________________________________ TestOnePdf.test_compress_35df0b8cff4afec0c08f08c6a5bc9857.pdf ______________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_compress_35df0b8cff4afec0c08f08c6a5bc9857.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
    verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
    self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
    return self.get(PdfName(name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'/Length': '92', '/Size': '36', '/DecodeParms': {'/Columns': '4', '/Predictor...Type': '/Catalog', '/Outlines': (7, 0), '/Metadata': (3, 0)}, '/Info': (12, 0)}, key = '/Info', dictget = <method 'get' of 'dict' objects>, isinstance = <built-in function isinstance>
PdfIndirect = <class 'tests.pdfrw.objects.pdfindirect.PdfIndirect'>

    def get(self, key, dictget=dict.get, isinstance=isinstance,
            PdfIndirect=PdfIndirect):
        ''' Get a value out of the dictionary,
                after resolving any indirect objects.
            '''
        value = dictget(self, key)
        if isinstance(value, PdfIndirect):
            # We used to use self[key] here, but that does an
            # unwanted check on the type of the key (github issue #98).
            # Python will keep the old key object in the dictionary,
            # so that check is not necessary.
            value = value.real_value()
            if value is not None:
                dict.__setitem__(self, key, value)
            else:
>               del self[name]
E               NameError: global name 'name' is not defined

../../pdfrw/objects/pdfdict.py:147: NameError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
[WARNING] tokens.py:221 Did not find PDF object (12, 0) (line=27, col=1, token='endobj')
______________________________________________________________________________________________________________ TestOnePdf.test_compress_9f98322c243fe67726d56ccfa8e0885b.pdf ______________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_compress_9f98322c243fe67726d56ccfa8e0885b.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:98: in roundtrip
    self.assertEqual(hash, expects)
E   AssertionError: 'f6fcfba92c923f6abf9d374553013493' != '3167fa11a3f1f4a06f90294b21e101b7'
E   - f6fcfba92c923f6abf9d374553013493
E   + 3167fa11a3f1f4a06f90294b21e101b7
_____________________________________________________________________________________________________________ TestOnePdf.test_decompress_2ac7c68e26a8ef797aead15e4875cc6d.pdf _____________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_decompress_2ac7c68e26a8ef797aead15e4875cc6d.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
    verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
    self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
    return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
    value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
    value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>

    def loadindirect(self, key, PdfDict=PdfDict,
                     isinstance=isinstance):
        result = self.indirect_objects.get(key)
        if not isinstance(result, PdfIndirect):
            return result
        source = self.source
        offset = int(self.source.obj_offsets.get(key, '0'))
        if not offset:
            source.warning("Did not find PDF object %s", key)
            return None

        # Read the object header and validate it
        objnum, gennum = key
        source.floc = offset
        objid = source.multiple(3)
        ok = len(objid) == 3
        ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
        ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
        ok = ok and objid[2] == 'obj'
        if not ok:
            source.floc = offset
>           source.next()
E           StopIteration

../../pdfrw/pdfreader.py:201: StopIteration
_____________________________________________________________________________________________________________ TestOnePdf.test_decompress_35df0b8cff4afec0c08f08c6a5bc9857.pdf _____________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_decompress_35df0b8cff4afec0c08f08c6a5bc9857.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
    verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
    self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
    return self.get(PdfName(name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'/Length': '92', '/Size': '36', '/DecodeParms': {'/Columns': '4', '/Predictor...Type': '/Catalog', '/Outlines': (7, 0), '/Metadata': (3, 0)}, '/Info': (12, 0)}, key = '/Info', dictget = <method 'get' of 'dict' objects>, isinstance = <built-in function isinstance>
PdfIndirect = <class 'tests.pdfrw.objects.pdfindirect.PdfIndirect'>

    def get(self, key, dictget=dict.get, isinstance=isinstance,
            PdfIndirect=PdfIndirect):
        ''' Get a value out of the dictionary,
                after resolving any indirect objects.
            '''
        value = dictget(self, key)
        if isinstance(value, PdfIndirect):
            # We used to use self[key] here, but that does an
            # unwanted check on the type of the key (github issue #98).
            # Python will keep the old key object in the dictionary,
            # so that check is not necessary.
            value = value.real_value()
            if value is not None:
                dict.__setitem__(self, key, value)
            else:
>               del self[name]
E               NameError: global name 'name' is not defined

../../pdfrw/objects/pdfdict.py:147: NameError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
[WARNING] tokens.py:221 Did not find PDF object (12, 0) (line=27, col=1, token='endobj')
_____________________________________________________________________________________________________________ TestOnePdf.test_decompress_9f98322c243fe67726d56ccfa8e0885b.pdf _____________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_decompress_9f98322c243fe67726d56ccfa8e0885b.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:98: in roundtrip
    self.assertEqual(hash, expects)
E   AssertionError: '4897109976e903a3b3d77af4c5e47d87' != '4b53b63b0779b81d8f9569e66ca3d8ee'
E   - 4897109976e903a3b3d77af4c5e47d87
E   + 4b53b63b0779b81d8f9569e66ca3d8ee
_____________________________________________________________________________________________________________ TestOnePdf.test_repaginate_2ac7c68e26a8ef797aead15e4875cc6d.pdf _____________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_repaginate_2ac7c68e26a8ef797aead15e4875cc6d.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
    verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
    self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
    return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
    value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
    value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>

    def loadindirect(self, key, PdfDict=PdfDict,
                     isinstance=isinstance):
        result = self.indirect_objects.get(key)
        if not isinstance(result, PdfIndirect):
            return result
        source = self.source
        offset = int(self.source.obj_offsets.get(key, '0'))
        if not offset:
            source.warning("Did not find PDF object %s", key)
            return None

        # Read the object header and validate it
        objnum, gennum = key
        source.floc = offset
        objid = source.multiple(3)
        ok = len(objid) == 3
        ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
        ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
        ok = ok and objid[2] == 'obj'
        if not ok:
            source.floc = offset
>           source.next()
E           StopIteration

../../pdfrw/pdfreader.py:201: StopIteration
_____________________________________________________________________________________________________________ TestOnePdf.test_repaginate_35df0b8cff4afec0c08f08c6a5bc9857.pdf _____________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_repaginate_35df0b8cff4afec0c08f08c6a5bc9857.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
    verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
    self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
    return self.get(PdfName(name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'/Length': '92', '/Size': '36', '/DecodeParms': {'/Columns': '4', '/Predictor...Type': '/Catalog', '/Outlines': (7, 0), '/Metadata': (3, 0)}, '/Info': (12, 0)}, key = '/Info', dictget = <method 'get' of 'dict' objects>, isinstance = <built-in function isinstance>
PdfIndirect = <class 'tests.pdfrw.objects.pdfindirect.PdfIndirect'>

    def get(self, key, dictget=dict.get, isinstance=isinstance,
            PdfIndirect=PdfIndirect):
        ''' Get a value out of the dictionary,
                after resolving any indirect objects.
            '''
        value = dictget(self, key)
        if isinstance(value, PdfIndirect):
            # We used to use self[key] here, but that does an
            # unwanted check on the type of the key (github issue #98).
            # Python will keep the old key object in the dictionary,
            # so that check is not necessary.
            value = value.real_value()
            if value is not None:
                dict.__setitem__(self, key, value)
            else:
>               del self[name]
E               NameError: global name 'name' is not defined

../../pdfrw/objects/pdfdict.py:147: NameError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
[WARNING] tokens.py:221 Did not find PDF object (12, 0) (line=27, col=1, token='endobj')
_______________________________________________________________________________________________________________ TestOnePdf.test_simple_2ac7c68e26a8ef797aead15e4875cc6d.pdf _______________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_simple_2ac7c68e26a8ef797aead15e4875cc6d.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
    verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
    self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
    return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
    value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
    value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>

    def loadindirect(self, key, PdfDict=PdfDict,
                     isinstance=isinstance):
        result = self.indirect_objects.get(key)
        if not isinstance(result, PdfIndirect):
            return result
        source = self.source
        offset = int(self.source.obj_offsets.get(key, '0'))
        if not offset:
            source.warning("Did not find PDF object %s", key)
            return None

        # Read the object header and validate it
        objnum, gennum = key
        source.floc = offset
        objid = source.multiple(3)
        ok = len(objid) == 3
        ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
        ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
        ok = ok and objid[2] == 'obj'
        if not ok:
            source.floc = offset
>           source.next()
E           StopIteration

../../pdfrw/pdfreader.py:201: StopIteration
_______________________________________________________________________________________________________________ TestOnePdf.test_simple_35df0b8cff4afec0c08f08c6a5bc9857.pdf _______________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_simple_35df0b8cff4afec0c08f08c6a5bc9857.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
    verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
    self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
    return self.get(PdfName(name))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'/Length': '92', '/Size': '36', '/DecodeParms': {'/Columns': '4', '/Predictor...Type': '/Catalog', '/Outlines': (7, 0), '/Metadata': (3, 0)}, '/Info': (12, 0)}, key = '/Info', dictget = <method 'get' of 'dict' objects>, isinstance = <built-in function isinstance>
PdfIndirect = <class 'tests.pdfrw.objects.pdfindirect.PdfIndirect'>

    def get(self, key, dictget=dict.get, isinstance=isinstance,
            PdfIndirect=PdfIndirect):
        ''' Get a value out of the dictionary,
                after resolving any indirect objects.
            '''
        value = dictget(self, key)
        if isinstance(value, PdfIndirect):
            # We used to use self[key] here, but that does an
            # unwanted check on the type of the key (github issue #98).
            # Python will keep the old key object in the dictionary,
            # so that check is not necessary.
            value = value.real_value()
            if value is not None:
                dict.__setitem__(self, key, value)
            else:
>               del self[name]
E               NameError: global name 'name' is not defined

../../pdfrw/objects/pdfdict.py:147: NameError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
[WARNING] tokens.py:221 Did not find PDF object (12, 0) (line=27, col=1, token='endobj')
=================================================================================================================== 10 failed, 177 passed, 28 skipped in 32.64 seconds ====================================================================================================================

wrkhenddher commented 7 years ago

No dice :(

I tried running all the test_roundtrip cases with a breakpoint at pdfrw/uncompress.py:32:

diff --git a/tests/test_roundtrip.py b/tests/test_roundtrip.py
index a8349a6..c878101 100755
--- a/tests/test_roundtrip.py
+++ b/tests/test_roundtrip.py
@@ -109,6 +109,7 @@ class TestOnePdf(unittest.TestCase):

 def build_tests():
+    import pdb; pdb.set_trace()
     def test_closure(*args, **kw):
         def test(self):
             self.roundtrip(*args, **kw)

The uncompress() function in pdfrw/uncompress.py is never called.

I must be missing something when it comes to the test cases and the mismatched md5s.

Help!

BTW - PR is up but it's not ready for code-review because of the failing tests.

pmaupin commented 7 years ago

1) Obviously I screwed up the fix for #98 -- there shouldn't be an error right there.

2) If I recall correctly, some of the tests use subprocess so the process where uncompress() is called might not be the one that you set the breakpoint in.

wrkhenddher commented 7 years ago

Bingo. You are correct about the subprocess. I added the breakpoint to uncompress and I am about to pdb it :)

I'll look at the #98 changes. I assume you mean they are what causes the del self[name] failures (NameError: global name 'name' is not defined).

wrkhenddher commented 7 years ago

Hi @pmaupin

I haven't had much luck with regression.

I don't think #98 is wrong. From what I can tell, it looks OK to me. What did you mean by "there shouldn't be an error right there"? Where? Never mind :) name doesn't exist in that context :P (I committed that fix)

I guess that the fact that all tests passed using master and then some fail using my branch puts the blame on my code. No?

Some failures are MD5 mismatches, but there are also actual failures while processing the PDFs.

The weird thing is that they fail "somewhere" else. The only reason I can think of is that I am overwriting the PDF data during flate_png.

I will try to test flate_png alone against some sample data. I will compare results with yours. That will tell what's happening - hopefully. I am still in doubt about your 2 filters (1 and 2) doing Up filtering.

wrkhenddher commented 6 years ago

Hi @pmaupin

Today, I came back to this.

I hope you can take a quick look and shed some light.

It turns out that all the failing tests throw StopIteration from loadindirect.

It seems that uncompress (after calling flate_png) produces data arrays that are larger, and since uncompress is being called with leave_raw = True, the original obj.stream is replaced with the uncompressed, unfiltered, larger data array.

I wonder: if obj.stream is replaced with a longer array, could that throw off some offset calculation, so that higher layers in the library read bytes at the wrong location?

### uncompress.py

            l0 = len(data)
            if 10 <= predictor <= 15:
                data, error = flate_png(data, predictor, columns, colors, bpc)
                l1 = len(data)
                try:
                    assert error == None
                    assert l1 == l0, "Mismatch len %d vs %d" % (l0, l1)
                except Exception as e:
                    print "Exception", e
                    pass
            elif predictor != 1:
                error = ('Unsupported flatedecode predictor %s' %
                         repr(predictor))
    if error is None:
        assert not dco.unconsumed_tail
        if dco.unused_data.strip():
            error = ('Unconsumed compression data: %s' %
                     repr(dco.unused_data[:20]))
    if error is None:
        obj.Filter = None
        obj.stream = data if leave_raw else convert_load(data)

Here is the output of one of the test_roundtrip tests:

______________________________________________________________________________________________________________ TestOnePdf.test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf ______________________________________________________________________________________________________________

self = <tests.test_roundtrip.TestOnePdf testMethod=test_compress_2ac7c68e26a8ef797aead15e4875cc6d.pdf>

    def test(self):
>       self.roundtrip(*args, **kw)

../../test_roundtrip.py:114: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../test_roundtrip.py:77: in roundtrip
    verbose=False)
../../pdfrw/pdfreader.py:671: in __init__
    self.Info = trailer.Info
../../pdfrw/objects/pdfdict.py:130: in __getattr__
    return self.get(PdfName(name))
../../pdfrw/objects/pdfdict.py:143: in get
    value = value.real_value()
../../pdfrw/objects/pdfindirect.py:21: in real_value
    value = self.value = self._loader(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = {'/Root': {'/Metadata': (1, 0), '/Type': '/Catalog', '/Pages': (5, 0)}}, key = (6, 0), PdfDict = <class 'tests.pdfrw.objects.pdfdict.PdfDict'>, isinstance = <built-in function isinstance>

    def loadindirect(self, key, PdfDict=PdfDict,
                     isinstance=isinstance):
        result = self.indirect_objects.get(key)
        if not isinstance(result, PdfIndirect):
            return result
        source = self.source
        offset = int(self.source.obj_offsets.get(key, '0'))
        if not offset:
            source.warning("Did not find PDF object %s", key)
            return None

        # Read the object header and validate it
        objnum, gennum = key
        source.floc = offset
        objid = source.multiple(3)
        ok = len(objid) == 3
        ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
        ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
        ok = ok and objid[2] == 'obj'
        if not ok:
            source.floc = offset
>           source.next()
E           StopIteration

../../pdfrw/pdfreader.py:201: StopIteration
---------------------------------------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------------------------------------
leave_raw True
Exception Mismatch len 65 vs 52
leave_raw True
Exception Mismatch len 28 vs 21
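One possible reading of the two mismatches above: a PNG-predicted stream carries one filter-type byte in front of every row, so the unfiltered data would normally be shorter than the input, not equal to it. The sketch below is only an illustration; the helper name is hypothetical, and the row widths (4 and 3 bytes) are guesses that happen to fit the reported numbers, not values read from this PDF.

```python
def expected_unfiltered_length(filtered_len, row_bytes):
    """Length of the data once the per-row filter-type bytes are removed.

    row_bytes is the payload width of one row, without the filter-type byte.
    """
    stride = row_bytes + 1                      # payload plus filter-type byte
    rows, remainder = divmod(filtered_len, stride)
    if remainder:
        raise ValueError("length %d is not a multiple of stride %d"
                         % (filtered_len, stride))
    return rows * row_bytes


print(expected_unfiltered_length(65, row_bytes=4))  # 52
print(expected_unfiltered_length(28, row_bytes=3))  # 21
```

If that reading is right, the `l1 == l0` assertion would fire even when flate_png behaves correctly.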

pmaupin commented 6 years ago

That seems unlikely.

If I were debugging this, I would probably ask something like pdftk to create an uncompressed copy of the PDF, verify that the uncompressed copy is correct (displays correctly, etc.), and would then verify that my decompression matched that of the other tool.
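A rough sketch of that workflow in Python 3, assuming pdftk is installed and on the PATH. The two function names are made up for illustration; `uncompress` is pdftk's own output option.

```python
import shutil
import subprocess


def pdftk_uncompress_command(src, dst):
    """Build the pdftk command line that writes an uncompressed copy of src."""
    return ["pdftk", src, "output", dst, "uncompress"]


def make_reference_copy(src, dst):
    """Run pdftk if it is installed; return True when a copy was written."""
    if shutil.which("pdftk") is None:
        return False
    subprocess.run(pdftk_uncompress_command(src, dst), check=True)
    return True
```

The uncompressed copy can then be opened in a viewer and its streams compared byte for byte with what the library's own decompression produces.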

wrkhenddher commented 6 years ago

Thanks @pmaupin

I figured out an issue with the new implementation of flate_png. When reconstructing the pixels, I was not using the correct previous scanline but instead I was using the unmodified scanline.
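The point about the previous scanline can be sketched as follows. This is a minimal Python 3 illustration of PNG row reconstruction as described in the PNG specification, not the code from the PR; `unfilter`, `row_bytes` and `bpp` are names chosen here.

```python
def paeth(a, b, c):
    """PNG Paeth predictor: pick whichever of a, b, c is closest to a + b - c."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def unfilter(data, row_bytes, bpp):
    """Undo PNG row filters; each row is 1 filter-type byte + row_bytes bytes."""
    out = bytearray()
    prev = bytearray(row_bytes)          # the row above the first row is all zeros
    stride = row_bytes + 1
    for start in range(0, len(data), stride):
        ftype = data[start]
        cur = bytearray(data[start + 1:start + stride])
        for i in range(len(cur)):
            a = cur[i - bpp] if i >= bpp else 0    # left, already reconstructed
            b = prev[i]                            # above, reconstructed
            c = prev[i - bpp] if i >= bpp else 0   # upper left, reconstructed
            if ftype == 0:
                pred = 0
            elif ftype == 1:
                pred = a
            elif ftype == 2:
                pred = b
            elif ftype == 3:
                pred = (a + b) // 2
            elif ftype == 4:
                pred = paeth(a, b, c)
            else:
                raise ValueError("Unsupported PNG filter %d" % ftype)
            cur[i] = (cur[i] + pred) & 0xFF
        out += cur
        prev = cur                       # keep the reconstructed row, not the raw one
    return bytes(out)


# Three rows, Up filter on rows 2 and 3: each row adds onto the row above it.
rows = bytes([0, 1, 2, 3,  2, 1, 1, 1,  2, 1, 1, 1])
print(list(unfilter(rows, row_bytes=3, bpp=1)))   # [1, 2, 3, 2, 3, 4, 3, 4, 5]
```

Had `prev` been set to the raw (still filtered) row, the third row would come out as 2, 2, 2 instead of 3, 4, 5.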

With that, I am now passing almost all cases except 3:

These 2 existing test cases are failing ("The Time Machine" pdf):

TestOnePdf.test_compress_9f98322c243fe67726d56ccfa8e0885b.pdf
TestOnePdf.test_decompress_9f98322c243fe67726d56ccfa8e0885b.pdf

When I look at the generated PDFs, I can see that the DP logo on the first page is rendered incorrectly (some scanlines bleed yellow horizontally). That looks like an incorrect Sub filter ...

The strange thing is that when I compare the output of flate_png_orig with flate_png, it seems like the original was not applying the inverse of the filter correctly? (assuming I am understanding the spec correctly and interpreting the code correctly)

Take a look at my standalone flate_png test for the Sub filter (filter 1):

TestFlatePNG.test_flate_png_filter_1

# nc: number of columns
# nr: number of rows
# bpc: bits per component
# ncolors: number of colors
# e.g. RGBA: 8 bits per channel, 4 channels => 32 bits => pixel size = 4 bytes

# nc=2, nr=3, bpc=8, ncolors=4

ERROR:root:Data: array('B', [0, 0, 4, 8, 12, 16, 20, 24, 28, 1, 8, 12, 16, 20, 24, 28, 32, 36, 1, 16, 20, 24, 28, 32, 36, 40, 44])
ERROR:root:   0    0
ERROR:root:   4    4
ERROR:root:   8    8
ERROR:root:  12   12
ERROR:root:  16   16
ERROR:root:  20   20
ERROR:root:  24   24
ERROR:root:  28   28
ERROR:root:   9    8  << 
ERROR:root:  21   12 
ERROR:root:  37   16 
ERROR:root:  57   20 
...
| Type | Name | Filter Function             | Reconstruction Function       |
|------|------|-----------------------------|-------------------------------|
| 1    | Sub  | Filt(x) = Orig(x) - Orig(a) | Recon(x) = Filt(x) + Recon(a) |

According to the spec, https://www.w3.org/TR/2003/REC-PNG-20031110/#9FtIntro, the reconstruction is the addition of the current filtered value and the corresponding byte of the pixel to the left. Given that this is the 1st pixel of the scanline, all bytes of the pixel to its left are supposed to be 0.

See the line marked <<: 9 was produced by adding the 1st byte of the first pixel of the second scanline (data[10]) to the filter-type byte (data[9]).

But I think it should have been data[10] = data[10] + 0, which is 8.

I think it all stems from pixel size. In this case, it is supposed to be 4 bytes, but the original flate_png was always subtracting 1 from the current index, which essentially meant that the pixel size was always 1 byte.
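The two columns of the printout can be reproduced with a few lines of Python 3. This is only a sketch; `undo_sub` is a name made up here, and the second call mimics what the old code appears to do (1-byte pixels, with the filter-type byte treated as the left neighbour), which is an inference from the numbers rather than from reading that code.

```python
def undo_sub(row, bpp):
    """Reverse the PNG Sub filter on one row of byte values."""
    out = list(row)
    for i in range(bpp, len(out)):
        out[i] = (out[i] + out[i - bpp]) & 0xFF
    return out


row2 = [8, 12, 16, 20, 24, 28, 32, 36]   # second scanline, filter byte removed

# 4-byte pixels: the first pixel has no left neighbour and stays unchanged.
print(undo_sub(row2, bpp=4))             # [8, 12, 16, 20, 32, 40, 48, 56]

# 1-byte pixels with the filter-type byte (1) kept in front of the row.
print(undo_sub([1] + row2, bpp=1)[1:])   # [9, 21, 37, 57, 81, 109, 141, 177]
```

The first result matches the right-hand column of the printout, the second matches the left-hand column.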

Either I am misunderstanding the specification or I am not understanding what columns, colors and bpc are in the flate_png function.

# uncompress.py
def flate_png(data, predictor=1, columns=1, colors=1, bpc=8):

Isn't columns the number of pixels in a single scanline? Isn't colors the number of channels? Isn't bpc the number of bits per channel?
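As far as I understand the FlateDecode parameters in the PDF specification, the answer to all three is yes: /Columns is samples per row, /Colors is components per sample, and /BitsPerComponent is bits per component. Under that assumption, the row layout could be derived as in this sketch (`row_geometry` is a hypothetical helper, not part of pdfrw):

```python
def row_geometry(columns, colors=1, bpc=8):
    """Return (bytes per pixel, payload bytes per row, stride incl. filter byte)."""
    bits_per_pixel = colors * bpc
    bpp = max(1, (bits_per_pixel + 7) // 8)          # PNG rounds up to whole bytes
    row_bytes = (columns * bits_per_pixel + 7) // 8
    return bpp, row_bytes, row_bytes + 1


print(row_geometry(columns=2, colors=4, bpc=8))   # (4, 8, 9)
print(row_geometry(columns=6, colors=1, bpc=8))   # (1, 6, 7)
```

The first call corresponds to the test data above (27 bytes, 3 rows of 9); the second corresponds to the xref stream from the original report, where the row index advances by 7.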

Here is the full output: The left is the output from the flate_png_orig and the right is the new flate_png: byte by byte.

$ pytest -k test_flate
=================================================================================================================================== test session starts ===================================================================================================================================
platform darwin -- Python 2.7.11, pytest-3.2.3, py-1.4.34, pluggy-0.4.0
rootdir: /Users/henddher/Documents/python_workspace/pdfrw_playground/pdfrw, inifile:
collected 221 items                                                                                                                                                                                                                                                                        

test_flate_png.py ..F...

======================================================================================================================================== FAILURES =========================================================================================================================================
__________________________________________________________________________________________________________________________ TestFlatePNG.test_flate_png_filter_1 ___________________________________________________________________________________________________________________________

self = <tests.test_flate_png.TestFlatePNG testMethod=test_flate_png_filter_1>

    def test_flate_png_filter_1(self):
        # Sub filter
        data, nc, nr, bpc, ncolors = create_data(nc=2, nr=3, bpc=8, ncolors=4, filter_type=1)
        d1, error1 = flate_png_orig(data, 12, nc, ncolors, bpc)

        data, nc, nr, bpc, ncolors = create_data(nc=2, nr=3, bpc=8, ncolors=4, filter_type=1)
        d2, error2 = flate_png(data, 12, nc, ncolors, bpc)

        print_data(d1, d2)
>       assert d1 == d2
E       AssertionError: assert '\x00\x04\x08...y\x9d\xc5\xf1' == '\x00\x04\x08\...4\x18\x1c08@H'
E         - \x00\x04\x08\x0c\x10\x14\x18\x1c\t\x15%9Qm\x8d\xb1\x11%=Yy\x9d\xc5\xf1
E         + \x00\x04\x08\x0c\x10\x14\x18\x1c\x08\x0c\x10\x14 (08\x10\x14\x18\x1c08@H

test_flate_png.py:79: AssertionError
---------------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------------
ERROR:root:Data: array('B', [0, 0, 4, 8, 12, 16, 20, 24, 28, 1, 8, 12, 16, 20, 24, 28, 32, 36, 1, 16, 20, 24, 28, 32, 36, 40, 44])
ERROR:root:Data: array('B', [0, 0, 4, 8, 12, 16, 20, 24, 28, 1, 8, 12, 16, 20, 24, 28, 32, 36, 1, 16, 20, 24, 28, 32, 36, 40, 44])
ERROR:root:   0    0
ERROR:root:   4    4
ERROR:root:   8    8
ERROR:root:  12   12
ERROR:root:  16   16
ERROR:root:  20   20
ERROR:root:  24   24
ERROR:root:  28   28
ERROR:root:   9    8
ERROR:root:  21   12
ERROR:root:  37   16
ERROR:root:  57   20
ERROR:root:  81   32
ERROR:root: 109   40
ERROR:root: 141   48
ERROR:root: 177   56
ERROR:root:  17   16
ERROR:root:  37   20
ERROR:root:  61   24
ERROR:root:  89   28
ERROR:root: 121   48
ERROR:root: 157   56
ERROR:root: 197   64
ERROR:root: 241   72
================================================================================================================================== 215 tests deselected ===================================================================================================================================
=================================================================================================================== 1 failed, 5 passed, 215 deselected in 0.18 seconds ====================================================================================================================

What do you think?

What am I missing?

wrkhenddher commented 6 years ago

Hi @pmaupin

I am starting to suspect that the aforementioned PDF might be incorrect. It's the only PDF that looks bad after the new flate_png implementation.

This is the current output I am obtaining:

[screenshot: 2017-11-14, 4:30:24 PM]

Instead of:

[screenshot: 2017-11-14, 4:57:37 PM]

I arrive at this conclusion because I added tests for each filter implementation and they all produce the expected output.

The test-cases were produced by running the built-in tests of the official PNG lib (libpng-1.6.34). I modified the code of the library to output each scanline (row of pixels) before and after reverse filtering.

/* pngrutil.c */

void /* PRIVATE */
png_read_filter_row(png_structrp pp, png_row_infop row_info, png_bytep row,
    png_const_bytep prev_row, int filter)
{
   /* OPTIMIZATION: DO NOT MODIFY THIS FUNCTION, instead #define
    * PNG_FILTER_OPTIMIZATIONS to a function that overrides the generic
    * implementations.  See png_init_filter_functions above.
    */

    printf("width = %d\n", row_info->width);
    printf("bit_depth = %d\n", row_info->bit_depth);
    printf("channels = %d\n", row_info->channels);
    printf("color_type = %d\n", row_info->color_type);
    printf("pixel_depth = %d\n", row_info->pixel_depth);
    printf("rowbytes = %zu\n", row_info->rowbytes);
    printf("filter = %d\n", filter);
    printf("data = [ ");
    for (int i = 0; i < row_info->rowbytes; i++) {
        printf("0x%.2x,", row[i] & 0xff);
    }
    printf(" ]\n");

   if (filter > PNG_FILTER_VALUE_NONE && filter < PNG_FILTER_VALUE_LAST)
   {
      if (pp->read_filter[0] == NULL)
         png_init_filter_functions(pp);

      pp->read_filter[filter-1](row_info, row, prev_row);
   }
   else
       png_debug1(1, "no filter %d", filter);

    printf("expected = [ ");
    for (int i = 0; i < row_info->rowbytes; i++) {
        printf("0x%.2x,", row[i] & 0xff);
    }
    printf(" ]\n");
}

I took sections of the output with reasonable bit depths and number of channels to compose the test_flate_png_alt_* in test_flate_png.py.
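Turning that printout into test vectors could look roughly like this. It is a Python 3 sketch that assumes the exact `filter = ...`, `data = [ ... ]` and `expected = [ ... ]` line format produced by the printf calls above; `parse_rows` is a hypothetical helper, not something in the PR.

```python
import re

HEX_BYTE = re.compile(r"0x([0-9a-fA-F]{2})")


def parse_rows(log_text):
    """Collect (filter, data, expected) tuples from the debug printf output."""
    cases, current = [], {}
    for line in log_text.splitlines():
        key, _, value = line.partition("=")
        key = key.strip()
        if key == "filter":
            current = {"filter": int(value)}
        elif key in ("data", "expected"):
            current[key] = bytes(int(h, 16) for h in HEX_BYTE.findall(value))
            if key == "expected" and "data" in current:
                cases.append((current.get("filter"),
                              current["data"], current["expected"]))
                current = {}
    return cases


log = "width = 2\nfilter = 1\ndata = [ 0x01,0x02, ]\nexpected = [ 0x01,0x03, ]\n"
print(parse_rows(log))   # [(1, b'\x01\x02', b'\x01\x03')]
```

Each tuple gives the filter type, the filtered row and the row libpng reconstructed from it, which is enough to drive one assertion per row.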

I am pretty certain that the new filters work correctly - as they are verified against the tests in the PNG lib.

The only outstanding issue is the difference in length of the data coming out of flate_png. I suspect the conversion that happens at the end, since that depends on the Python version.

https://github.com/pmaupin/pdfrw/pull/114/files#diff-4a788690953b8096e62f68e4f7e69471R185

Thoughts? Advice?

wrkhenddher commented 6 years ago

Hi @pmaupin

🎉 🎉 Good news 🎉 🎉

I refactored my implementation of the filters (much simpler now) and added numerous tests, and it all looks fine now, including the questionable aforementioned PDF.

As you can see, the produced PDF now looks correct, even though its checksum differs (which makes sense, since the original Sub filter seemed to be incorrect).

[screenshot: 2017-11-16, 12:46:20 PM]

I created an IPython Notebook (./tests/Render Bitmap.py) that helped me tons visualizing the rasters produced by the flate_png_impl function.

IMHO, the only thing left to do is to remove the original flate_png_orig function and update the checksum of the aforementioned PDF.

If you can think of other PDFs (already available in your pool of static_pdf) that perhaps failed before due to filtering, we could turn those ON so they are added to the tests.

Please take a look at the whole thing - I am pretty certain it's all good functionality-wise 😎

PR #114

pmaupin commented 6 years ago

That's awesome! I don't have time to look right at the moment, but this should be a good addition for the next release. And you're right; it is probably time to see if other PDFs now work.

Thanks, Pat

wrkhenddher commented 6 years ago

After modifying test_roundtrip.py, I was finally able to let the encrypted PDFs be tested. It turned out that they didn't produce good output. They all came up as "blank" PDFs, although their page counts were correct. In summary, I ended up reverting expected.txt to skip all the encrypted PDFs.

At this point, I think I am pretty much done with all the changes I first intended 🎉 😄

Take a look at the PR and let me know.

wrkhenddher commented 5 years ago

Hi @pmaupin

I just noticed that PR #114 was merged but this issue was never marked as Closed.

Also, there has not been any release since 9/2017, so the FlateDecode support for PNG predictors has never been available to anyone (unless they do pip install -e git+https://github.com/pmaupin/pdfrw.git@6c892160e7e976b243db0c12c3e56ed8c78afc5a#egg=pdfrw).

When do you think you'll be able to release a new version?

pmaupin commented 5 years ago

Yeah, I think we got distracted by you not being able to do the test repo.

Thanks for bugging me! I'll bump it to the top of the list.

Pat

Sunny-Engineer commented 4 years ago

We are encountering this error too. Has there been any progress?

psolin commented 3 years ago

I am also running into this issue with one of my projects.

WassawBound commented 3 years ago

I'm running into this issue as well. Pdfrw fails when running against a PDF which includes our company logo. If I remove this sheet it runs as expected. I'd rather have an updated PDF library than modify the PDF.

Sunny-Engineer commented 3 years ago

We switched to a commercial pdf library, and have had no problems. It is unfortunate that this defect is not being addressed, because it makes it impossible to trust the library or developer with production code.

bwiltse2620 commented 2 years ago

Having the same issue of "unsupported PNG filter 4" on a couple PDFs from our suppliers.

Sunny-Engineer commented 2 years ago

There is no workaround; we switched to C# and used a commercial product from Foxit... :(

Henddher commented 2 years ago

Did you try the latest master? AFAIK, a PR addressing it was merged, but there was never a new release.

bwiltse2620 commented 2 years ago

@Henddher - Yes I did end up pulling the latest master and that solved it. I'm a little confused as to why a new release wasn't created, but yes, that did solve the issue. I specifically needed the filter.py and utils.py (for anyone reading this later).

Henddher commented 2 years ago

Glad that worked!

I really don’t know why @pmaupin never did a new release. It’s been many years since I fixed this issue.

Perhaps someone is willing to push a new release from an alternative fork to PyPI? I don’t know.

(Back then) There was an alternative: installing straight from a commit in the repo using pip, but I don’t recall the syntax :( pip install -e or something like that

Sunny-Engineer commented 2 years ago

Thanks for letting me know there was a fix! I didn't see that.

Doug

ReedGraff commented 6 months ago

Still an issue in 2024