pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

help with a script to PDF decompression #194

Closed ousia closed 6 years ago

ousia commented 6 years ago

@JorjMcKie,

Sorry to ask for your help, but I’d really appreciate it if you could help me with this.

In order to learn how to improve some aspects of PDF generation for TeX (ConTeXt, and mainly multimedia-related matters), I need a script that does what mutool clean -d does, plus a special feature.

This feature is to remove all contents from the streams.

That is, the clean file should be written with the stream parts as:

stream
endstream
endobj

I need this to make the PDF code more readable and to be able to use graphical diff tools (such as http://meldmerge.org/, or tkdiff.sf.net for Windows) to compare files.

Many thanks for your help.

JorjMcKie commented 6 years ago

No problem - you are very welcome!

Several questions for my understanding:

JorjMcKie commented 6 years ago

@ousia to get started at least somehow, here is a script that empties the /Contents of all pages. Please note that I can only set the contents buffer to length 1 (not length 0) unfortunately.

import sys
import fitz

fname = sys.argv[1]
doc = fitz.open(fname)

for page in doc:
    clist = page._getContents()  # list of xrefs of the page's /Contents streams
    for xref in clist:
        doc._updateStream(xref, b" ")  # currently, a stream cannot be set to length 0

doc.save("emptied-" + fname)

ousia commented 6 years ago

@JorjMcKie,

I need the emptied documents to read them myself and to check how certain objects are related to each other in the PDF document.

Here you have a sample document: minimal-document.pdf.

To make the source easier to read, I need these linebreaks (the ones mutool clean -d uses):

%PDF-1.7
%µ¶

7 0 obj
<<
  /Font <<
    /F9 10 0 R
  >>
  /ProcSet [ /PDF /Text ]
>>
endobj

8 0 obj
<<
  /Type /Page
  /Contents 9 0 R
  /Resources 7 0 R
  /MediaBox [ 0 0 131.12039 33.11796 ]
  /CropBox [ 0 0 131.12039 33.117967 ]
  /TrimBox [ 0 0 131.12039 33.117967 ]
  /Parent 11 0 R
>>
endobj

I hope it is clearer now. Many thanks for your help.

JorjMcKie commented 6 years ago

We need to continue working on our communication ... ;-)

If I understand correctly, you want to read the object definition of certain PDF objects (their "source"), and, potentially, also their content stream (if they are "stream" objects).

And you want to do this before and after something has happened to a PDF.

The key methods for this are Document._getXrefString(xref) and Document._getXrefStream(xref). There is also a method that returns the length of the xref table, Document._getXrefLength() (the number of objects plus 1, because index 0 is not usable). So the following is possible, for example:

doc = fitz.open() # make an empty PDF
doc.insertPage() # make an empty page
doc.save("minimal.pdf") # save a minimal PDF

This PDF looks like so:

%PDF-1.4
%µ¶

1 0 obj
<</Type/Catalog/Pages 2 0 R>>
endobj

2 0 obj
<</Type/Pages/Count 1/Kids[5 0 R]>>
endobj

3 0 obj
<<>>
endobj

4 0 obj
<</Length 0>>
stream

endstream
endobj

5 0 obj
<</Type/Page/MediaBox[0 0 595 842]/Rotate 0/Resources 3 0 R/Contents 4 0 R/Parent 2 0 R>>
endobj

xref
0 6
0000000000 00001 f 
0000000016 00000 n 
0000000062 00000 n 
0000000114 00000 n 
0000000135 00000 n 
0000000183 00000 n 

trailer
<</Size 6/Root 1 0 R>>
startxref
289
%%EOF

We now browse through all PDF objects, reading their "source" and checking their stream length.

>>> doc = fitz.open("minimal.pdf")
>>> xreflen = doc._getXrefLength() # = number of objects + 1 (index 0 not usable)
>>> for xref in range(1, xreflen):
    print("xref", xref, "=", doc._getXrefString(xref))
    try: # check whether the object has a stream
        c = doc._getXrefStream(xref)
        streamlen = len(c)
    except Exception:
        streamlen = -1 # indicate non-stream object
    print("stream length", streamlen)
    print("")

xref 1 = <</Type/Catalog/Pages 2 0 R>>
stream length -1

xref 2 = <</Type/Pages/Count 1/Kids[5 0 R]>>
stream length -1

xref 3 = <<>>
stream length -1

xref 4 = <</Length 0>>
stream length 0

xref 5 = <</Type/Page/MediaBox[0 0 595 842]/Rotate 0/Resources 3 0 R/Contents 4 0 R/Parent 2 0 R>>
stream length -1

>>>

I hope this demonstrates the possibilities. Maybe you can send me an example in its before and after state?

ousia commented 6 years ago

> We need to continue working on our communication ... 😉

Sorry, @JorjMcKie, I thought I was being clear before 😔.

> If I understand correctly, you want to read the object definition of certain PDF objects (their "source"), and, potentially, also their content stream (if they are "stream" objects).

I want to read the definition of all PDF objects.

> And you want to do this before and after something has happened to a PDF.

I want to decompress the PDF documents only once they have been generated.

The only requirement added to that decompression (on top of what mutool clean -d does) is that streams are empty.

> I hope this demonstrates the possibilities.

I’m afraid that document only looks perfect for me (line breaks and indentation aside) because its streams are already empty in the original document, so there is nothing to remove.

> Maybe you can send me an example in its before and after state?

  1. Here is the uncompressed version of the document I provided above, which reads:

    %PDF-1.7
    %µ¶
    
    7 0 obj
    <<
      /Font <<
        /F9 10 0 R
      >>
      /ProcSet [ /PDF /Text ]
    >>
    endobj
    
    8 0 obj
    <<
      /Type /Page
      /Contents 9 0 R
      /Resources 7 0 R
      /MediaBox [ 0 0 131.12039 33.11796 ]
      /CropBox [ 0 0 131.12039 33.117967 ]
      /TrimBox [ 0 0 131.12039 33.117967 ]
      /Parent 11 0 R
    >>
    endobj
    
    9 0 obj
    <<
      /Length 149
    >>
    stream
    0 g 0 G
    0 g 0 G
    BT
    /F9 11.955168 Tf 1 0 0 1 12.3471 12.47854 Tm [<004A0042004D0042004B001C0048>-366<002F0051>-31<002B006D004B0032004D00690058>]TJ
    ET
    
    endstream
    endobj
    
    10 0 obj
    <<
      /Type /Font
      /Subtype /Type0
      /Encoding /Identity-H
      /BaseFont /DHPIWR+LMSans10-Bold
      /DescendantFonts [ 19 0 R ]
      /ToUnicode 18 0 R
    >>
    endobj
    
    11 0 obj
    <<
      /Type /Pages
      /Count 1
      /Kids [ 8 0 R ]
    >>
    endobj
    
    13 0 obj
    <<
      /Subtype /XML
      /Type /Metadata
      /Length 2012
    >>
    stream
    <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?><x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:format>application/pdf</dc:format><dc:creator><rdf:Seq><rdf:li xml:lang="x-default"/></rdf:Seq></dc:creator><dc:description><rdf:Alt><rdf:li xml:lang="x-default"/></rdf:Alt></dc:description><dc:title><rdf:Alt><rdf:li xml:lang="x-default">a</rdf:li></rdf:Alt></dc:title></rdf:Description><rdf:Description rdf:about="" xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"><pdfx:ID>a | 2018-08-18T10:45:47+02:00</pdfx:ID><pdfx:ConTeXt.Jobname>a</pdfx:ConTeXt.Jobname><pdfx:ConTeXt.Time>2018-08-18 10:45</pdfx:ConTeXt.Time><pdfx:ConTeXt.Url>www.pragma-ade.com</pdfx:ConTeXt.Url><pdfx:ConTeXt.Support>contextgarden.net</pdfx:ConTeXt.Support><pdfx:ConTeXt.Version>2018.08.16 10:17</pdfx:ConTeXt.Version><pdfx:TeX.Support>tug.org</pdfx:TeX.Support><pdfx:LuaTeX.Version>1.08</pdfx:LuaTeX.Version><pdfx:LuaTeX.Functionality>6731</pdfx:LuaTeX.Functionality><pdfx:LuaTeX.LuaVersion>5.3</pdfx:LuaTeX.LuaVersion><pdfx:LuaTeX.Platform>linux-64</pdfx:LuaTeX.Platform></rdf:Description><rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/"><xmp:CreateDate>2018-08-18T10:45:47+02:00</xmp:CreateDate><xmp:CreatorTool>LuaTeX 1.08 6731 + ConTeXt MkIV 2018.08.16 10:17</xmp:CreatorTool><xmp:ModifyDate>2018-08-18T10:45:47+02:00</xmp:ModifyDate><xmp:MetadataDate>2018-08-18T10:45:47+02:00</xmp:MetadataDate></rdf:Description><rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/"><pdf:Keywords/><pdf:Producer>LuaTeX-1.08</pdf:Producer><pdf:Trapped>False</pdf:Trapped></rdf:Description><rdf:Description rdf:about="" 
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"><xmpMM:DocumentID>uuid:988b3129-4dbf-b79d-dfff-4c5034b8aeb8</xmpMM:DocumentID><xmpMM:InstanceID>uuid:1bdc8445-40ca-90ba-ea6a-b99587fd17b0</xmpMM:InstanceID></rdf:Description></rdf:RDF></x:xmpmeta><?xpacket end="w"?>
    endstream
    endobj
    
    14 0 obj
    [ 28 [ 524.7 ] 43 [ 488.7 ] 47 [ 560.7 ] 50 [ 510.7 ] 66 [ 255.9 ]
      72 [ 255.9 ] 74 [ 977.5 866.5 ] 77 [ 560.7 ] 81 [ 549.7 ] 88
      [ 305.8 ] 105 [ 403.8 ] 109 [ 560.7 ] ]
    endobj
    
    15 0 obj
    <<
      /Length 14
    >>
    stream
    [14 bytes of binary data; not representable as text]
    endstream
    endobj
    
    16 0 obj
    <<
      /Subtype /CIDFontType0C
      /Length 1851
    >>
    stream
    [1851 bytes of binary font data for DHPIWR+LMSans10-Bold; not representable as text]
    endstream
    endobj
    
    17 0 obj
    <<
      /Type /FontDescriptor
      /FontName /DHPIWR+LMSans10-Bold
      /Flags 4
      /FontBBox [ -460 -297 1761 1134 ]
      /Ascent 1134
      /CapHeight 694
      /Descent -297
      /ItalicAngle 0
      /StemV 102
      /XHeight 458
      /FontFile3 16 0 R
      /CIDSet 15 0 R
    >>
    endobj
    
    18 0 obj
    <<
      /Length 841
    >>
    stream
    %!PS-Adobe-3.0 Resource-CMap
    %%DocumentNeededResources: ProcSet (CIDInit)
    %%IncludeResource: ProcSet (CIDInit)
    %%BeginResource: CMap (TeX-DHPIWR-LMSans10-Bold-0)
    %%Title: (TeX-DHPIWR-LMSans10-Bold-0 TeX DHPIWR-LMSans10-Bold 0)
    %%Version: 1.000
    %%EndComments
    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo
    << /Registry (TeX)
    /Ordering (DHPIWR-LMSans10-Bold)
    /Supplement 0
    >> def
    /CMapName /TeX-Identity-DHPIWR-LMSans10-Bold def
    /CMapType 2 def
    1 begincodespacerange
    <0000> <FFFF>
    endcodespacerange
    0 beginbfrange
    endbfrange
    13 beginbfchar
    <001C> <0061>
    <002B> <0063>
    <002F> <0064>
    <0032> <0065>
    <0042> <0069>
    <0048> <006C>
    <004A> <004D>
    <004B> <006D>
    <004D> <006E>
    <0051> <006F>
    <0058> <002E>
    <0069> <0074>
    <006D> <0075>
    endbfchar
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end
    %%EndResource
    %%EOF
    
    endstream
    endobj
    
    19 0 obj
    <<
      /Type /Font
      /Subtype /CIDFontType0
      /BaseFont /DHPIWR+LMSans10-Bold
      /FontDescriptor 17 0 R
      /W 14 0 R
      /CIDSystemInfo <<
        /Registry (Adobe)
        /Ordering (Identity)
        /Supplement 0
      >>
    >>
    endobj
    
    20 0 obj
    <<
      /Type /Catalog
      /Pages 11 0 R
      /Lang (en)
      /Metadata 13 0 R
      /PageLabels <<
        /Nums [ 0 <<
              /S /D
              /St 1
            >> ]
      >>
      /PageMode /UseNone
      /Version /1.7
    >>
    endobj
    
    21 0 obj
    <<
      /ConTeXt.Jobname (a)
      /ConTeXt.Support (contextgarden.net)
      /ConTeXt.Time (2018-08-18 10:45)
      /ConTeXt.Url (www.pragma-ade.com)
      /ConTeXt.Version (2018.08.16 10:17)
      /CreationDate (D:20180818104547+02'00')
      /Creator (\376\377\000L\000u\000a\000T\000e\000X\000 \0001\000.\0000\0008\000 \0006\0007\0003\0001\000 \000+\000 \000C\000o\000n\000T\000e\000X\000t\000 \000M\000k\000I\000V\000 \0002\0000\0001\0008\000.\0000\0008\000.\0001\0006\000 \0001\0000\000:\0001\0007)
      /ID (a | 2018-08-18T10:45:47+02:00)
      /ModDate (D:20180818104547+02'00')
      /Producer (LuaTeX-1.08)
      /TeX.Support (tug.org)
      /Title <FEFF0061>
      /Trapped /False
    >>
    endobj
    
    xref
    0 23
    0000000001 00256 f
    0000000002 00256 f
    0000000003 00256 f
    0000000004 00256 f
    0000000005 00256 f
    0000000006 00256 f
    0000000012 00256 f
    0000000016 00000 n
    0000000095 00000 n
    0000000302 00000 n
    0000000505 00000 n
    0000000668 00000 n
    0000000022 00001 f
    0000000735 00000 n
    0000002837 00000 n
    0000003026 00000 n
    0000003094 00000 n
    0000005027 00000 n
    0000005285 00000 n
    0000006181 00000 n
    0000006404 00000 n
    0000006613 00000 n
    0000007275 00001 f
    
    trailer
    <<
      /Size 23
      /Info 21 0 R
      /Root 20 0 R
      /ID [ (\253\232\2760\341{\210p,\210\\C\026\251\255\274) ([\251?"\215k\356OG\275\035\354h\365\017') ]
    >>
    startxref
    7275
    %%EOF
  2. The version emptied by hand is in the emptied minimal document, which contains the following source:

    %PDF-1.7
    %µ¶
    
    7 0 obj
    <<
      /Font <<
        /F9 10 0 R
      >>
      /ProcSet [ /PDF /Text ]
    >>
    endobj
    
    8 0 obj
    <<
      /Type /Page
      /Contents 9 0 R
      /Resources 7 0 R
      /MediaBox [ 0 0 131.12039 33.11796 ]
      /CropBox [ 0 0 131.12039 33.117967 ]
      /TrimBox [ 0 0 131.12039 33.117967 ]
      /Parent 11 0 R
    >>
    endobj
    
    9 0 obj
    <<
      /Length 149
    >>
    stream
    endstream
    endobj
    
    10 0 obj
    <<
      /Type /Font
      /Subtype /Type0
      /Encoding /Identity-H
      /BaseFont /DHPIWR+LMSans10-Bold
      /DescendantFonts [ 19 0 R ]
      /ToUnicode 18 0 R
    >>
    endobj
    
    11 0 obj
    <<
      /Type /Pages
      /Count 1
      /Kids [ 8 0 R ]
    >>
    endobj
    
    13 0 obj
    <<
      /Subtype /XML
      /Type /Metadata
      /Length 2012
    >>
    stream
    endstream
    endobj
    
    14 0 obj
    [ 28 [ 524.7 ] 43 [ 488.7 ] 47 [ 560.7 ] 50 [ 510.7 ] 66 [ 255.9 ]
      72 [ 255.9 ] 74 [ 977.5 866.5 ] 77 [ 560.7 ] 81 [ 549.7 ] 88
      [ 305.8 ] 105 [ 403.8 ] 109 [ 560.7 ] ]
    endobj
    
    15 0 obj
    <<
      /Length 14
    >>
    stream
    endstream
    endobj
    
    16 0 obj
    <<
      /Subtype /CIDFontType0C
      /Length 1851
    >>
    stream
    endstream
    endobj
    
    17 0 obj
    <<
      /Type /FontDescriptor
      /FontName /DHPIWR+LMSans10-Bold
      /Flags 4
      /FontBBox [ -460 -297 1761 1134 ]
      /Ascent 1134
      /CapHeight 694
      /Descent -297
      /ItalicAngle 0
      /StemV 102
      /XHeight 458
      /FontFile3 16 0 R
      /CIDSet 15 0 R
    >>
    endobj
    
    18 0 obj
    <<
      /Length 841
    >>
    stream
    endstream
    endobj
    
    19 0 obj
    <<
      /Type /Font
      /Subtype /CIDFontType0
      /BaseFont /DHPIWR+LMSans10-Bold
      /FontDescriptor 17 0 R
      /W 14 0 R
      /CIDSystemInfo <<
        /Registry (Adobe)
        /Ordering (Identity)
        /Supplement 0
      >>
    >>
    endobj
    
    20 0 obj
    <<
      /Type /Catalog
      /Pages 11 0 R
      /Lang (en)
      /Metadata 13 0 R
      /PageLabels <<
        /Nums [ 0 <<
              /S /D
              /St 1
            >> ]
      >>
      /PageMode /UseNone
      /Version /1.7
    >>
    endobj
    
    21 0 obj
    <<
      /ConTeXt.Jobname (a)
      /ConTeXt.Support (contextgarden.net)
      /ConTeXt.Time (2018-08-18 10:45)
      /ConTeXt.Url (www.pragma-ade.com)
      /ConTeXt.Version (2018.08.16 10:17)
      /CreationDate (D:20180818104547+02'00')
      /Creator (\376\377\000L\000u\000a\000T\000e\000X\000 \0001\000.\0000\0008\000 \0006\0007\0003\0001\000 \000+\000 \000C\000o\000n\000T\000e\000X\000t\000 \000M\000k\000I\000V\000 \0002\0000\0001\0008\000.\0000\0008\000.\0001\0006\000 \0001\0000\000:\0001\0007)
      /ID (a | 2018-08-18T10:45:47+02:00)
      /ModDate (D:20180818104547+02'00')
      /Producer (LuaTeX-1.08)
      /TeX.Support (tug.org)
      /Title <FEFF0061>
      /Trapped /False
    >>
    endobj
    
    xref
    0 23
    0000000001 00256 f
    0000000002 00256 f
    0000000003 00256 f
    0000000004 00256 f
    0000000005 00256 f
    0000000006 00256 f
    0000000012 00256 f
    0000000016 00000 n
    0000000095 00000 n
    0000000302 00000 n
    0000000505 00000 n
    0000000668 00000 n
    0000000022 00001 f
    0000000735 00000 n
    0000002837 00000 n
    0000003026 00000 n
    0000003094 00000 n
    0000005027 00000 n
    0000005285 00000 n
    0000006181 00000 n
    0000006404 00000 n
    0000006613 00000 n
    0000007275 00001 f
    
    trailer
    <<
      /Size 23
      /Info 21 0 R
      /Root 20 0 R
      /ID [ (\253\232\2760\341{\210p,\210\\C\026\251\255\274) ([\251?"\215k\356OG\275\035\354h\365\017') ]
    >>
    startxref
    7275
    %%EOF

All I need is a decompression that inserts line breaks and indentation as mutool clean -d does and removes all content between each stream tag and its matching endstream tag.
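The requirement above (keep the layout of mutool clean -d, but drop everything between stream and endstream) can also be sketched as a plain post-processing step on a file that has already been decompressed. This is only a minimal sketch under stated assumptions, not what mutool or PyMuPDF do internally: strip_streams is a hypothetical helper, it works on raw bytes with a regular expression, and it leaves /Length entries and xref offsets untouched, so the result is meant for reading and diffing, not for opening in a viewer. A binary stream that happens to contain the bytes endstream would be cut short.

```python
import re

# "stream" keyword that opens stream data: not preceded by a letter (so the
# tail of "endstream" never matches) and followed by an end-of-line marker.
# DOTALL lets "." also match newline bytes inside the stream data.
_STREAM_RE = re.compile(
    rb"(?<![A-Za-z])stream(?:\r\n|\n|\r).*?endstream", re.DOTALL
)


def strip_streams(pdf_bytes: bytes) -> bytes:
    """Return pdf_bytes with every stream body removed (hypothetical helper)."""
    return _STREAM_RE.sub(b"stream\nendstream", pdf_bytes)


# Shortened version of object 9 from the sample document above.
sample = (
    b"9 0 obj\n<<\n  /Length 149\n>>\nstream\n"
    b"0 g 0 G\nBT\nET\n\nendstream\nendobj\n"
)
print(strip_streams(sample).decode("latin-1"))
```

For a whole file, one would read it with open(path, "rb"), pass the bytes through strip_streams, and write the result to a new file for the diff tool.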

It might be too simple, but this is what I need to check information on some PDF documents.

Let me know if my explanation still isn’t clear.

JorjMcKie commented 6 years ago

Got it now - 99%:

Here is a script that removes only page content streams and saves the PDF pretty-printed:

import sys
import fitz

fname = sys.argv[1]
doc = fitz.open(fname)
empty = b" "  # currently, a stream cannot be made length 0

for page in doc:
    clist = page._getContents()  # list of xrefs of the page's /Contents streams
    for xref in clist:
        try:
            doc._updateStream(xref, empty)
        except Exception:
            pass  # skip xrefs whose stream cannot be updated

doc.save("emptied-" + fname, pretty=True)

And here is a script that ruthlessly deletes every stream:

import sys
import fitz

fname = sys.argv[1]
doc = fitz.open(fname)
empty = b" "  # currently, a stream cannot be made length 0
xreflen = doc._getXrefLength()

for xref in range(1, xreflen):
    try:
        c = doc._getXrefStream(xref)
        if c is not None:  # this for sure is a stream object
            doc._updateStream(xref, empty)
    except Exception:
        pass  # not a stream object (the current version raises here)

doc.save("emptied-" + fname, pretty=True)

Comment: I am preparing v1.13.17. That version will no longer raise an exception in doc._getXrefStream(xref) if an xref is not a stream object, but will return None instead. To be prepared, I have already included the check "if c is not None:".
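The version difference described here (an exception now, None from v1.13.17 on) can be hidden behind a small wrapper so that the calling loop does not care which behaviour it gets. A minimal sketch: stream_or_none is a hypothetical helper, and FakeDoc is a stand-in invented for this example so that the snippet runs without PyMuPDF; only the method name _getXrefStream comes from the scripts above, and the exception type used by the stand-in is an assumption.

```python
def stream_or_none(doc, xref):
    """Return the stream bytes of xref, or None if it is not a stream object.

    Hypothetical helper: works whether doc._getXrefStream raises for
    non-stream objects or returns None for them.
    """
    try:
        return doc._getXrefStream(xref)
    except Exception:
        return None


class FakeDoc:
    """Stand-in for a document, invented for this example (not PyMuPDF)."""

    def __init__(self, streams, raise_on_missing):
        self.streams = streams                  # xref -> stream bytes
        self.raise_on_missing = raise_on_missing

    def _getXrefStream(self, xref):
        if xref in self.streams:
            return self.streams[xref]
        if self.raise_on_missing:
            raise ValueError("xref is not a stream object")
        return None


old_style = FakeDoc({4: b""}, raise_on_missing=True)
new_style = FakeDoc({4: b""}, raise_on_missing=False)
for doc in (old_style, new_style):
    # xref 3 is not a stream, xref 4 is an empty stream
    print(stream_or_none(doc, 3), stream_or_none(doc, 4))
```

Callers should compare the result with "is not None" rather than rely on truthiness, because an empty stream (b"") is a stream object too but evaluates as false.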

ousia commented 6 years ago

@JorjMcKie, many thanks for the second script. It is what I need and it works fine with the minimal document above. But I’m afraid that it has problems with bigger files.

The two PDF documents available from http://www.davidgilmour.com/freedom/ give the following errors (streams aren’t removed):

  1. http://www.davidgilmour.com/freedom/AGreatDayForFreedom.pdf:

    warning: ignoring zlib error: incorrect header check
  2. http://www.davidgilmour.com/freedom/AGreatDayForFreedom_LITE.pdf:

    warning: ignoring zlib error: incorrect header check
    error: corrupt object stream (4 0 R)
    error: corrupt object stream (10 0 R)
    error: corrupt object stream (10 0 R)
    error: corrupt object stream (14 0 R)
    [...]
    error: corrupt object stream (489 0 R)
    error: corrupt object stream (489 0 R)
    warning: cannot load object (498 0 R) into cache
    error: corrupt object stream (4 0 R)
    Traceback (most recent call last):
      File "/home/ousia/bin/remove-streams", line 18, in <module>
        doc.save("emptied-" + fname, pretty=True)
      File "/usr/lib64/python2.7/site-packages/fitz/fitz.py", line 1293, in save
        return _fitz.Document_save(self, filename, garbage, clean, deflate, incremental, ascii, expand, linear, pretty)
    RuntimeError: corrupt object stream (4 0 R)

JorjMcKie commented 6 years ago

The first file works fine for me. In detail: if I look at the stream object lengths via PyMuPDF, they are all of length 1 (as they should be). But if I look at the file with an editor, all streams are shown with length 32 and some garbage between "stream" and "endstream". That would probably not be too bad for you, or would it?

The second file is one of the special cases I was afraid of: it contains special streams which must never be deleted. They are of "/Type/ObjStm", i.e. they in turn contain more object source definitions in compressed form. So the script I gave you must be modified:

import sys
import fitz

fname = sys.argv[1]
doc = fitz.open(fname)
empty = b" "  # currently, a stream cannot be made length 0
xreflen = doc._getXrefLength()

for xref in range(1, xreflen):
    objsrc = doc._getXrefString(xref)  # read the object definition
    if "/ObjStm" in objsrc:  # never kill streams for these!
        continue
    try:
        c = doc._getXrefStream(xref)
        if c:  # this for sure is a (non-empty) stream object
            doc._updateStream(xref, empty)
    except Exception:
        pass  # not a stream object

doc.save("emptied-" + fname, pretty=True)

I tested this with both PDFs and they worked ok. The zlib warning can obviously be ignored. It sometimes occurs when I check whether I indeed have a stream object (._getXrefStream...). But I am only interested in that fact as such, not in the data themselves (I want to delete them anyway!). Ok, then. Let's keep trying!
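The substring test "/ObjStm" in objsrc in the script above is deliberately simple. A slightly stricter variant, sketched here as an assumption and not as part of PyMuPDF, only matches when /ObjStm appears as the value of the /Type key, whether the object source is compact or pretty-printed. is_object_stream is a hypothetical helper name, and the sample strings are made up for illustration.

```python
import re

# /Type followed by optional whitespace and the name /ObjStm; the lookahead
# rejects longer names that merely start with "ObjStm".
_OBJSTM_RE = re.compile(r"/Type\s*/ObjStm(?![A-Za-z0-9])")


def is_object_stream(objsrc: str) -> bool:
    """True if the object source declares /Type /ObjStm (hypothetical helper)."""
    return _OBJSTM_RE.search(objsrc) is not None


print(is_object_stream("<</Type/ObjStm/N 20/First 141/Length 1200>>"))  # True
print(is_object_stream("<<\n  /Type /ObjStm\n  /N 20\n>>"))              # True
print(is_object_stream("<</Type/Page/Contents 4 0 R>>"))                # False
```

In the loop above, the line if "/ObjStm" in objsrc: could then read if is_object_stream(objsrc):, which avoids skipping an object only because the text /ObjStm occurs somewhere else in its definition.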

JorjMcKie commented 6 years ago

Just realized another complication! Shame on me! Your examples are encrypted like this:

>>> doc = fitz.open("AGreatDayForFreedom.pdf")
>>> doc.permissions
{'print': True, 'edit': False, 'copy': False, 'note': True}
>>> 

This explains the odd behaviour of stream objects always ending up with length 32, even when I thought I had deleted them. If I heal it like this:

doc2 = fitz.open()
doc2.insertPDF(doc) # doc = the above
doc2.save("AGreatDayForFreedom-new.pdf") #save a non-protected copy

... and then apply my emptying script to this result, then all works fine! Another option would of course be to first decrypt the originals (which you cannot do, I suppose).
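A script could look at the permissions up front and only take the detour through an unprotected copy when it is needed. The following is a minimal sketch that assumes doc.permissions returns a dict shaped like the one shown above; denied_permissions is a hypothetical helper, and treating any denied permission as a sign of a protected file is a heuristic taken from this thread, not a documented rule.

```python
def denied_permissions(permissions: dict) -> list:
    """Return the sorted names of all permissions that are switched off.

    Hypothetical helper; expects a dict of flag name -> bool.
    """
    return sorted(name for name, allowed in permissions.items() if not allowed)


# The permissions reported for AGreatDayForFreedom.pdf in the session above.
perms = {'print': True, 'edit': False, 'copy': False, 'note': True}

denied = denied_permissions(perms)
if denied:
    print("protected document, denied:", denied)
else:
    print("no restrictions found")
```

If the list is not empty, the document would first be copied into a new PDF as shown above, and the emptying script would then run on that copy.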

ousia commented 6 years ago

Many thanks for your help, @JorjMcKie.

It works perfectly fine now with the new code and the previous operation to enable the effective removal of streams.

JorjMcKie commented 6 years ago

@ousia it was a pleasure to help. As always, your input again ignited ideas for improvement: I found a way to truly set streams to length 0.

JorjMcKie commented 5 years ago

Just wanted to make you aware of a new feature (since v1.14.15) relevant to this issue: doc._getXrefString(xref) is now able (by default!) to return the object definition of any xref in a decompressed form, as shown below. This deterministic format can easily be interpreted by Python ...

>>> print(doc._getXrefString(1))
<<
  /Pages 2 0 R
  /Type /Catalog
  /Names <<
    /EmbeddedFiles <<
      /Names [ (t1) <<
            /CI <<
            >>
            /EF <<
              /F 160 0 R
            >>
            /F (t1)
            /UF (t1)
            /Desc (t1)
            /Type /Filespec
          >> ]
    >>
  >>
  /PageMode /UseAttachments
>>
>>> # similar for the document's trailer object:
>>> print(doc._getTrailerString())
<<
  /Info 159 0 R
  /Root 1 0 R
  /Size 161
  /Prev 3791788
>>
>>> # the /Prev parameter also shows: PDF has been updated ...
>>> # a new method determines if an xref is a stream:
>>> doc.isStream(1)
False
>>> doc.isStream(160)
True
>>> # via Python's string parsing, one could also
>>> # determine the various length / size parameters of a stream:
>>> print(doc._getXrefString(160))
<<
  /Length 21
  /DL 21
  /Params <<
    /Size 21
  >>
>>
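The string parsing mentioned in the session above could look like the following. It is a minimal sketch, not a PDF parser: int_entries is a hypothetical helper that collects every key followed by a plain non-negative integer from the pretty-printed object source, flattens nested dictionaries into one result, skips indirect references such as 160 0 R, and does not try to recognise text inside PDF strings.

```python
import re

# A name key followed by a plain integer, e.g. "/Length 21".
# (?![.\d]) rejects real numbers and partial matches inside longer numbers;
# the second lookahead skips indirect references such as "/F 160 0 R".
_INT_ENTRY_RE = re.compile(
    r"/([A-Za-z][\w.]*)\s+(\d+)(?![.\d])(?!\s+\d+\s+R\b)"
)


def int_entries(objsrc: str) -> dict:
    """Map key name -> integer value for all plain integer entries.

    Hypothetical helper; nested dictionaries are flattened, and a key that
    occurs more than once keeps its last value.
    """
    return {key: int(value) for key, value in _INT_ENTRY_RE.findall(objsrc)}


# Object definition of xref 160 as printed in the session above.
objsrc = "<<\n  /Length 21\n  /DL 21\n  /Params <<\n    /Size 21\n  >>\n>>"
print(int_entries(objsrc))  # {'Length': 21, 'DL': 21, 'Size': 21}
```

With a real document, objsrc would come from doc._getXrefString(xref), after doc.isStream(xref) has confirmed that the object is a stream.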