Produce linearized PDFs

Lucas-C commented 3 years ago

The scope of this feature is to add support to fpdf2 to produce linearized PDFs.

Appendix F of the PDF 1.7 spec should be helpful in implementing this.

Linearized PDF requires two additions to the PDF specification:
• Rules for the ordering of objects in the PDF file
• Additional optional data structures, called hint tables, that enable efficient navigation within the document

qpdf --check-linearization / --show-linearization can also be used to ensure the generated PDFs are valid.

By implementing this feature you, as a benevolent FLOSS developer, will give the large community of fpdf2 users access to a standard and useful PDF feature. Moreover, by working on this feature, you will learn about PDF syntax and the lifecycle & structure of a popular Python library. You will also be added to the contributors list & map.

As a contributor you will be able to design and expose this feature as you want in the library.

Implementing this can count as part of Hacktoberfest.

Lucas-C commented 3 years ago

This could be checked using pikepdf or qpdf:

qpdf --check-linearization / --show-linearization
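For use in a test suite, the qpdf check can be wrapped in a few lines of Python. This is a minimal sketch, assuming qpdf is available on the PATH; the helper names `qpdf_check_command` and `run_linearization_check` are hypothetical and not part of fpdf2. It returns qpdf's exit code and raw output without interpreting them, because qpdf may also report that a file is simply not linearized, so the output should be read rather than relying on the exit code alone.

```python
import shutil
import subprocess


def qpdf_check_command(pdf_path):
    """Build the argv for qpdf's linearization check (hypothetical helper)."""
    return ["qpdf", "--check-linearization", str(pdf_path)]


def run_linearization_check(pdf_path):
    """Run qpdf and return (exit_code, output), or None if qpdf is not installed."""
    if shutil.which("qpdf") is None:
        return None
    result = subprocess.run(
        qpdf_check_command(pdf_path),
        capture_output=True,
        text=True,
        check=False,  # let the caller decide how to treat a non-zero exit code
    )
    return result.returncode, result.stdout + result.stderr
```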

chandan00761 commented 2 years ago

I would like to work on this issue.

Lucas-C commented 2 years ago

Great @chandan00761 !

How familiar are you with fpdf2 and Python development in general?

As a starting point, I would recommend that you have a look at the Development documentation page. You could start by getting the sources with git, installing them with pip install -e . and launching the unit tests with pytest.

If you have any questions (on the code, tests, how things work...), feel free to ping me! 😊

chandan00761 commented 2 years ago

Thank you for replying so quickly. I have used Python mainly to develop some scripts (a scraper, goods transportation report generation) and web servers using Django. However, I am new to open source. I have used fpdf2 to generate PDF reports in the goods transportation report generation script.

I am currently reading about linearized pdf and have set up a local development environment. However when testing, I see that 3 cases fail.

Here is the summary:

and here are the full test logs: https://pastebin.com/Z7pa2h2G

Lucas-C commented 2 years ago

Thank you for reporting this! I fixed those tests in https://github.com/PyFPDF/fpdf2/commit/f0e2a40. If you update your local repository copy (here is a guide to update your fork) the tests should now pass. You may also want to install qpdf in order to get more helpful error messages when tests fail.

Lucas-C commented 2 years ago

Hi @chandan00761 !

Have you been able to move forward on this? 😊

chandan00761 commented 2 years ago

@Lucas-C Sorry, I was busy with my semester exams. I am free now and looking into it. I have read the pdf spec file and will start the implementation.

chandan00761 commented 2 years ago

In the linearization parameter dictionary, there is an entry for the length of the entire file in bytes. Does this include the size of the dictionary itself?

Lucas-C commented 2 years ago

I don't know. Maybe you could use PikePDF & qpdf to check this length value? cf. test_pdf.py

Lucas-C commented 2 years ago

Have you been able to find an answer there @chandan00761? Are you still planning to work on this? If not, no worries, I'd just like to make it clear for other contributors that this feature is "up-for-grabs" 😊

There is a general methodology I have used frequently while adding features to fpdf2, and that I would recommend adopting here:

Find a reference linearized PDF, or craft one using other software
Use qpdf --qdf --compress-streams=n $in_file.pdf $out_file.pdf to produce a "pretty-formatted" PDF
Open the "pretty-formatted" PDF in a text editor or IDE in order to study its structure

chandan00761 commented 2 years ago

I am still working on it. However, I haven't worked with PDF at the byte level before, so it is taking a lot of time to understand some concepts.

Lucas-C commented 2 years ago

Ok! Feel free to ask any questions here, I'd be happy to help by answering them if I can.

chandan00761 commented 2 years ago

What is the use of _trace_size ? Should I use it when placing my objects? Also, are all the object identifiers of indirect objects in sequential order? (Like starting from 2 and going to 3, 4, 5 ... without changing order?)

Lucas-C commented 2 years ago

What is the use of _trace_size ?

This internal method tracks the size of every section in the final PDF (images, fonts, pages...) when logging is configured.

Should I use it when placing my objects?

Only if you introduce a new top-level resource type.

are all the object identifiers of indirect objects in sequential order?

If I understood your question correctly, then yes.

Lucas-C commented 2 years ago

As it has been a few months now without any update, I guess this issue is up-for-grabs 😊

Anybody is welcome to give it a try!

Lucas-C commented 2 years ago

I had a look at this feature, and implementing it will require some big refactoring.

Here is a naive starting point, a new method that should be called just after _putheader() in _enddoc(), because this PDF object must be inserted first in the document:

    def _putlinearization(self):
        "Inserting the linearization parameter dictionary"
        self._newobj()
        self._out(pdf_dict({
            "/Linearized": 1.0,  # Version
            "/L": len(self.buffer),  # File length
            "/H": [ ? ],  # Primary hint stream offset and length (part 5)
            "/O": object_id_for_page(1),  # Object number of first page’s page object (part 6)
            "/E": ?,  # Offset of end of first page
            "/N": self.pages_count,
            "/T": self.offsets[1],  # Offset of first entry in main cross-reference table (part 11)
        }))
        self._out("endobj")

As indicated by the code comments, several numbers must be known:

the full file length (= value of len(self.buffer) after having inserted the %%EOF)
the offsets (= byte position in the buffer) of several PDF objects: hint streams, end of first page (= len(self.buffer) after inserting the first page in _putpages()), first entry in the main cross-reference table

Knowing those values before the call to _putlinearization() will require some code overhaul.

One potential strategy could be to first insert a placeholder (made of % characters?) in the buffer at this stage and then, after inserting the %%EOF in the buffer, replace this placeholder with the real linearization parameter dictionary. This is the strategy currently used for document signing (see https://github.com/PyFPDF/fpdf2/blob/master/fpdf/sign.py#L24). One specific point of the PDF spec would help if we adopt this approach:

The linearization parameter dictionary shall be entirely contained within the first 1024 bytes of the PDF file.
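The placeholder strategy can be illustrated on a plain bytearray. This is only a sketch under stated assumptions: `PLACEHOLDER_SIZE`, `render_linearization_object` and the stand-in document body are hypothetical, only the /Linearized and /L entries are filled in, and nothing here reflects how fpdf2 actually manages its buffer. The key point is that the replacement has exactly the same size as the placeholder, so no byte offset recorded after it is invalidated.

```python
PLACEHOLDER_SIZE = 200  # hypothetical size, small enough to stay within the first 1024 bytes


def render_linearization_object(file_length):
    """Render a (truncated) linearization object padded to PLACEHOLDER_SIZE bytes."""
    obj = f"1 0 obj\n<< /Linearized 1 /L {file_length} >>\nendobj".encode("ascii")
    padding = PLACEHOLDER_SIZE - len(obj) - 1
    if padding < 0:
        raise ValueError("placeholder too small for the linearization dictionary")
    # Whitespace padding keeps the object the same size whatever the value of /L
    return obj + b" " * padding + b"\n"


# Phase 1: write the header, reserve room, then write everything else
buffer = bytearray(b"%PDF-1.7\n")
placeholder_start = len(buffer)
buffer += b"%" * (PLACEHOLDER_SIZE - 1) + b"\n"  # placeholder made of % characters
buffer += b"2 0 obj\n<< /Type /Catalog >>\nendobj\n"  # stand-in for the rest of the document
buffer += b"%%EOF\n"

# Phase 2: the total length is now known, so substitute the real object in place
final_object = render_linearization_object(len(buffer))
buffer[placeholder_start:placeholder_start + PLACEHOLDER_SIZE] = final_object
```

Because the slice assignment replaces PLACEHOLDER_SIZE bytes with PLACEHOLDER_SIZE bytes, len(buffer) is the same before and after the substitution, which is what makes the /L value written in phase 2 correct.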

But the most challenging part will probably be to change the order in which the PDF objects are rendered by fpdf2 in _enddoc(), to conform to the order required for linearized PDF documents:

1. Header
2. Linearization parameter dictionary (new object)
3. First-page cross-reference table and trailer (new object)
4. Document catalogue and other required document-level objects (must be rendered earlier than currently)
5. Primary hint stream (may precede or follow part 6) (new object)
6. First-page section (may precede or follow part 5)
7. Remaining pages
8. Shared objects for all pages except the first
9. Objects not associated with pages, if any (XMP metadata? Info object? Embedded files not associated with a /FileAttachment annotation?)
10. Overflow hint stream (optional)
11. Main cross-reference table and trailer

Among other things, this will have some impact on util.object_id_for_page() and all the parts of the code that rely on this utility function.
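One way to think about the reordering is to tag every object with the part it belongs to and sort on that tag. The sketch below is purely illustrative: the `Part` enum, `linearized_order` and the object names are hypothetical, and the real work in fpdf2 (renumbering objects and updating the references between them) is not shown.

```python
from enum import IntEnum


class Part(IntEnum):
    """Parts of a linearized PDF file, in the order they must appear."""
    HEADER = 1
    LINEARIZATION_DICT = 2
    FIRST_PAGE_XREF = 3
    CATALOG = 4
    PRIMARY_HINT_STREAM = 5
    FIRST_PAGE = 6
    REMAINING_PAGES = 7
    SHARED_OBJECTS = 8
    OTHER_OBJECTS = 9
    OVERFLOW_HINT_STREAM = 10
    MAIN_XREF = 11


def linearized_order(objects):
    """Sort (part, name) pairs by part; sorted() is stable, so objects
    within the same part keep their relative order."""
    return sorted(objects, key=lambda item: item[0])


# Hypothetical objects, listed in an arbitrary non-linearized order
objects = [
    (Part.FIRST_PAGE, "page 1"),
    (Part.REMAINING_PAGES, "page 2"),
    (Part.REMAINING_PAGES, "page 3"),
    (Part.SHARED_OBJECTS, "font shared by pages 2 and 3"),
    (Part.OTHER_OBJECTS, "info"),
    (Part.CATALOG, "catalog"),
]
ordered_names = [name for _, name in linearized_order(objects)]
```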

ghost commented 2 years ago

Part 7 adds "Each successive page followed by its nonshared objects". If I understand this correctly, that means that if I embed a file and reference it from both page 1 and page 10,000 (for example, through a FileAttachment annotation on each page pointing to the same object number), the object is shared. If I only reference it once, on page 1, it is nonshared. A nonshared object should immediately follow its page in the file, whereas a shared object should go at the end (the assumption is probably that shared objects are less interesting than unique objects to a reader with a slow internet connection). If this is correct, it would be difficult to implement in a single pass.
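To make the shared / nonshared distinction concrete, here is a small sketch that classifies objects from a page-to-references mapping. The function name `classify_objects` and the input format are hypothetical, and the special handling of objects used by the first page is ignored. It also shows why a single pass is hard: the classification needs the references of every page before any object can be placed.

```python
from collections import defaultdict


def classify_objects(page_refs):
    """page_refs maps a page number to the set of object ids that page references.

    Returns (nonshared, shared): nonshared maps each page to the sorted ids
    referenced by that page only; shared is the sorted list of ids referenced
    by more than one page.
    """
    pages_by_object = defaultdict(set)
    for page, object_ids in page_refs.items():
        for object_id in object_ids:
            pages_by_object[object_id].add(page)
    shared = sorted(o for o, pages in pages_by_object.items() if len(pages) > 1)
    nonshared = {
        page: sorted(o for o in object_ids if len(pages_by_object[o]) == 1)
        for page, object_ids in page_refs.items()
    }
    return nonshared, shared


# The embedded file is referenced from page 1 and page 10,000, so it is shared
nonshared, shared = classify_objects({1: {"embedded_file", "image_a"}, 10000: {"embedded_file"}})
```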

Regarding the problem with the file size, I think the solution is to look at the xref table: it can only store byte offsets of up to 10 digits (I think this was the number, but I am not sure anymore). That means the file size can have at most 10 digits too. Unneeded digit positions can simply be filled with spaces. With this, we can probably calculate len(self.buffer) + len(lin_header_with_fixed_size_10_digits) and write this number in the header without changing the final size.
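The fixed-width idea can be sketched as follows. The names `OFFSET_WIDTH`, `fixed_width_length_entry` and `total_file_length` are hypothetical, and the header template only contains the /Linearized and /L entries. Since the /L entry always occupies the same number of bytes, the header length is known before the value is, and the total can be computed up front.

```python
OFFSET_WIDTH = 10  # assumption: same digit count as a cross-reference table byte offset

HEADER_TEMPLATE = "1 0 obj\n<< /Linearized 1 {entry} >>\nendobj\n"


def fixed_width_length_entry(value):
    """Render the /L entry so it always occupies the same number of bytes."""
    digits = str(value)
    if len(digits) > OFFSET_WIDTH:
        raise ValueError("file length does not fit in the fixed-width field")
    # Unused digit positions are filled with trailing spaces
    return "/L " + digits.ljust(OFFSET_WIDTH)


def total_file_length(rest_of_file_length):
    """Length of the whole file: the fixed-size header plus everything after it."""
    header_length = len(HEADER_TEMPLATE.format(entry=fixed_width_length_entry(0)))
    return rest_of_file_length + header_length


# The header rendered with the final value has the length that was assumed
total = total_file_length(1000)
header = HEADER_TEMPLATE.format(entry=fixed_width_length_entry(total))
```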

I think the most difficult part to achieve is that the elements related to page 1 and the catalog etc. should have the highest object numbers of all objects but still it should be a sequence of numbers.

Lucas-C commented 2 years ago

Just a quick note: I'm currently attempting to implement this, but it may take some weeks before completion, and will require some important code refactoring

Lucas-C commented 1 year ago

I merged a first PR ( #574 ) that introduces a fpdf/linearization.py module, with a LinearizedOutputProducer subclass that starts to implement the spec. I haven't implemented the hint tables & hint streams yet, but the PDF objects can now be serialized in the correct order in the output file.

Also, here is an example of a linearized PDF file: AlertBoxExamples.pdf @ acrobatusers.com (28KB). QPDF can be used on this file to display useful linearization info: qpdf --show-linearization AlertBoxExamples.pdf

This issue is up-for-grabs, as I currently do not have much time to dedicate to this.

Lucas-C commented 1 year ago

I also added a first unit test: test/test_linearization.py

Making this test pass will mean that this issue can be closed.

py-pdf / fpdf2

Produce linearized PDFs #62