Open Lucas-C opened 3 years ago
This could be checked using pikepdf or qpdf: qpdf --check-linearization / --show-linearization
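For anyone who wants to script this check, here is a minimal sketch that wraps the two qpdf flags mentioned above with the standard library. The helper names (qpdf_command, run_qpdf_linearization) are hypothetical and not part of fpdf2. The sketch deliberately returns the raw exit code and output instead of interpreting them, since the exact exit status qpdf uses for a non-linearized file is not something this thread establishes.

```python
import shutil
import subprocess


def qpdf_command(pdf_path, show=False):
    """Build the argument list for one of the two qpdf flags quoted in this issue."""
    flag = "--show-linearization" if show else "--check-linearization"
    return ["qpdf", flag, pdf_path]


def run_qpdf_linearization(pdf_path, show=False):
    """Run qpdf and return (exit code, combined output), or None if qpdf is not installed."""
    if shutil.which("qpdf") is None:
        return None
    result = subprocess.run(
        qpdf_command(pdf_path, show=show), capture_output=True, text=True
    )
    return result.returncode, result.stdout + result.stderr
```

Calling run_qpdf_linearization("output.pdf") on a generated file and printing the returned text should be enough to see what qpdf reports.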
I would like to work on this issue.
Great @chandan00761 !
How familiar are you with fpdf2 and Python development in general?
As a starting point, I would recommend that you take a look at the Development documentation page. Maybe start by getting the sources with git, installing them with pip install -e . and launching the unit tests with pytest.
If you have any questions (on the code, tests, how things work...), feel free to ping me! 😊
Thank you for replying so quickly. I have used Python mainly to develop some scripts (scraper, goods transportation report generation) and web servers using Django. However, I am new to open source. I have used fpdf2 to generate PDF reports in the goods transportation report generation script.
I am currently reading about linearized PDF and have set up a local development environment. However, when testing, I see that 3 cases fail.
Here is the summary:
and here are the full test logs: https://pastebin.com/Z7pa2h2G
Thank you for reporting this! I fixed those tests in https://github.com/PyFPDF/fpdf2/commit/f0e2a40. If you update your local repository copy (here is a guide to update your fork) the tests should now pass. You may also want to install qpdf in order to get more helpful error messages when tests fail.
Hi @chandan00761 !
Have you been able to move forward on this? 😊
@Lucas-C Sorry, I was busy with my semester exams. I am free now and looking into it. I have read the pdf spec file and will start the implementation.
In the linearization parameter dictionary, there is an entry for the length of the entire file in bytes. Does this include the size of the dictionary itself?
I don't know.
Maybe you could use PikePDF & qpdf to check this length value? cf. test_pdf.py
Have you been able to find an answer there @chandan00761? Are you still planning to work on this? If not, no worries, I'd just like to make it clear for other contributors that this feature is "up-for-grabs" 😊
There is a general methodology I have frequently used while adding features to fpdf2, which I would recommend adopting here:
use qpdf --qdf --compress-streams=n $in_file.pdf $out_file.pdf to produce a "pretty-formatted" PDF
I am still working on it. However, I haven't worked with PDF at the byte level, so it is taking a lot of time to understand some concepts.
Ok! Feel free to ask any questions here, I'd be happy to help by answering them if I can.
What is the use of _trace_size? Should I use it when placing my objects? Also, are the object identifiers of all indirect objects sequential? (Like starting from 2 and going to 3, 4, 5 ... without changing order?)
What is the use of _trace_size?
This internal method makes it possible to track the size of every section in the final PDF (images, fonts, pages...), when logging is configured.
Should I use it when placing my objects?
Only if you introduce a new top-level resource type.
Are the object identifiers of all indirect objects sequential?
If I understood your question correctly, then yes.
As it has been a few months now without any update, I guess this issue is up-for-grabs 😊
Anybody is welcome to give it a try!
I had a look at this feature, and implementing it will require some big refactoring.
Here is a naive starting point: a new method that should be called just after _putheader() in _enddoc(), because this PDF object must be inserted first in the document:
def _putlinearization(self):
    "Inserting the linearization parameter dictionary"
    self._newobj()
    self._out(pdf_dict({
        "/Linearized": 1.0,  # Version
        "/L": len(self.buffer),  # File length
        "/H": [ ? ],  # Primary hint stream offset and length (part 5)
        "/O": object_id_for_page(1),  # Object number of first page's page object (part 6)
        "/E": ?,  # Offset of end of first page
        "/N": self.pages_count,
        "/T": self.offsets[1],  # Offset of first entry in main cross-reference table (part 11)
    }))
    self._out("endobj")
As indicated by the code comments, several numbers must be known:
the file length (len(self.buffer) after having inserted the %%EOF)
the offset of the end of the first page (len(self.buffer) after inserting the first page in _putpages())
the offset of the first entry in the main cross-reference table
Knowing those values before the call to _putlinearization() will require some code overhaul.
One potential strategy could be to first insert a placeholder (made of % characters?) in the buffer at this stage, and then later, after inserting the %%EOF in the buffer, replace this placeholder with the real linearization parameter dictionary.
This is the strategy currently used for document signing: https://github.com/PyFPDF/fpdf2/blob/master/fpdf/sign.py#L24
One specific point of the PDF spec would help if we adopt this approach:
The linearization parameter dictionary shall be entirely contained within the first 1024 bytes of the PDF file.
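The placeholder idea above can be sketched as follows. This is only an illustration on a standalone bytearray, not fpdf2's actual buffer handling: the function names, the 200-byte placeholder size and the reduced dictionary content are all assumptions made for the example. The key point is that the replacement is padded to exactly the reserved size, so no byte offset after it moves.

```python
PLACEHOLDER_SIZE = 200  # assumed upper bound for the serialized dictionary


def reserve_placeholder(buffer, size=PLACEHOLDER_SIZE):
    """Append `size` '%' bytes to the buffer and return the offset where they start."""
    start = len(buffer)
    buffer += b"%" * size
    return start


def fill_placeholder(buffer, start, content, size=PLACEHOLDER_SIZE):
    """Overwrite the placeholder in place, padding with spaces to keep the length unchanged."""
    if len(content) > size:
        raise ValueError("linearization dictionary does not fit in the placeholder")
    if start + size > 1024:
        raise ValueError("dictionary must lie within the first 1024 bytes of the file")
    buffer[start:start + size] = content.ljust(size, b" ")


buffer = bytearray(b"%PDF-1.7\n")
start = reserve_placeholder(buffer)
buffer += b"\n... all other objects ...\n%%EOF\n"

total_length = len(buffer)  # only known now, after the %%EOF
# Simplified dictionary: a real one would also carry /H, /O, /E, /N and /T
lin_dict = b"1 0 obj\n<< /Linearized 1 /L %d >>\nendobj" % total_length
fill_placeholder(buffer, start, lin_dict)
```

Because the spec requires the dictionary to sit within the first 1024 bytes, the sketch rejects a placeholder that would extend past that limit.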
But the most challenging part will probably be to change the order in which the PDF objects are rendered by fpdf2 in _enddoc(), to conform to the order required for linearized PDF documents:
Among other things, this will have some impact on util.object_id_for_page() and all the parts of the code that rely on this utility function.
Part 7 adds "Each successive page followed by its nonshared objects". If I understand this correctly, that means if I embed a file on page 1 and on page 10,000 (for example link to it on page 10,000 via FileAttachementAnnotation and to the same object number from page 1), the object is shared. If I only link to it once on page 1, it is nonshared. But if it's nonshared, it should follow immediately in that memory region. If it's shared, it should go at the end (the assumption is probably that shared objects are not interesting and unique objects are interesting for a reader with a slow internet connection). If this is correct, this would be difficult to implement in a single pass.
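To make the shared / nonshared distinction concrete, here is a rough sketch of the extra pass it seems to require: first collect which pages reference each object, then split the objects. The function name and the input format (a mapping from page number to referenced object ids) are hypothetical, and the sketch ignores any special treatment the spec gives to the first page.

```python
from collections import defaultdict


def classify_objects(page_refs):
    """Split object ids into per-page nonshared lists and a single shared list.

    `page_refs` maps a page number to the set of object ids that page references.
    """
    pages_using = defaultdict(set)
    for page, obj_ids in page_refs.items():
        for obj_id in obj_ids:
            pages_using[obj_id].add(page)

    nonshared = {page: [] for page in page_refs}
    shared = []
    for obj_id in sorted(pages_using):
        pages = pages_using[obj_id]
        if len(pages) == 1:
            # Referenced from exactly one page: it would be written right after that page
            nonshared[next(iter(pages))].append(obj_id)
        else:
            # Referenced from several pages: it would go with the shared objects
            shared.append(obj_id)
    return nonshared, shared


# Hypothetical document: object 7 (an embedded file) is referenced from pages 1 and 3
page_refs = {1: {5, 7}, 2: {6}, 3: {7, 8}}
nonshared, shared = classify_objects(page_refs)
```

The classification can only be done once every page's references are known, which is why a single-pass implementation looks difficult.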
Regarding the problem with the file size, I think the solution was to look at the xref table: it only allows byte offsets of up to 10 digits (I think this was the number, not sure anymore). That means that the file size can also have 10 digits at most. The unneeded digits can just be spaces.
Using this, we can probably calculate len(self.buffer) + len(lin_header_with_fixed_size_10_digits) and write this number in the header without changing the final size.
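A small sketch of that calculation, under the assumption that the /L value is always written in a fixed 10-character field padded with spaces. The function names and the reduced dictionary are made up for the example. Since the header size no longer depends on the value it contains, the total can be computed before the header is written.

```python
FIELD_WIDTH = 10  # same width as the byte offsets in a cross-reference table entry


def render_lin_header(file_length):
    """Serialize a simplified dictionary whose /L value always occupies FIELD_WIDTH bytes."""
    value = str(file_length).ljust(FIELD_WIDTH)  # the padding is plain whitespace
    if len(value) > FIELD_WIDTH:
        raise ValueError("file length needs more than 10 digits")
    return f"<< /Linearized 1 /L {value} >>\n".encode("ascii")


def total_file_length(body):
    """The header size is constant, so the final length is the body size plus that constant."""
    header_size = len(render_lin_header(0))
    return len(body) + header_size


body = b"... every other byte of the file ...\n%%EOF\n"
length = total_file_length(body)
output = render_lin_header(length) + body
```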
I think the most difficult part to achieve is that the elements related to page 1 and the catalog etc. should have the highest object numbers of all objects but still it should be a sequence of numbers.
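The numbering constraint can be illustrated with a toy example. It assumes, as suggested in the comment above, that the first-page objects are written first in the file but receive the highest object numbers, while the numbers as a whole still form one unbroken sequence. The object names are hypothetical labels, not fpdf2 internals.

```python
def assign_object_numbers(first_page_objects, other_objects):
    """Number the remaining objects first, then the first-page objects.

    Returns the name -> object number mapping and the order in which
    the objects would be written to the file.
    """
    numbering = {}
    for number, name in enumerate(other_objects + first_page_objects, start=1):
        numbering[name] = number
    # First-page objects come first in the file despite their higher numbers
    write_order = first_page_objects + other_objects
    return numbering, write_order


first_page = ["catalog", "page1", "page1_content"]
others = ["page2", "page2_content", "font"]
numbering, write_order = assign_object_numbers(first_page, others)
```

Here the catalog gets number 4 but is the first object written, so file order and object number order differ.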
Just a quick note: I'm currently attempting to implement this, but it may take some weeks before completion, and will require some significant code refactoring.
I merged a first PR ( #574 ) that initiates a fpdf/linearization.py module, with a LinearizedOutputProducer subclass that starts to implement the spec. I haven't implemented the hint tables & hint streams yet, but the PDF objects can now be serialized in the correct order in the output file.
Also, there is an example of linearized PDF file: AlertBoxExamples.pdf @ acrobatusers.com (28KB)
QPDF can be used on this file to display useful linearization info: qpdf --show-linearization AlertBoxExamples.pdf
This issue is up-for-grabs, as I currently do not have much time to dedicate to this.
I also added a first unit test: test/test_linearization.py
Making this test pass will mean that this issue can be closed.
The scope of this feature is to add support to fpdf2 to produce linearized PDFs.
Appendix F of the PDF 1.7 spec should be helpful in implementing this.
qpdf --check-linearization / --show-linearization can also be used to ensure the generated PDFs are valid.
By implementing this feature you, as a benevolent FLOSS developer, will give the large community of fpdf2 users access to a standard and useful PDF functionality. Moreover, by working on this feature, you will learn about PDF syntax and the lifecycle & structure of a popular Python library. You will also be added to the contributors list & map.
As a contributor you will be able to design and expose this feature as you want in the library.
Implementing this can count as part of hacktoberfest