mfenniak / pyPdf

Pure-Python PDF Library; this repository is no longer maintained, please see https://github.com/knowah/PyPDF2/ instead.

A much faster mergePage function #27

Open Averell7 opened 13 years ago

Averell7 commented 13 years ago

The mergePage function is slow. Needing more speed, I have written a modified version, mergePage3, which is much faster when you merge pages from the same file (up to 200x faster) and also faster when you merge pages from different files. The basic idea: mergePage uses the ContentStream class to get the content of a page, but this class always runs the parseContentStream function, even when that is not needed, and this function is time-consuming. mergePage3 parses the content only when really needed. The result:

On a 55-page test file, putting two pages on a sheet (booklet) takes 34 seconds with mergePage and 0.4 seconds with mergePage3, roughly 85x faster. (I consider here only the time needed for mergePage, not the generation of the output file.)

If you are interested, I can share the code.
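To illustrate the "parse only when really needed" idea described above, here is a minimal sketch. It is not the actual mergePage3 code and does not use pyPdf: the `LazyContentStream` class, the `merge_raw` helper, and the toy line-based tokenizer are all hypothetical stand-ins. The sketch assumes that two page streams can be combined at the byte level by wrapping each in the PDF `q`/`Q` (save/restore graphics state) operators, so the expensive parse can be skipped during the merge.

```python
class LazyContentStream:
    """Hypothetical stand-in for a page content stream (not pyPdf's class)."""

    parse_calls = 0  # counts how often the expensive parse actually runs

    def __init__(self, raw):
        self._raw = raw
        self._operations = None  # filled in on first access only

    @property
    def raw(self):
        """Raw bytes of the stream; reading them never triggers a parse."""
        return self._raw

    @property
    def operations(self):
        """Parsed operations, computed once and cached."""
        if self._operations is None:
            self._operations = self._parse(self._raw)
        return self._operations

    @classmethod
    def _parse(cls, raw):
        cls.parse_calls += 1
        # Toy tokenizer (assumption): one operation per non-empty line.
        return [line.split() for line in raw.splitlines() if line.strip()]


def merge_raw(base, overlay):
    """Combine two streams without parsing either one.

    Each stream is wrapped in q ... Q so that graphics-state changes
    in one do not leak into the other.
    """
    merged = b"q\n" + base.raw + b"\nQ\nq\n" + overlay.raw + b"\nQ\n"
    return LazyContentStream(merged)


if __name__ == "__main__":
    page_a = LazyContentStream(b"BT\n/F1 12 Tf\nET")
    page_b = LazyContentStream(b"0 0 m\n10 10 l\nS")
    sheet = merge_raw(page_a, page_b)
    print("parses after merge:", LazyContentStream.parse_calls)
    sheet.operations  # first access triggers the single parse
    print("parses after access:", LazyContentStream.parse_calls)
```

In this sketch the merge itself never parses anything; the cost is paid only if some later step asks for the parsed operations.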

Averell7 commented 13 years ago

I am new to GitHub and don't know the best way to propose my version on this site. Thanks for any advice. Since a fork has been created and I don't know how to delete it, I can post my version here.

Averell7 commented 13 years ago

I posted all the code in the Averell7 fork. Since it is fully compatible with the present code, let us hope it will one day be integrated into pyPdf.

whitemice commented 11 years ago

Will this patch get mainlined?

vnakk commented 11 years ago

Hi

I have tried this patch. Indeed, the merging step is much faster than before. But I have the impression that the time for the file-saving step has increased more (x10) than the time for the merging step has decreased (/3). What do you think?

Thank you
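One way to check an impression like this is to time the merge and save phases separately and compare the totals. The following is only a sketch: `merge_pages` and `save_output` are hypothetical placeholders that work on plain byte strings, and would need to be replaced by the real merge and write calls being compared.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Add the elapsed wall-clock time of the block to results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[label] = results.get(label, 0.0) + elapsed


def merge_pages(pages):
    """Hypothetical placeholder: put two consecutive pages on one sheet."""
    return [a + b for a, b in zip(pages[0::2], pages[1::2])]


def save_output(sheets):
    """Hypothetical placeholder: serialize all sheets to one byte string."""
    return b"".join(sheets)


def run_benchmark(pages):
    """Time both phases separately and report the combined total."""
    results = {}
    with timed("merge", results):
        sheets = merge_pages(pages)
    with timed("save", results):
        save_output(sheets)
    results["total"] = results["merge"] + results["save"]
    return results


if __name__ == "__main__":
    for phase, seconds in run_benchmark([b"page"] * 55).items():
        print(f"{phase}: {seconds:.6f} s")
```

Comparing the "total" figure for both versions shows whether a slower save outweighs a faster merge.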

Averell7 commented 8 years ago

Hi vnakk, sorry, I was unaware of your answer. I don't really see what you mean. The next version of PdfBooklet (already on GitHub but not yet released as I write) has two options: fast (with my mergePage3 function) and slow (with the standard mergePage). This was implemented because we were informed of a case where some artifacts appeared with mergePage3 that were not present with mergePage. This is the only known case (with 1500 monthly downloads of PdfBooklet).

Source file: full text, 520 pages, A4. Output: booklet, A3, 260 pages.

mergePage (slow mode):

mergePage3 (fast mode):

With a more sophisticated page (graphics and so on), the gain is 10x for creating the PDF and 2x for saving. So even saving the file is faster. (I don't know why, but that is how it is.)

Note that PdfBooklet is in beta and does not work with all PDF files in slow mode.