py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/
Other
8.41k stars 1.42k forks source link

Performance: why is PdfWriter(clone_from=...) much slower than PdfWriter().append_pages_from_reader(PdfReader(...)) ? #2136

Open Lucas-C opened 1 year ago

Lucas-C commented 1 year ago

Context: I'm trying to replace my uses of pdfrw by pypdf in various scripts that I have, and pypdf is still revealing to be quite slower...

In some case, I'm witnessing a factor x5 speed difference between using PdfWriter(clone_from=...) and PdfWriter().append_pages_from_reader(PdfReader(...))

Environment

$ python -m platform
Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.29

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.4, crypt_provider=('cryptography', '39.0.2'), PIL=9.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfWriter
PdfWriter(clone_from="2020-LaCaravelle_LaMortPrendDesVacances.pdf")

Versus:

from pypdf import PdfReader, PdfWriter
writer = PdfWriter()
writer.append_pages_from_reader(PdfReader("2020-LaCaravelle_LaMortPrendDesVacances.pdf"))

Source PDF file: https://chezsoi.org/lucas/2020-LaCaravelle_LaMortPrendDesVacances.pdf (3.9Mib - 148 pages)

Execution times

$ time ./repro_writer_fast.py $jdr/2020-LaCaravelle_LaMortPrendDesVacances.pdf
real    0m0,819s

$ time ./repro_writer_slow.py $jdr/2020-LaCaravelle_LaMortPrendDesVacances.pdf
real    0m4,551s
pubpub-zz commented 1 year ago

when you use append you are only copying part of the document : if you look at the number objects created you will get 5851 vs 1385 (the missing ones are mainly in /StructTreeRoot. This means also you are extracting more objects from the original files : You can see the difference if you are preloading the PdfReader and run the function more than once. the following code gives the following results:

import pypdf, time
r = pypdf.PdfReader("2020-LaCaravelle_LaMortPrendDesVacances.pdf")
t0=time.time();w=pypdf.PdfWriter(clone_from=r);time.time()-t0
#returns about 6s
t0=time.time();w=pypdf.PdfWriter(clone_from=r);time.time()-t0
#now returns about 2.7s
del r.trailer["/Root"]["/StructTreeRoot"]
t0=time.time();w=pypdf.PdfWriter(clone_from=r);time.time()-t0
# now returns about .3 sec

For me the results are logical

Lucas-C commented 1 year ago

Hi @pubpub-zz Thank you for your answer.

I think that clone_from has been introduced in March this year, in version 3.7.0: https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html#version-3-7-0-2023-03-26

It is used in one example in the docs (in: Stamp (Overlay) / Watermark (Underlay)), but it does not seem to have any docstring: https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html

The documentation for PdfWriter.append_pages_from_reader() is there: https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html#pypdf.PdfWriter.append_pages_from_reader

That's totally OK if that seems "logical" to you, but I have to admit that as a user of pypdf (I'm really not an expert at this point), it is still unclear to me WHY there is this difference of behaviour between both usages.

Is PdfWriter(clone_from=) recommended for some use cases? I just thought that it was somehow a shorthand for writer = PdfWriter(); writer.append_pages_from_reader(PdfReader(...)) πŸ˜…

Maybe some explanations could be added to pypdf documentation in order to clarify the recommended use cases for those methods? At least maybe it should mention which objects are copied in one case and not the other?

The Zen of Python states:

There should be one-- and preferably only one --obvious way to do it.

I think it somehow applies in this case 😊

Lucas-C commented 1 year ago

( Note for maintainers: I added some issue labels that I thought made sense, but feel free to correct me if they were incorrect πŸ™‚ )

pubpub-zz commented 1 year ago

@Lucas-C

I agree some details should be added to the documentation to provides more details on the copied informations.

Thank you for your answer.

I think that clone_from has been introduced in March this year, in version 3.7.0: https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html#version-3-7-0-2023-03-26

the clone_from parameter is only a shortcut to build a PdfWriter and call clone_document_from_reader() in the single step, so clone_from is not so newπŸ˜‰

It is used in one example in the docs (in: Stamp (Overlay) / Watermark (Underlay)), but it does not seem to have any docstring: https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html

I agree the constructor parameters are not properly details and should be added to the documentation. some information is provided in https://pypdf.readthedocs.io/en/stable/user/merging-pdfs.html?highlight=cloning but is not sufficient.

The documentation for PdfWriter.append_pages_from_reader() is there: https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html#pypdf.PdfWriter.append_pages_from_reader

I've never used/noticed append_pages_from_reader() but looking at its code it is just a et of calls to add_pages(), it should be tagged as obsolete/not recommended and append() should be used instead. the reason is that when you copy a page, you need all referenced (through IndirectObject) are copied. The point is if you take for example a document, when you add one page, you will copy the articles objects within which are linked to articles in other pages, so you copy the other pages (just the objects but they are not pointed in the pages tree so not displayed) we've had some issues about this phenomenon.

That's totally OK if that seems "logical" to you, but I have to admit that as a user of pypdf (I'm really not an expert at this point), it is still unclear to me WHY there is this difference of behaviour between both usages.

when you use append() you are only copying/adding objects that you've asked for. During this call, In the pdf structures, part for simplification (eg. logical structure), part because they do not belong to pages (eg. javascript) are not copied.

When you used clone_document_from_reader() you are copying all referenced objects. therefore you are copying all the objects listed above.

Is PdfWriter(clone_from=) recommended for some you use cases? I just thought that it was somehow a shorthand for writer = PdfWriter(); writer.append_pages_from_reader(PdfReader(...)) πŸ˜…

It depends what you are looking for and what you want to discard. I would say if you want objects not linked with page or advanced feature use clone_from=

Maybe some explanations could be added to pypdf documentation in order to clarify the recommended use cases for those methods? At least maybe it should mention which objects are copied in one case and not the other?

Agree, but a "pdf for dev dummies" would be great...

The Zen of Python states:

There should be one-- and preferably only one --obvious way to do it.

I think it somehow applies in this case 😊

I knew another version: there is 2 ways to do things... the bad way... and mine (please no reactions it is a just a (bad) joke 😁😁😁😁)