Open Lucas-C opened 1 year ago
when you use append you are only copying part of the document : if you look at the number objects created you will get 5851 vs 1385 (the missing ones are mainly in /StructTreeRoot. This means also you are extracting more objects from the original files : You can see the difference if you are preloading the PdfReader and run the function more than once. the following code gives the following results:
import pypdf, time
r = pypdf.PdfReader("2020-LaCaravelle_LaMortPrendDesVacances.pdf")
t0=time.time();w=pypdf.PdfWriter(clone_from=r);time.time()-t0
#returns about 6s
t0=time.time();w=pypdf.PdfWriter(clone_from=r);time.time()-t0
#now returns about 2.7s
del r.trailer["/Root"]["/StructTreeRoot"]
t0=time.time();w=pypdf.PdfWriter(clone_from=r);time.time()-t0
# now returns about .3 sec
For me the results are logical
Hi @pubpub-zz Thank you for your answer.
I think that clone_from
has been introduced in March this year, in version 3.7.0:
https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html#version-3-7-0-2023-03-26
It is used in one example in the docs (in: Stamp (Overlay) / Watermark (Underlay)), but it does not seem to have any docstring: https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html
The documentation for PdfWriter.append_pages_from_reader()
is there:
https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html#pypdf.PdfWriter.append_pages_from_reader
That's totally OK if that seems "logical" to you,
but I have to admit that as a user of pypdf
(I'm really not an expert at this point),
it is still unclear to me WHY there is this difference of behaviour between both usages.
Is PdfWriter(clone_from=)
recommended for some use cases?
I just thought that it was somehow a shorthand for writer = PdfWriter(); writer.append_pages_from_reader(PdfReader(...))
π
Maybe some explanations could be added to pypdf
documentation in order to clarify the recommended use cases for those methods?
At least maybe it should mention which objects are copied in one case and not the other?
The Zen of Python states:
There should be one-- and preferably only one --obvious way to do it.
I think it somehow applies in this case π
( Note for maintainers: I added some issue labels that I thought made sense, but feel free to correct me if they were incorrect π )
@Lucas-C
I agree some details should be added to the documentation to provides more details on the copied informations.
Thank you for your answer.
I think that
clone_from
has been introduced in March this year, in version 3.7.0: https://pypdf.readthedocs.io/en/latest/meta/CHANGELOG.html#version-3-7-0-2023-03-26
the clone_from
parameter is only a shortcut to build a PdfWriter and call clone_document_from_reader()
in the single step, so clone_from is not so newπ
It is used in one example in the docs (in: Stamp (Overlay) / Watermark (Underlay)), but it does not seem to have any docstring: https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html
I agree the constructor parameters are not properly details and should be added to the documentation. some information is provided in https://pypdf.readthedocs.io/en/stable/user/merging-pdfs.html?highlight=cloning but is not sufficient.
The documentation for
PdfWriter.append_pages_from_reader()
is there: https://pypdf.readthedocs.io/en/stable/modules/PdfWriter.html#pypdf.PdfWriter.append_pages_from_reader
I've never used/noticed append_pages_from_reader()
but looking at its code it is just a et of calls to add_pages()
, it should be tagged as obsolete/not recommended and append()
should be used instead. the reason is that when you copy a page, you need all referenced (through IndirectObject) are copied. The point is if you take for example a document, when you add one page, you will copy the articles objects within which are linked to articles in other pages, so you copy the other pages (just the objects but they are not pointed in the pages tree so not displayed) we've had some issues about this phenomenon.
That's totally OK if that seems "logical" to you, but I have to admit that as a user of
pypdf
(I'm really not an expert at this point), it is still unclear to me WHY there is this difference of behaviour between both usages.
when you use append()
you are only copying/adding objects that you've asked for.
During this call, In the pdf structures, part for simplification (eg. logical structure), part because they do not belong to pages (eg. javascript) are not copied.
When you used clone_document_from_reader() you are copying all referenced objects. therefore you are copying all the objects listed above.
Is
PdfWriter(clone_from=)
recommended for some you use cases? I just thought that it was somehow a shorthand forwriter = PdfWriter(); writer.append_pages_from_reader(PdfReader(...))
π
It depends what you are looking for and what you want to discard. I would say if you want objects not linked with page or advanced feature use clone_from=
Maybe some explanations could be added to
pypdf
documentation in order to clarify the recommended use cases for those methods? At least maybe it should mention which objects are copied in one case and not the other?
Agree, but a "pdf for dev dummies" would be great...
The Zen of Python states:
There should be one-- and preferably only one --obvious way to do it.
I think it somehow applies in this case π
I knew another version: there is 2 ways to do things... the bad way... and mine (please no reactions it is a just a (bad) joke ππππ)
Context: I'm trying to replace my uses of
pdfrw
bypypdf
in various scripts that I have, andpypdf
is still revealing to be quite slower...In some case, I'm witnessing a factor x5 speed difference between using
PdfWriter(clone_from=...)
andPdfWriter().append_pages_from_reader(PdfReader(...))
Environment
Code + PDF
This is a minimal, complete example that shows the issue:
Versus:
Source PDF file: https://chezsoi.org/lucas/2020-LaCaravelle_LaMortPrendDesVacances.pdf (3.9Mib - 148 pages)
Execution times