Open wandererfrog opened 4 years ago
It looks like the bottleneck is here: https://github.com/scanny/python-pptx/blob/master/pptx/opc/package.py#L105-L117 where the part-name for the new chart (and slide) is being generated.
This could be optimized by caching the last part number assigned and simply incrementing it. It would probably need to be a dictionary like: {"slide": 27, "chart": 18, ...}
because each part-types has its own sequence, i.e. there can be both a slide42.xml
and a chart42.xml
but not two chart15.xml
parts. This cache could be stored as an instance variable of the Package object which I believe is a singleton for each presentation.
Quicker and dirtier might be to just assign a GUID instead of the number. That wouldn't involve cacheing. It would make your partnames pretty ugly but those only appear to a developer doing diagnostics for the most part. I believe the only actual constraint is that they be unique.
You're probably going to need to patch a fork and use that if you're in any kind of a rush.
Sometimes with these sorts of things inserting backwards works. i.e. from last slide to first, rather than the other way around.
Does it in this case?
Cheers, Martin
Martin Packer
Systems Investigator & Performance Troubleshooter, IBM
+44-7802-245-584
email: martin_packer@uk.ibm.com
Twitter / Facebook IDs: MartinPacker
Blog: https://mainframeperformancetopics.com
Mainframe, Performance, Topics Podcast Series (With Marna Walle): https://anchor.fm/marna-walle
Youtube channel: https://www.youtube.com/channel/UCu_65HaYgksbF6Q8SQ4oOvA
From: Steve Canny notifications@github.com To: scanny/python-pptx python-pptx@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Date: 01/09/2020 19:31 Subject: [EXTERNAL] Re: [scanny/python-pptx] Poor performance when creating a big presentation (#644)
It looks like the bottleneck is here: https://github.com/scanny/python-pptx/blob/master/pptx/opc/package.py#L105-L117 where the part-name for the new chart (and slide) is being generated. This could be optimized by caching the last part number assigned and simply incrementing it. It would probably need to be a dictionary like: {"slide": 27, "chart": 18, ...} because each part-types has its own sequence, i.e. there can be both a slide42.xml and a chart42.xml but not two chart15.xml parts. This cache could be stored as an instance variable of the Package object which I believe is a singleton for each presentation. Quicker and dirtier might be to just assign a GUID instead of the number. That wouldn't involve cacheing. It would make your partnames pretty ugly but those only appear to a developer doing diagnostics for the most part. I believe the only actual constraint is that they be unique. You're probably going to need to patch a fork and use that if you're in any kind of a rush. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.
Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
I don't think so @MartinPacker, this appears to be a clear O(N^2) problem to me. It's usually not a problem (never reported before) because your typical PPTX is well less than 100 slides.
The original code recalculates the next available part "number" each time, which is the most robust way to do it since in general the only reliable state is the contents of the underlying XML. Proxy objects (that manipulate that XML) come and go and in general are not stateful.
The exception to the coming and going characteristic is the Package
object which is the same the entire time the PPTX file is in memory. So I'd consider it safe enough to cache the next available partname there where it can simply be incremented each time.
We have an application creating large pptx's with over 1000 slides and we are using python-pptx library.
The problem we have is that, as the Presentation grows it becomes slower to add Elements and/or charts to it.
I wonder if anyone has come across this situation before? It seems that pptx-python has to lookup inside the Presentation thus making it slower per iteration. Or is it the way we are using python to loop and load the variables into memory?